Status: New
Owner: ----

New issue 1113 by [email protected]: ganeti-luxid fills the log storage with millions lines like "INFO Successfully handled WaitForJobChange"
https://code.google.com/p/ganeti/issues/detail?id=1113

Hi,

Our storage log partition got filled this night with /var/log/luxi-daemon.log growing to several gigabytes withing a few minutes:

2015-06-29 01:20:06,651512000000 CEST: ganeti-luxid pid=34084 INFO Scheduler notified of change of job 707033 2015-06-29 01:20:06,651603000000 CEST: ganeti-luxid pid=34084 INFO Rereading job 707033 2015-06-29 01:20:06,753828000000 CEST: ganeti-luxid pid=34084 INFO Scheduler notified of change of job 707033 2015-06-29 01:20:06,753968000000 CEST: ganeti-luxid pid=34084 INFO Rereading job 707033 2015-06-29 01:20:06,859385000000 CEST: ganeti-luxid pid=34084 INFO Scheduler notified of change of job 707033 2015-06-29 01:20:06,859552000000 CEST: ganeti-luxid pid=34084 INFO Rereading job 707033 2015-06-29 01:20:08,159165000000 CEST: ganeti-luxid pid=34084 INFO Scheduler notified of change of job 707033 2015-06-29 01:20:08,159408000000 CEST: ganeti-luxid pid=34084 INFO Rereading job 707033 2015-06-29 01:20:08,159843000000 CEST: ganeti-luxid pid=34084 INFO Finished jobs: (707033,JOB_STATUS_SUCCESS)

and then it looped over:

2015-06-29 01:20:27,325809000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,326939000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,327972000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,328843000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,329800000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,330766000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,331682000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,332570000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,333451000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,334242000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,335094000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,336053000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,336965000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,337754000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,338597000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,339507000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,340289000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,341122000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,341967000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,342745000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,343578000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,344423000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,345192000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,346008000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,346811000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,347679000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,348420000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,349222000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,350065000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,350800000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,351627000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,352408000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,353123000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange 2015-06-29 01:20:27,353881000000 CEST: ganeti-luxid pid=34084 INFO Successfully handled WaitForJobChange
(...)

It looped very fast as you can see for the log timestamp and we had a total of 23 millions lines like this in the last 24 hours:

# grep -Fc 'INFO Successfully handled WaitForJobChange' luxi-daemon.log.1
23646165

It only stopped when the partition had no free storage left.

I've found a probable relevant part of the code (from 2.12.4):

def WaitForJobChange(self, job_id, fields, prev_job_info, prev_log_serial):
    job_id = Client._PrepareJobId(REQ_WAIT_FOR_JOB_CHANGE, job_id)
    while True:
      result = self.WaitForJobChangeOnce(job_id, fields,
                                         prev_job_info, prev_log_serial)
      if result != constants.JOB_NOTCHANGED:
        break
    return result

I don't know the internals but 'while True' looks dangerous to me. Maybe we need a sleep() inside the 'while True' loop to avoid a race condition?

I'm not quite sure yet but I think this got triggered because one of the nodes was running out of storage space before the 'cluster' got filled itself too.

More information about the cluster:

We are running Ganeti 2.11.6-1~bpo70+1 on Debian GNU/Linux 7.8 (wheezy) amd64.

# gnt-cluster --version
gnt-cluster (ganeti v2.11.6) 2.11.6

# gnt-cluster version
Software version: 2.11.6
Internode protocol: 2110000
Configuration format: 2110000
OS api version: 20
Export interface: 0
VCS version: (ganeti) version v2.11.6

# hspace --version
hspace (ganeti) version v2.11.6
compiled with ghc 7.4
running on linux x86_64

Please let me know if you need more information.

--
You received this message because this project is configured to send all issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

Reply via email to