Re: [ceph-users] large memory leak on scrubbing
Hi,

> Is that the only slow request message you see?

No.
Full log: https://www.dropbox.com/s/i3ep5dcimndwvj1/slow_requests.txt.tar.gz

It starts with:

2013-08-16 09:43:39.662878 mon.0 10.174.81.132:6788/0 4276384 : [DBG] osd.4 10.174.81.131:6805/31460 reported failed by osd.50 10.174.81.135:6842/26019
2013-08-16 09:43:40.711911 mon.0 10.174.81.132:6788/0 4276386 : [DBG] osd.4 10.174.81.131:6805/31460 reported failed by osd.14 10.174.81.132:6836/2958
2013-08-16 09:43:41.043016 mon.0 10.174.81.132:6788/0 4276388 : [DBG] osd.4 10.174.81.131:6805/31460 reported failed by osd.13 10.174.81.132:6830/2482
2013-08-16 09:43:41.043047 mon.0 10.174.81.132:6788/0 4276389 : [INF] osd.4 10.174.81.131:6805/31460 failed (3 reports from 3 peers after 2013-08-16 09:43:56.042983 >= grace 20.00)
2013-08-16 09:43:41.122326 mon.0 10.174.81.132:6788/0 4276390 : [INF] osdmap e10294: 144 osds: 143 up, 143 in
2013-08-16 09:43:38.798833 osd.4 10.174.81.131:6805/31460 913 : [WRN] 6 slow requests, 6 included below; oldest blocked for > 30.190146 secs
2013-08-16 09:43:38.798843 osd.4 10.174.81.131:6805/31460 914 : [WRN] slow request 30.190146 seconds old, received at 2013-08-16 09:43:08.585504: osd_op(client.22301645.0:48987 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798854 osd.4 10.174.81.131:6805/31460 915 : [WRN] slow request 30.189643 seconds old, received at 2013-08-16 09:43:08.586007: osd_op(client.22301855.0:49374 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798859 osd.4 10.174.81.131:6805/31460 916 : [WRN] slow request 30.188236 seconds old, received at 2013-08-16 09:43:08.587414: osd_op(client.22307596.0:47674 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798862 osd.4 10.174.81.131:6805/31460 917 : [WRN] slow request 30.187853 seconds old, received at 2013-08-16 09:43:08.587797: osd_op(client.22303894.0:51846 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133]
...
2013-08-16 09:44:18.126318 mon.0 10.174.81.132:6788/0 4276427 : [INF] osd.4 10.174.81.131:6805/31460 boot
...
2013-08-16 09:44:23.215918 mon.0 10.174.81.132:6788/0 4276437 : [DBG] osd.25 10.174.81.133:6810/2961 reported failed by osd.83 10.174.81.137:6837/27963
2013-08-16 09:44:23.704769 mon.0 10.174.81.132:6788/0 4276438 : [INF] pgmap v17035051: 32424 pgs: 1 stale+active+clean+scrubbing+deep, 2 active, 31965 active+clean, 7 stale+active+clean, 29 peering, 415 active+degraded, 5 active+clean+scrubbing; 6630 GB data, 21420 GB used, 371 TB / 392 TB avail; 246065/61089697 degraded (0.403%)
2013-08-16 09:44:23.711244 mon.0 10.174.81.132:6788/0 4276439 : [DBG] osd.133 10.174.81.142:6803/21366 reported failed by osd.26 10.174.81.133:6814/3674
2013-08-16 09:44:23.713597 mon.0 10.174.81.132:6788/0 4276440 : [DBG] osd.133 10.174.81.142:6803/21366 reported failed by osd.17 10.174.81.132:6806/9188
2013-08-16 09:44:23.753952 mon.0 10.174.81.132:6788/0 4276441 : [DBG] osd.133 10.174.81.142:6803/21366 reported failed by osd.27 10.174.81.133:6822/5389
2013-08-16 09:44:23.753982 mon.0 10.174.81.132:6788/0 4276442 : [INF] osd.133 10.174.81.142:6803/21366 failed (3 reports from 3 peers after 2013-08-16 09:44:38.753913 >= grace 20.00)
2013-08-16 09:47:10.229099 mon.0 10.174.81.132:6788/0 4276646 : [INF] pgmap v17035216: 32424 pgs: 32424 active+clean; 6630 GB data, 21420 GB used, 371 TB / 392 TB avail; 0B/s rd, 622KB/s wr, 85op/s

Why are OSDs 'reported failed' during scrubbing?

--
Regards
Dominik
Re: [ceph-users] large memory leak on scrubbing
On Mon, 19 Aug 2013, Mostowiec Dominik wrote:
> Thanks for your response.
> Great.
>
> Is it also fixed in the latest cuttlefish?
>
> We have two problems with scrubbing:
> - memory leaks
> - slow requests, and OSDs holding the bucket index wrongly marked down
>   (when scrubbing)

The slow requests can trigger if you have very large objects (including a
very large rgw bucket index object).  But the message you quote below is
for a scrub-reserve operation, which should really be excluded from the
op warnings entirely.  Is that the only slow request message you see?

> Now we have decided to turn off scrubbing and trigger it in a maintenance
> window.
> I noticed that "ceph osd scrub" or "ceph osd deep-scrub" triggers a scrub
> on the OSD, but not for all of its PGs.
> Is it possible to trigger scrubbing of all PGs on one OSD?

It should trigger a scrub on all PGs that are clean.  If a PG is
recovering it will be skipped.

sage

> --
> Regards
> Dominik
>
> -----Original Message-----
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: Saturday, August 17, 2013 5:11 PM
> To: Mostowiec Dominik
> Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Studziński
> Krzysztof; Sydor Bohdan
> Subject: Re: [ceph-users] large memory leak on scrubbing
>
> Hi Dominik,
>
> There is a bug fixed a couple of months back that fixes excessive memory
> consumption during scrub.  You can upgrade to the latest 'bobtail' branch.
> See
>
> http://ceph.com/docs/master/install/debian/#development-testing-packages
>
> Installing that package should clear this up.
>
> sage
>
> On Fri, 16 Aug 2013, Mostowiec Dominik wrote:
> > Hi,
> > We noticed some issues on our CEPH/S3 cluster; I think they are related
> > to scrubbing: large memory leaks.
> >
> > Logs 09.xx:
> > https://www.dropbox.com/s/4z1fzg239j43igs/ceph-osd.4.log_09xx.tar.gz
> > From 09.30 to 09.44 (14 minutes) the osd.4 process grows to 28G.
> >
> > I think this is something curious:
> > 2013-08-16 09:43:48.801331 7f6570d2e700 0 log [WRN] : slow request
> > 32.794125 seconds old, received at 2013-08-16 09:43:16.007104:
> > osd_sub_op(unknown.0.0:0 16.113d 0//0//-1 [scrub-reserve] v 0'0
> > snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
> >
> > We have a large rgw index and a lot of large files on this cluster.
> > ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
> > Setup:
> > - 12 servers x 12 OSDs
> > - 3 mons
> > Default scrubbing settings.
> > Journal and filestore settings:
> > journal aio = true
> > filestore flush min = 0
> > filestore flusher = false
> > filestore fiemap = false
> > filestore op threads = 4
> > filestore queue max ops = 4096
> > filestore queue max bytes = 10485760
> > filestore queue committing max bytes = 10485760
> > journal max write bytes = 10485760
> > journal queue max bytes = 10485760
> > ms dispatch throttle bytes = 10485760
> > objecter inflight op bytes = 10485760
> >
> > Is this a known bug in this version?
> > (Do you know a workaround to fix it?)
> >
> > ---
> > Regards
> > Dominik
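Following up on the per-PG option mentioned above: for PGs that do get
skipped, a workaround is to drive the scrub per PG rather than per OSD.
A rough, untested bash sketch -- it assumes the plain-text "ceph pg dump"
format of this era, where the first column is the pgid and the acting set
prints as [4,25,133]; the awk field number ($15 here) may differ between
versions, so check your own output before relying on it:

  #!/bin/bash
  # Scrub every PG whose acting primary is the given OSD, one at a time.
  osd=$1
  ceph pg dump 2>/dev/null |
    awk -v osd="$osd" '$1 ~ /^[0-9]+\.[0-9a-f]+$/ && $15 ~ "^\\[" osd "," {print $1}' |
    while read pgid; do
      echo "scrubbing $pgid"
      ceph pg scrub "$pgid"    # or: ceph pg deep-scrub "$pgid"
      sleep 1                  # crude pacing so scrub reservations do not pile up
    done

Invoked as e.g. "bash scrub-osd.sh 4".  PGs that are not clean will still
be skipped by the OSD itself, as described above.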
Re: [ceph-users] large memory leak on scrubbing
Thanks for your response.
Great.

Is it also fixed in the latest cuttlefish?

We have two problems with scrubbing:
- memory leaks
- slow requests, and OSDs holding the bucket index wrongly marked down
  (when scrubbing)

Now we have decided to turn off scrubbing and trigger it in a maintenance
window (see the settings sketch below).
I noticed that "ceph osd scrub" or "ceph osd deep-scrub" triggers a scrub
on the OSD, but not for all of its PGs.
Is it possible to trigger scrubbing of all PGs on one OSD?

--
Regards
Dominik

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com]
Sent: Saturday, August 17, 2013 5:11 PM
To: Mostowiec Dominik
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com; Studziński
Krzysztof; Sydor Bohdan
Subject: Re: [ceph-users] large memory leak on scrubbing

Hi Dominik,

There is a bug fixed a couple of months back that fixes excessive memory
consumption during scrub.  You can upgrade to the latest 'bobtail' branch.
See

http://ceph.com/docs/master/install/debian/#development-testing-packages

Installing that package should clear this up.

sage

On Fri, 16 Aug 2013, Mostowiec Dominik wrote:
> Hi,
> We noticed some issues on our CEPH/S3 cluster; I think they are related
> to scrubbing: large memory leaks.
>
> Logs 09.xx:
> https://www.dropbox.com/s/4z1fzg239j43igs/ceph-osd.4.log_09xx.tar.gz
> From 09.30 to 09.44 (14 minutes) the osd.4 process grows to 28G.
>
> I think this is something curious:
> 2013-08-16 09:43:48.801331 7f6570d2e700 0 log [WRN] : slow request
> 32.794125 seconds old, received at 2013-08-16 09:43:16.007104:
> osd_sub_op(unknown.0.0:0 16.113d 0//0//-1 [scrub-reserve] v 0'0
> snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
>
> We have a large rgw index and a lot of large files on this cluster.
> ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
> Setup:
> - 12 servers x 12 OSDs
> - 3 mons
> Default scrubbing settings.
> Journal and filestore settings:
> journal aio = true
> filestore flush min = 0
> filestore flusher = false
> filestore fiemap = false
> filestore op threads = 4
> filestore queue max ops = 4096
> filestore queue max bytes = 10485760
> filestore queue committing max bytes = 10485760
> journal max write bytes = 10485760
> journal queue max bytes = 10485760
> ms dispatch throttle bytes = 10485760
> objecter inflight op bytes = 10485760
>
> Is this a known bug in this version?
> (Do you know a workaround to fix it?)
>
> ---
> Regards
> Dominik
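To make the maintenance-window plan concrete: a sketch of the ceph.conf
settings for it.  The option names are the documented scrub knobs of this
version line as far as I know; the interval values are arbitrary examples,
not recommendations.  The idea is to push the automatic scrub horizons out
far beyond the maintenance cadence, so only manually triggered scrubs run:

  [osd]
      ; earliest a PG becomes eligible for automatic scrub (seconds)
      osd scrub min interval = 2592000      ; 30 days
      ; a scrub is forced regardless of load after this long
      osd scrub max interval = 31536000     ; 1 year
      ; automatic deep-scrub horizon
      osd deep scrub interval = 31536000    ; 1 year
      ; at most this many concurrent scrubs per OSD
      osd max scrubs = 1

The same values can presumably be pushed into running OSDs with
"ceph osd tell \* injectargs ...", but restarting with the new ceph.conf
is the safer route.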
Re: [ceph-users] large memory leak on scrubbing
Hi Dominik,

There is a bug fixed a couple of months back that fixes excessive memory
consumption during scrub.  You can upgrade to the latest 'bobtail' branch.
See

http://ceph.com/docs/master/install/debian/#development-testing-packages

Installing that package should clear this up.

sage

On Fri, 16 Aug 2013, Mostowiec Dominik wrote:
> Hi,
> We noticed some issues on our CEPH/S3 cluster; I think they are related
> to scrubbing: large memory leaks.
>
> Logs 09.xx:
> https://www.dropbox.com/s/4z1fzg239j43igs/ceph-osd.4.log_09xx.tar.gz
> From 09.30 to 09.44 (14 minutes) the osd.4 process grows to 28G.
>
> I think this is something curious:
> 2013-08-16 09:43:48.801331 7f6570d2e700 0 log [WRN] : slow request
> 32.794125 seconds old, received at 2013-08-16 09:43:16.007104:
> osd_sub_op(unknown.0.0:0 16.113d 0//0//-1 [scrub-reserve] v 0'0
> snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
>
> We have a large rgw index and a lot of large files on this cluster.
> ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
> Setup:
> - 12 servers x 12 OSDs
> - 3 mons
> Default scrubbing settings.
> Journal and filestore settings:
> journal aio = true
> filestore flush min = 0
> filestore flusher = false
> filestore fiemap = false
> filestore op threads = 4
> filestore queue max ops = 4096
> filestore queue max bytes = 10485760
> filestore queue committing max bytes = 10485760
> journal max write bytes = 10485760
> journal queue max bytes = 10485760
> ms dispatch throttle bytes = 10485760
> objecter inflight op bytes = 10485760
>
> Is this a known bug in this version?
> (Do you know a workaround to fix it?)
>
> ---
> Regards
> Dominik
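For completeness, the recipe behind that link boils down to adding a
gitbuilder repository for the branch and upgrading.  Roughly -- an
untested sketch; the key location and repository URL pattern are recalled
from that era's docs and should be verified against the page above before
use:

  # trust the autobuild signing key (location per the linked docs)
  wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc' \
    | sudo apt-key add -

  # point apt at the 'bobtail' branch builds from gitbuilder
  echo deb http://gitbuilder.ceph.com/ceph-deb-$(lsb_release -sc)-x86_64-basic/ref/bobtail \
    $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list

  sudo apt-get update && sudo apt-get install ceph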
Re: [ceph-users] large memory leak on scrubbing
Hi,

> Is this a known bug in this version?

Yes.

> (Do you know a workaround to fix it?)

Upgrade.

Cheers,
Sylvain