Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
On Wed, Jan 17, 2018 at 3:16 PM, Martin Emrich wrote:
> Hi!
>
> I created a tracker ticket: http://tracker.ceph.com/issues/22721
>
> It also happens without a lifecycle rule (only versioning).

Thanks.

> I collected a log from the resharding process; after 10 minutes I canceled
> it. I got a 500 MB log (still 20 MB gzipped), so I cannot upload it to the
> bug tracker.

I will try to reproduce it on my setup; it should be simpler now that I am
sure it is the versioning.

Orit

> Regards,
>
> Martin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
Hi!

I created a tracker ticket: http://tracker.ceph.com/issues/22721

It also happens without a lifecycle rule (only versioning).

I collected a log from the resharding process; after 10 minutes I canceled
it. I got a 500 MB log (still 20 MB gzipped), so I cannot upload it to the
bug tracker.

Regards,

Martin

From: Orit Wasserman
Date: Wednesday, 17 January 2018 at 11:57
To: Martin Emrich
Cc: ceph-users
Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
On Wed, Jan 17, 2018 at 11:45 AM, Martin Emrich wrote:
> Hi Orit!
>
> I did some tests, and indeed the combination of versioning/lifecycle with
> resharding is the problem:
>
> - If I do not enable versioning/lifecycle, auto-resharding works fine.
> - If I disable auto-resharding but enable versioning + lifecycle, pushing
>   data works fine until I manually reshard. This hangs as well.

Thanks for testing :) This is very helpful!

> I am currently testing with an application containing customer data, but I
> am also creating some random test data to create logs I can share.
>
> I will also test whether the versioning itself is the culprit, or if it is
> the lifecycle rule.

I am suspecting versioning (I never tried it with resharding).
Can you open a tracker issue with all the information?

Thanks,
Orit
Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
Hi Orit!

I did some tests, and indeed the combination of versioning/lifecycle with
resharding is the problem:

* If I do not enable versioning/lifecycle, auto-resharding works fine.
* If I disable auto-resharding but enable versioning + lifecycle, pushing
  data works fine until I manually reshard. This hangs as well.

My lifecycle rule (which shall remove all versions older than 60 days):

{
  "Rules": [{
    "Status": "Enabled",
    "Prefix": "",
    "NoncurrentVersionExpiration": {
      "NoncurrentDays": 60
    },
    "Expiration": {
      "ExpiredObjectDeleteMarker": true
    },
    "ID": "expire-60days"
  }]
}

I am currently testing with an application containing customer data, but I
am also creating some random test data to create logs I can share.

I will also test whether the versioning itself is the culprit, or if it is
the lifecycle rule.

Regards,

Martin

From: Orit Wasserman
Date: Tuesday, 16 January 2018 at 18:38
To: Martin Emrich
Cc: ceph-users
Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
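[Editor's note] The intent of the lifecycle rule above can be sketched in a few lines of Python: a noncurrent object version expires once it has been noncurrent for more than `NoncurrentDays` (60) days. This is only an illustration of the rule's semantics, not RGW's actual lifecycle implementation; the helper name `noncurrent_version_expired` is made up for this sketch.

```python
import json
from datetime import datetime, timedelta, timezone

# The lifecycle rule exactly as posted above.
RULE = json.loads("""
{
  "Rules": [{
    "Status": "Enabled",
    "Prefix": "",
    "NoncurrentVersionExpiration": {"NoncurrentDays": 60},
    "Expiration": {"ExpiredObjectDeleteMarker": true},
    "ID": "expire-60days"
  }]
}
""")

def noncurrent_version_expired(noncurrent_since, now=None):
    """Return True if a noncurrent version would be removed by the rule,
    i.e. it became noncurrent more than NoncurrentDays days ago."""
    now = now or datetime.now(timezone.utc)
    days = RULE["Rules"][0]["NoncurrentVersionExpiration"]["NoncurrentDays"]
    return now - noncurrent_since > timedelta(days=days)

now = datetime(2018, 1, 17, tzinfo=timezone.utc)
print(noncurrent_version_expired(now - timedelta(days=90), now))  # True
print(noncurrent_version_expired(now - timedelta(days=10), now))  # False
```

With `Prefix: ""` the rule applies to every key in the bucket, and `ExpiredObjectDeleteMarker: true` additionally cleans up delete markers that no longer shield any versions.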
Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
Hi Martin,

On Mon, Jan 15, 2018 at 6:04 PM, Martin Emrich wrote:
> Hi!
>
> After having a completely broken radosgw setup due to damaged buckets, I
> completely deleted all rgw pools and started from scratch.
>
> But my problem is reproducible. After pushing ca. 10 objects into a
> bucket, the resharding process appears to start, and the bucket is now
> unresponsive.

Sorry to hear that.
Can you share radosgw logs with --debug_rgw=20 --debug_ms=1?

> Some information:
>
> * 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
> * Ceph 12.2.2
> * Auto-resharding on, bucket versioning & lifecycle rule enabled.

What lifecycle rules do you use?

Regards,
Orit
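[Editor's note] The debug levels Orit asks for can also be set persistently in ceph.conf instead of on the command line; a minimal sketch (the section name `client.rgw.gateway1` is an assumption for this deployment — use your own RGW instance id, and restart the radosgw process afterwards):

```ini
; ceph.conf on the RGW host (section name is an assumption)
[client.rgw.gateway1]
debug_rgw = 20
debug_ms = 1
```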
[ceph-users] Bug in RadosGW resharding? Hangs again...
Hi!

After having a completely broken radosgw setup due to damaged buckets, I
completely deleted all rgw pools and started from scratch.

But my problem is reproducible. After pushing ca. 10 objects into a
bucket, the resharding process appears to start, and the bucket is now
unresponsive.

I just see lots of these messages in all rgw logs:

2018-01-15 16:57:45.108826 7fd1779b1700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.119184 7fd1779b1700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.260751 7fd1120e6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.280410 7fd1120e6700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.300775 7fd15b979700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.300971 7fd15b979700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
2018-01-15 16:57:45.301042 7fd15b979700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error

One radosgw process and two OSDs housing the bucket index/metadata are
still busy, but it seems to be stuck again.

How long is this resharding process supposed to take? I cannot believe
that an application is supposed to block for more than half an hour...

I feel inclined to open a bug report, but I am yet unsure where the
problem lies.

Some information:

* 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
* Ceph 12.2.2
* Auto-resharding on, bucket versioning & lifecycle rule enabled.

Thanks,

Martin
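[Editor's note] For readers hitting the same hang: the resharding queue can be inspected, and a stuck operation cancelled, with `radosgw-admin` (subcommands as of the Luminous 12.2.x release; output format and flags may differ between versions, and `mybucket` is a placeholder):

```shell
# List pending/active resharding operations.
radosgw-admin reshard list

# Check progress for the affected bucket.
radosgw-admin reshard status --bucket=mybucket

# Abort a stuck resharding operation on that bucket.
radosgw-admin reshard cancel --bucket=mybucket

# As a stop-gap, dynamic resharding can be disabled cluster-wide
# in ceph.conf (restart the radosgw processes afterwards):
#   [global]
#   rgw_dynamic_resharding = false
```

These commands require a live cluster with admin credentials; they are an operational sketch, not part of the original thread.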