Re: [ceph-users] Bug in RadosGW resharding? Hangs again...

2018-01-17 Thread Orit Wasserman
On Wed, Jan 17, 2018 at 3:16 PM, Martin Emrich wrote:

> Hi!
>
> I created a tracker ticket: http://tracker.ceph.com/issues/22721
>
> It also happens without a lifecycle rule (only versioning).
>
Thanks.


> I collected a log from the resharding process; after 10 minutes I canceled
> it. The log is 500 MB (still 20 MB gzipped), so I cannot upload it to the
> bug tracker.
>
I will try to reproduce it on my setup; it should be simpler now that I am
sure it is the versioning.

Orit


> Regards,
>
> Martin
>
> From: Orit Wasserman
> Date: Wednesday, 17 January 2018, 11:57
> To: Martin Emrich
> Cc: ceph-users
> Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
>
> On Wed, Jan 17, 2018 at 11:45 AM, Martin Emrich wrote:
>
> Hi Orit!
>
> I did some tests, and indeed the combination of Versioning/Lifecycle with
> Resharding is the problem:
>
> - If I do not enable Versioning/Lifecycle, Autoresharding works fine.
> - If I disable Autoresharding but enable Versioning+Lifecycle, pushing
>   data works fine, until I manually reshard. This also hangs.
>
> Thanks for testing :) This is very helpful!
>
> My lifecycle rule (which shall remove all versions older than 60 days):
>
> {
>     "Rules": [{
>         "Status": "Enabled",
>         "Prefix": "",
>         "NoncurrentVersionExpiration": {
>             "NoncurrentDays": 60
>         },
>         "Expiration": {
>             "ExpiredObjectDeleteMarker": true
>         },
>         "ID": "expire-60days"
>     }]
> }
>
> I am currently testing with an application containing customer data, but I
> am also creating some random test data so I can share the logs.
>
> I will also test whether the versioning itself is the culprit, or if it is
> the lifecycle rule.
>
> I suspect versioning (I have never tried it with resharding).
>
> Can you open a tracker issue with all the information?
>
> Thanks,
> Orit
>
> Regards,
> Martin
>
> From: Orit Wasserman
> Date: Tuesday, 16 January 2018, 18:38
> To: Martin Emrich
> Cc: ceph-users
> Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
>
> Hi Martin,
>
> On Mon, Jan 15, 2018 at 6:04 PM, Martin Emrich wrote:
>
> Hi!
>
> After ending up with a completely broken radosgw setup due to damaged
> buckets, I deleted all rgw pools and started from scratch.
>
> But my problem is reproducible. After pushing ca. 10 objects into a
> bucket, the resharding process appears to start, and the bucket becomes
> unresponsive.
>
> Sorry to hear that.
>
> Can you share radosgw logs with --debug_rgw=20 --debug_ms=1?
>
> I just see lots of these messages in all rgw logs:
>
> 2018-01-15 16:57:45.108826 7fd1779b1700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.119184 7fd1779b1700  0 NOTICE: resharding operation on bucket index detected, blocking
> 2018-01-15 16:57:45.260751 7fd1120e6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.280410 7fd1120e6700  0 NOTICE: resharding operation on bucket index detected, blocking
> 2018-01-15 16:57:45.300775 7fd15b979700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.300971 7fd15b979700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
> 2018-01-15 16:57:45.301042 7fd15b979700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error
>
> One radosgw process and two OSDs housing the bucket index/metadata are
> still busy, but it seems to be stuck again.
>
> How long is this resharding process supposed to take? I cannot believe
> that an application is supposed to block for more than half an hour...
>
> I feel inclined to open a bug report, but I am as yet unsure where the
> problem lies.
>
> Some information:
>
> * 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
> * Ceph 12.2.2
> * Auto-Resharding on, Bucket Versioning & Lifecycle rule enabled.
>
> What life cycle rules do you use?
>
> Regards,
> Orit
>
> Thanks,
> Martin


Re: [ceph-users] Bug in RadosGW resharding? Hangs again...

2018-01-17 Thread Martin Emrich
Hi!

I created a tracker ticket: http://tracker.ceph.com/issues/22721
It also happens without a lifecycle rule (only versioning).
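
For reference, versioning alone can be switched on through the plain S3 API; a
minimal sketch with the aws CLI (the endpoint URL, credentials and bucket name
are placeholders):

  # Hypothetical RGW endpoint and test bucket; adjust to your environment.
  export AWS_ACCESS_KEY_ID=<access-key> AWS_SECRET_ACCESS_KEY=<secret-key>
  aws --endpoint-url http://rgw.example.com:7480 s3api create-bucket --bucket reshard-test
  # Enable versioning only, with no lifecycle configuration.
  aws --endpoint-url http://rgw.example.com:7480 s3api put-bucket-versioning \
      --bucket reshard-test --versioning-configuration Status=Enabled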

I collected a log from the resharding process; after 10 minutes I canceled it.
The log is 500 MB (still 20 MB gzipped), so I cannot upload it to the bug tracker.
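
(One way to share a log of that size, assuming the machine has outbound access
to the Ceph upload drop point, is ceph-post-file, which uploads the file for
the developers and prints an ID that can be referenced in the tracker; the
description text and file name below are just examples:)

  # Compress the log and upload it; note the returned ID in the tracker issue.
  gzip -k radosgw.log                 # keeps radosgw.log, writes radosgw.log.gz
  ceph-post-file -d "rgw resharding hang, tracker 22721" radosgw.log.gz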

Regards,

Martin


From: Orit Wasserman
Date: Wednesday, 17 January 2018, 11:57
To: Martin Emrich
Cc: ceph-users
Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...



On Wed, Jan 17, 2018 at 11:45 AM, Martin Emrich <martin.emr...@empolis.com> wrote:
Hi Orit!

I did some tests, and indeed the combination of Versioning/Lifecycle with
Resharding is the problem:

* If I do not enable Versioning/Lifecycle, Autoresharding works fine.
* If I disable Autoresharding but enable Versioning+Lifecycle, pushing data
  works fine, until I manually reshard. This also hangs.

Thanks for testing :) This is very helpful!

My lifecycle rule (which shall remove all versions older than 60 days):


{
    "Rules": [{
        "Status": "Enabled",
        "Prefix": "",
        "NoncurrentVersionExpiration": {
            "NoncurrentDays": 60
        },
        "Expiration": {
            "ExpiredObjectDeleteMarker": true
        },
        "ID": "expire-60days"
    }]
}

I am currently testing with an application containing customer data, but I am
also creating some random test data so I can share the logs.
I will also test whether the versioning itself is the culprit, or if it is the
lifecycle rule.


I suspect versioning (I have never tried it with resharding).
Can you open a tracker issue with all the information?

Thanks,
Orit

Regards,
Martin

From: Orit Wasserman <owass...@redhat.com>
Date: Tuesday, 16 January 2018, 18:38
To: Martin Emrich <c...@empolis.com>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...

Hi Martin,

On Mon, Jan 15, 2018 at 6:04 PM, Martin Emrich <martin.emr...@empolis.com> wrote:
Hi!

After ending up with a completely broken radosgw setup due to damaged buckets, I
deleted all rgw pools and started from scratch.

But my problem is reproducible. After pushing ca. 10 objects into a bucket,
the resharding process appears to start, and the bucket becomes unresponsive.

Sorry to hear that.
Can you share radosgw logs with --debug_rgw=20 --debug_ms=1?

I just see lots of these messages in all rgw logs:

2018-01-15 16:57:45.108826 7fd1779b1700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.119184 7fd1779b1700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.260751 7fd1120e6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.280410 7fd1120e6700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.300775 7fd15b979700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.300971 7fd15b979700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
2018-01-15 16:57:45.301042 7fd15b979700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error

One radosgw process and two OSDs housing the bucket index/metadata are still 
busy, but it seems to be stuck again.

How long is this resharding process supposed to take? I cannot believe that an 
application is supposed to block for more than half an hour...

I feel inclined to open a bug report, but I am as yet unsure where the problem
lies.

Some information:

* 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
* Ceph 12.2.2
* Auto-Resharding on, Bucket Versioning & Lifecycle rule enabled.

What life cycle rules do you use?

Regards,
Orit
Thanks,

Martin



Re: [ceph-users] Bug in RadosGW resharding? Hangs again...

2018-01-17 Thread Orit Wasserman
On Wed, Jan 17, 2018 at 11:45 AM, Martin Emrich wrote:

> Hi Orit!
>
> I did some tests, and indeed the combination of Versioning/Lifecycle with
> Resharding is the problem:
>
> - If I do not enable Versioning/Lifecycle, Autoresharding works fine.
> - If I disable Autoresharding but enable Versioning+Lifecycle, pushing
>   data works fine, until I manually reshard. This also hangs.
>
Thanks for testing :) This is very helpful!

> My lifecycle rule (which shall remove all versions older than 60 days):
>
> {
>     "Rules": [{
>         "Status": "Enabled",
>         "Prefix": "",
>         "NoncurrentVersionExpiration": {
>             "NoncurrentDays": 60
>         },
>         "Expiration": {
>             "ExpiredObjectDeleteMarker": true
>         },
>         "ID": "expire-60days"
>     }]
> }
>
> I am currently testing with an application containing customer data, but I
> am also creating some random test data so I can share the logs.
>
> I will also test whether the versioning itself is the culprit, or if it is
> the lifecycle rule.
>

I suspect versioning (I have never tried it with resharding).
Can you open a tracker issue with all the information?

Thanks,
Orit

> Regards,
>
> Martin
>
> From: Orit Wasserman
> Date: Tuesday, 16 January 2018, 18:38
> To: Martin Emrich
> Cc: ceph-users
> Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
>
> Hi Martin,
>
> On Mon, Jan 15, 2018 at 6:04 PM, Martin Emrich wrote:
>
> Hi!
>
> After ending up with a completely broken radosgw setup due to damaged
> buckets, I deleted all rgw pools and started from scratch.
>
> But my problem is reproducible. After pushing ca. 10 objects into a
> bucket, the resharding process appears to start, and the bucket becomes
> unresponsive.
>
> Sorry to hear that.
>
> Can you share radosgw logs with --debug_rgw=20 --debug_ms=1?
>
> I just see lots of these messages in all rgw logs:
>
> 2018-01-15 16:57:45.108826 7fd1779b1700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.119184 7fd1779b1700  0 NOTICE: resharding operation on bucket index detected, blocking
> 2018-01-15 16:57:45.260751 7fd1120e6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.280410 7fd1120e6700  0 NOTICE: resharding operation on bucket index detected, blocking
> 2018-01-15 16:57:45.300775 7fd15b979700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.300971 7fd15b979700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
> 2018-01-15 16:57:45.301042 7fd15b979700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error
>
> One radosgw process and two OSDs housing the bucket index/metadata are
> still busy, but it seems to be stuck again.
>
> How long is this resharding process supposed to take? I cannot believe
> that an application is supposed to block for more than half an hour...
>
> I feel inclined to open a bug report, but I am as yet unsure where the
> problem lies.
>
> Some information:
>
> * 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
> * Ceph 12.2.2
> * Auto-Resharding on, Bucket Versioning & Lifecycle rule enabled.
>
> What life cycle rules do you use?
>
> Regards,
> Orit
>
> Thanks,
> Martin


Re: [ceph-users] Bug in RadosGW resharding? Hangs again...

2018-01-17 Thread Martin Emrich
Hi Orit!

I did some tests, and indeed the combination of Versioning/Lifecycle with
Resharding is the problem:

* If I do not enable Versioning/Lifecycle, Autoresharding works fine.
* If I disable Autoresharding but enable Versioning+Lifecycle, pushing data
  works fine, until I manually reshard (see the sketch below). This also hangs.
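
The manual reshard is done with radosgw-admin, roughly like this (bucket name
and shard count are examples; as far as I know the reshard list/status/cancel
subcommands are available in 12.2.x to inspect or clear a stuck job):

  # Manually reshard the test bucket (name and target shard count are examples).
  radosgw-admin bucket reshard --bucket=reshard-test --num-shards=64

  # While it hangs, inspect the queued/running reshard jobs, and cancel if needed.
  radosgw-admin reshard list
  radosgw-admin reshard status --bucket=reshard-test
  radosgw-admin reshard cancel --bucket=reshard-test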

My lifecycle rule (which shall remove all versions older than 60 days):


{
    "Rules": [{
        "Status": "Enabled",
        "Prefix": "",
        "NoncurrentVersionExpiration": {
            "NoncurrentDays": 60
        },
        "Expiration": {
            "ExpiredObjectDeleteMarker": true
        },
        "ID": "expire-60days"
    }]
}
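
The rule is applied through the standard S3 lifecycle API; a sketch with the
aws CLI, assuming the JSON above is saved as lifecycle.json (endpoint URL and
bucket name are placeholders):

  # Attach the lifecycle configuration above to the versioned bucket.
  aws --endpoint-url http://rgw.example.com:7480 s3api put-bucket-lifecycle-configuration \
      --bucket reshard-test --lifecycle-configuration file://lifecycle.json
  # Read it back to confirm RGW accepted the rule.
  aws --endpoint-url http://rgw.example.com:7480 s3api get-bucket-lifecycle-configuration \
      --bucket reshard-test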

I am currently testing with an application containing customer data, but I am
also creating some random test data so I can share the logs.
I will also test whether the versioning itself is the culprit, or if it is the
lifecycle rule.

Regards,
Martin

From: Orit Wasserman
Date: Tuesday, 16 January 2018, 18:38
To: Martin Emrich
Cc: ceph-users
Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...

Hi Martin,

On Mon, Jan 15, 2018 at 6:04 PM, Martin Emrich <martin.emr...@empolis.com> wrote:
Hi!

After ending up with a completely broken radosgw setup due to damaged buckets, I
deleted all rgw pools and started from scratch.

But my problem is reproducible. After pushing ca. 10 objects into a bucket,
the resharding process appears to start, and the bucket becomes unresponsive.

Sorry to hear that.
Can you share radosgw logs with --debug_rgw=20 --debug_ms=1?

I just see lots of these messages in all rgw logs:

2018-01-15 16:57:45.108826 7fd1779b1700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.119184 7fd1779b1700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.260751 7fd1120e6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.280410 7fd1120e6700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.300775 7fd15b979700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.300971 7fd15b979700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
2018-01-15 16:57:45.301042 7fd15b979700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error

One radosgw process and two OSDs housing the bucket index/metadata are still 
busy, but it seems to be stuck again.

How long is this resharding process supposed to take? I cannot believe that an 
application is supposed to block for more than half an hour...

I feel inclined to open a bug report, but I am as yet unsure where the problem
lies.

Some information:

* 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
* Ceph 12.2.2
* Auto-Resharding on, Bucket Versioning & Lifecycle rule enabled.

What life cycle rules do you use?

Regards,
Orit
Thanks,

Martin



Re: [ceph-users] Bug in RadosGW resharding? Hangs again...

2018-01-16 Thread Orit Wasserman
Hi Martin,

On Mon, Jan 15, 2018 at 6:04 PM, Martin Emrich wrote:

> Hi!
>
> After ending up with a completely broken radosgw setup due to damaged
> buckets, I deleted all rgw pools and started from scratch.
>
> But my problem is reproducible. After pushing ca. 10 objects into a
> bucket, the resharding process appears to start, and the bucket becomes
> unresponsive.
>
Sorry to hear that.
Can you share radosgw logs with --debug_rgw=20 --debug_ms=1?
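
(If restarting with those flags is inconvenient, the debug levels can also be
raised on the running daemon; a sketch, assuming an instance named
client.rgw.host1 with its admin socket in the default location:)

  # Raise log levels on the running radosgw via its admin socket
  # (daemon name and socket path are examples).
  ceph daemon /var/run/ceph/ceph-client.rgw.host1.asok config set debug_rgw 20
  ceph daemon /var/run/ceph/ceph-client.rgw.host1.asok config set debug_ms 1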


> I just see lots of these messages in all rgw logs:
>
> 2018-01-15 16:57:45.108826 7fd1779b1700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.119184 7fd1779b1700  0 NOTICE: resharding operation on bucket index detected, blocking
> 2018-01-15 16:57:45.260751 7fd1120e6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.280410 7fd1120e6700  0 NOTICE: resharding operation on bucket index detected, blocking
> 2018-01-15 16:57:45.300775 7fd15b979700  0 block_while_resharding ERROR: bucket is still resharding, please retry
> 2018-01-15 16:57:45.300971 7fd15b979700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
> 2018-01-15 16:57:45.301042 7fd15b979700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error
>
> One radosgw process and two OSDs housing the bucket index/metadata are
> still busy, but it seems to be stuck again.
>
> How long is this resharding process supposed to take? I cannot believe
> that an application is supposed to block for more than half an hour...
>
> I feel inclined to open a bug report, but I am as yet unsure where the
> problem lies.
>
> Some information:
>
> * 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
> * Ceph 12.2.2
> * Auto-Resharding on, Bucket Versioning & Lifecycle rule enabled.
>
>
What life cycle rules do you use?

Regards,
Orit

> Thanks,
>
> Martin
>


[ceph-users] Bug in RadosGW resharding? Hangs again...

2018-01-15 Thread Martin Emrich

Hi!

After ending up with a completely broken radosgw setup due to damaged buckets, I
deleted all rgw pools and started from scratch.

But my problem is reproducible. After pushing ca. 10 objects into a bucket,
the resharding process appears to start, and the bucket becomes unresponsive.


I just see lots of these messages in all rgw logs:

2018-01-15 16:57:45.108826 7fd1779b1700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.119184 7fd1779b1700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.260751 7fd1120e6700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.280410 7fd1120e6700  0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.300775 7fd15b979700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.300971 7fd15b979700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
2018-01-15 16:57:45.301042 7fd15b979700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error


One radosgw process and two OSDs housing the bucket index/metadata are 
still busy, but it seems to be stuck again.
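
For reference, the bucket's shard count and per-shard object counts can be
inspected read-only while it is stuck; a sketch (the bucket name is a
placeholder):

  # Per-bucket shard count, objects per shard and fill status.
  radosgw-admin bucket limit check
  # General bucket info (id, marker, usage).
  radosgw-admin bucket stats --bucket=reshard-test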


How long is this resharding process supposed to take? I cannot believe 
that an application is supposed to block for more than half an hour...


I feel inclined to open a bug report, but I am as yet unsure where the
problem lies.


Some information:

* 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
* Ceph 12.2.2
* Auto-Resharding on, Bucket Versioning & Lifecycle rule enabled.
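
(For context, dynamic resharding in 12.2.x is governed by a couple of rgw
options; a sketch of how to check them on a running gateway, with the daemon
name as a placeholder; the defaults should be true and 100000 respectively, as
far as I know:)

  ceph daemon /var/run/ceph/ceph-client.rgw.host1.asok config get rgw_dynamic_resharding
  ceph daemon /var/run/ceph/ceph-client.rgw.host1.asok config get rgw_max_objs_per_shard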

Thanks,

Martin
