Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-03-14 Thread David Zafman
The fix for tracker 20089 undid the changes you're seeing in the 15368 pull request. The attr name mismatch of 'hinfo_key' means that key is missing, because every erasure coded object should have a key called "hinfo_key". You should try to determine why your extended attributes are getting

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-03-13 Thread Graham Allan
Updated cluster now to 12.2.4 and the cycle of inconsistent->repair->unfound seems to continue, though possibly slightly differently. A pg does pass through an "active+clean" phase after repair, which might be new, but more likely I never observed it at the right time before. I see messages l

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-19 Thread Graham Allan
On 02/17/2018 12:48 PM, David Zafman wrote: The commits below came after v12.2.2 and may impact this issue. When a pg is active+clean+inconsistent, it means that scrub has detected issues with 1 or more replicas of 1 or more objects. An unfound object is a potentially temporary state in which

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-17 Thread David Zafman
The commits below came after v12.2.2 and may impact this issue. When a pg is active+clean+inconsistent, it means that scrub has detected issues with 1 or more replicas of 1 or more objects. An unfound object is a potentially temporary state in which the current set of available OSDs doesn't all

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Gregory Farnum
On Fri, Feb 16, 2018 at 12:17 PM Graham Allan wrote: > On 02/16/2018 12:31 PM, Graham Allan wrote: > > > > If I set debug rgw=1 and debug ms=1 before running the "object stat" > > command, it seems to stall in a loop of trying to communicate with osds for > > pool 96, which is .rgw.control > > > >>

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Graham Allan
On 02/16/2018 12:31 PM, Graham Allan wrote: If I set debug rgw=1 and debug ms=1 before running the "object stat" command, it seems to stall in a loop of trying to communicate with osds for pool 96, which is .rgw.control 10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- osd_op(unknown.0.0:54

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-16 Thread Graham Allan
On 02/15/2018 05:33 PM, Gregory Farnum wrote: On Thu, Feb 15, 2018 at 3:10 PM Graham Allan wrote: A lot more in xattrs which I won't paste, though the keys are: > root@cephmon1:~# ssh ceph03 find /var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Gregory Farnum
On Thu, Feb 15, 2018 at 3:10 PM Graham Allan wrote: > On 02/15/2018 11:58 AM, Gregory Farnum wrote: > > > > Well, if the objects were uploaded using multi-part upload I believe the > > objects you’re looking at here will only contain omap (or xattr?) > > entries pointing to the part files, so the

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Graham Allan
On 02/15/2018 11:58 AM, Gregory Farnum wrote: Well, if the objects were uploaded using multi-part upload I believe the objects you’re looking at here will only contain omap (or xattr?) entries pointing to the part files, so the empty file data is to be expected. This might also make slightly

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Gregory Farnum
On Thu, Feb 15, 2018 at 9:41 AM Graham Allan wrote: > Hi Greg, > > On 02/14/2018 11:49 AM, Gregory Farnum wrote: > > > > On Tue, Feb 13, 2018 at 8:41 AM Graham Allan wrote: > > > > I'm replying to myself here, but it's probably worth mentioning that > > after thi

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-15 Thread Graham Allan
Hi Greg, On 02/14/2018 11:49 AM, Gregory Farnum wrote: On Tue, Feb 13, 2018 at 8:41 AM Graham Allan wrote: I'm replying to myself here, but it's probably worth mentioning that after this started, I did bring back the failed host, though with "ceph osd weight

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-14 Thread Gregory Farnum
On Tue, Feb 13, 2018 at 8:41 AM Graham Allan wrote: > I'm replying to myself here, but it's probably worth mentioning that > after this started, I did bring back the failed host, though with "ceph > osd weight 0" to avoid more data movement. > > For inconsistent pgs containing unfound objects, th

Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-13 Thread Graham Allan
I'm replying to myself here, but it's probably worth mentioning that after this started, I did bring back the failed host, though with "ceph osd weight 0" to avoid more data movement. For inconsistent pgs containing unfound objects, the output of "ceph pg query" does then show the original os

[ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-02-12 Thread Graham Allan
Hi, For the past few weeks I've been seeing a large number of pgs on our main erasure coded pool being flagged inconsistent, followed by them becoming active+recovery_wait+inconsistent with unfound objects. The cluster is currently running luminous 12.2.2 but has in the past also run its way