On 06/26/2014 01:08 PM, Gregory Farnum wrote:
On Thu, Jun 26, 2014 at 12:52 PM, Kevin Horan
<kho...@globalrecordings.net> wrote:
I am also getting inconsistent object errors on a regular basis, about 1-2
every week or so for about 300GB of data. All OSDs are using XFS
filesystems. Some OSDs are individual 3TB internal hard drives and some are
external FC attached raid6 arrays. I am using this cluster to store kvm
images and I've noticed that the inconsistent objects always occur on my two
most recently created VM images, even though one of them is hardly ever used
(just a bare VM not put into production yet). This all started about 4
months ago on 0.72 and is still occurring on version 0.80. I also
changed the number of replicas from 2 to 3 for the pool containing these
images and that had no effect.

Here is an example log entry:

2014-06-24 18:11:51.683310 7faf44297700  0 log [ERR] : 4.b6 shard 0: soid
c539a8b6/rbd_data.9fdea2ae8944a.00000000000004e2/head//4 digest 2541762784
!= known digest 3305022936
2014-06-24 18:11:52.107321 7faf50f60700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:52.215752 7faf5075f700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:52.365798 7faf50f60700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:52.674643 7faf5075f700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:52.749641 7faf50f60700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:55.194967 7faf5075f700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:55.259322 7faf50f60700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:55.526157 7faf5075f700  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) set_extsize: FSSETXATTR: (22)
Invalid argument
2014-06-24 18:11:55.547270 7faf44297700  0 log [ERR] : 4.b6 deep-scrub 0
missing, 1 inconsistent objects
2014-06-24 18:11:55.547282 7faf44297700  0 log [ERR] : 4.b6 deep-scrub 1
errors
Can you go find out what about those files is different? Are they
different sizes, with the overlapping pieces being the same? Are they
completely different?
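
One way to answer these questions is to compare two on-disk copies of the same object byte by byte. The following is only a minimal Python sketch, assuming both replica files have already been copied somewhere readable; the function names and the paths in the usage note are hypothetical and not part of any Ceph tooling.

```python
from pathlib import Path


def compare_bytes(a: bytes, b: bytes) -> dict:
    """Compare two replica payloads: sizes, and whether the shared prefix matches."""
    # The "overlapping piece" is the range both copies cover,
    # i.e. everything up to the length of the shorter copy.
    overlap = min(len(a), len(b))
    first_diff = None
    if a[:overlap] != b[:overlap]:
        # Only scan byte by byte when the fast comparison found a difference.
        first_diff = next(i for i in range(overlap) if a[i] != b[i])
    return {
        "size_a": len(a),
        "size_b": len(b),
        "same_size": len(a) == len(b),
        "overlap_bytes": overlap,
        "overlap_identical": first_diff is None,
        "first_diff_offset": first_diff,
    }


def compare_replicas(path_a, path_b) -> dict:
    """Convenience wrapper that reads both copies from disk (hypothetical paths)."""
    return compare_bytes(Path(path_a).read_bytes(), Path(path_b).read_bytes())
```

For example, `compare_replicas("/tmp/osd2_copy", "/tmp/osd5_copy")` (placeholder paths) would report whether the copies differ in size and, if so, whether the range they share is identical. A zero-length copy yields `overlap_bytes` of 0, so the overlap is trivially identical and only the sizes differ.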
  Here is the info block on three images:

root@vashti:~/t1# rbd info libvirt-pool/radosgw
rbd image 'radosgw':
        size 10000 MB in 2500 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.6aad02ae8944a
        format: 2
        features: layering

root@vashti:~/t1# rbd info libvirt-pool/auth-data
rbd image 'auth-data':
        size 10000 MB in 2500 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.9fdea2ae8944a
        format: 2
        features: layering

root@vashti:~/t1# rbd info libvirt-pool/auth
rbd image 'auth':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.51a3b2ae8944a
        format: 2
        features: layering
root@vashti:~/t1#

The first two incur the inconsistent objects, while the third one (which was created a year ago) does not, nor do my other older images. All of them are about 10G in size, including the non-problematic one. I'm not sure what you mean by "overlapping pieces".
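
The object name in the scrub error can be tied back to an image through its block_name_prefix. Here is a rough Python sketch, assuming the usual format 2 naming of `<block_name_prefix>.<hex object number>` and default striping, so that the byte offset is simply the object number shifted by the image order. The `locate` helper and the lookup table are hypothetical, with the table filled in from the `rbd info` output above.

```python
import re

# block_name_prefix -> image name, copied from the `rbd info` output above.
PREFIX_TO_IMAGE = {
    "rbd_data.6aad02ae8944a": "radosgw",
    "rbd_data.9fdea2ae8944a": "auth-data",
    "rbd_data.51a3b2ae8944a": "auth",
}

# Assumed naming: <block_name_prefix>.<object number in hex>
SOID_RE = re.compile(r"(rbd_data\.[0-9a-f]+)\.([0-9a-f]+)")


def locate(log_line: str, order: int = 22):
    """Map an inconsistent-object log line to (image, object number, byte offset)."""
    m = SOID_RE.search(log_line)
    if m is None:
        return None
    prefix, objno_hex = m.groups()
    objno = int(objno_hex, 16)
    return {
        "image": PREFIX_TO_IMAGE.get(prefix),  # None if the prefix is unknown
        "object_number": objno,
        # Assumes default striping: each object covers 2**order bytes.
        "byte_offset": objno << order,
    }


line = (
    "4.b6 shard 0: soid c539a8b6/rbd_data.9fdea2ae8944a.00000000000004e2/head//4 "
    "digest 2541762784 != known digest 3305022936"
)
print(locate(line))
```

For the log entry quoted earlier, this gives image 'auth-data', object number 1250 (hex 4e2), and byte offset 5242880000, i.e. 5000 MB into the 10000 MB image.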

Are your systems losing power or otherwise doing
mean things to the local filesystem?
I have not seen any kernel errors about file systems nor have I had any file system level problems.
  Have you noticed a pattern of
distribution in terms of the underlying storage system on the
inconsistent OSDs?
I have found the bad objects on PGs whose primary OSD was on a single internal drive, and in other cases the primary OSD was on an external drive.

About 3 months ago I had an event where 3 out of only 6 OSDs were down while noout was set (pool was set to size=2, min_size=1). About 2 minutes after these 3 OSDs came back up, another OSD, not one of these three, suffered a physical error and was lost. This resulted in about 10 or so lost objects. I soon got this all cleaned up and got the cluster back to the clean state (see here <https://www.mail-archive.com/ceph-users@lists.ceph.com/msg09377.html> for the full story). But it was soon after that that I started getting these inconsistent objects. Prior to that event I had gone over a year without any inconsistent objects.

There has also been a lot of restructuring going on, with new OSDs being added and/or moved (still getting it ready for production). But I always take one step and let it return to clean before taking the next step.

When I got the first inconsistent object, a simple repair didn't work, so I started trying some online suggestions of truncating objects to the correct size and/or removing objects. Some of these things caused some of the OSDs to crash and then not start again. I finally had to completely delete the image containing the bad objects, and then the OSDs started to stay up all the time again. After that one, though, all the inconsistent objects have been fixable with a simple repair.

Thanks for your help.

Kevin

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

Sometimes one of the objects has 0 size. I've also started getting the
FSSETXATTR errors recently, though I think that started after this problem
started. I've read elsewhere that these are harmless and will go away in a
future version.  I also looked in the monitor logs but didn't see any
reference to inconsistent or scrubbed objects.

Kevin
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com