On 06.09.2016 14:45, Ronny Aasen wrote:
On 06 Sep 2016 00:58, Brad Hubbard wrote:
On Mon, Sep 05, 2016 at 12:54:40PM +0200, Ronny Aasen wrote:
> Hello
>
> I have an osd that regularly dies on io, especially scrubbing.
> Normally I would assume a bad disk and replace it, but then I normally see
> messages in dmesg about the device and its errors. For this OSD
> there are no errors in dmesg at all after a crash like this.
>
> This osd is a 5 disk software raid5 array, and it has had broken disks in
> the past that have been replaced and parity recalculated. It is running XFS
> with a journal SSD partition.
>
>
> I can start the osd again and it works for a while (several days) before it
> crashes again.
> Could one of you look at the log for this osd and see if there is any way to
> salvage this osd?
>
> And is there any information I should gather before I scratch the filesystem
> and recreate it? Perhaps there is some valuable insight into what's going
> on?
>
> kind regards
> Ronny Aasen
>
>
>     -1> 2016-09-05 12:09:28.185977 7eff0dbb9700  1 -- 10.24.12.22:6806/7970 --> 10.24.12.25:0/2640 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.184760) v2 -- ?+0 0x6a634800 con 0x63888160
>      0> 2016-09-05 12:09:28.186884 7eff03ba5700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7eff03ba5700 time 2016-09-05 12:09:27.988279
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)


Error 5 is EIO, or "I/O error", of course, so it is receiving an I/O error when
it attempts to read the file. According to this code [1], if you reproduce the
error with "debug_filestore = 10" you should be able to retrieve the object ID
and find it on disk for inspection and comparison to the other replicas.

[1] https://github.com/ceph/ceph/blob/hammer/src/os/FileStore.cc#L2852

--
Cheers,
Brad
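A minimal sketch of what that on-disk inspection could look like, assuming the
object's backing file has already been located under the OSD's data directory
(FileStore keeps objects as plain files roughly under current/<pgid>_head/,
with escaped names; the paths passed on the command line here are just whatever
the debug log points at):

import errno
import hashlib
import sys

def checksum(path, bufsize=4 * 1024 * 1024):
    """Read the whole file; return its sha1, or None if the read hits EIO."""
    h = hashlib.sha1()
    try:
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(bufsize)
                if not chunk:
                    break
                h.update(chunk)
    except IOError as e:
        if e.errno == errno.EIO:  # errno 5, the -5 the FileStore assert checks
            return None
        raise
    return h.hexdigest()

if __name__ == '__main__':
    for path in sys.argv[1:]:
        digest = checksum(path)
        if digest is None:
            print('%s: read failed with EIO' % path)
        else:
            print('%s: sha1 %s' % (path, digest))

Running the same check against the corresponding file on a replica OSD should
give a clean sha1 to compare against; on the bad OSD the read is expected to
fail with EIO, which is exactly the got == -5 case the assert trips on.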

Thanks.
I have added debug_filestore = 10 to this osd and can already see a lot more in
the logs. I am going to leave it running until it crashes the next time;
hopefully that will give some more details.

kind regards
Ronny Aasen

After a day's run the osd crashed again.

-37> 2016-09-06 18:12:07.690091 7f201ddf0700 10 filestore(/var/lib/ceph/osd/ceph-106) FileStore::read(1.30b_head/1/38a7e30b/rbd_data.545f06238e1f29.0000000000016f21/head) pread error: (5) Input/output error

Trying to read the object manually also gave an I/O error, so I rm'd the object
and let ceph recreate it. Deep scrubbing should eventually locate all such
issues on this osd.
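Since the bad sectors never show up in dmesg, one way to find any remaining
unreadable objects without waiting for deep scrub would be a brute-force read
of every file under the OSD's data directory, noting which ones fail with EIO.
A rough sketch, not an official tool, assuming it is acceptable to walk the
filesystem directly (e.g. while the osd is stopped):

import errno
import os
import sys

def find_eio_files(root):
    """Walk a FileStore data dir and yield files that fail to read with EIO."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    while f.read(4 * 1024 * 1024):
                        pass
            except IOError as e:
                if e.errno == errno.EIO:
                    yield path
                # other errors (permissions etc.) are ignored here

if __name__ == '__main__':
    root = sys.argv[1] if len(sys.argv) > 1 else '/var/lib/ceph/osd/ceph-106/current'
    for bad in find_eio_files(root):
        print('EIO: %s' % bad)

Any hit could then get the same treatment: remove the file and let recovery or
a deep scrub / pg repair restore it from a replica.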

Thanks for the support. :)

I do feel, though, that it is a bit drastic to crash the osd on a single
corrupt file. It could have mv'd the file to a "pg_head/corrupted/../../../.."
directory for safekeeping and copied a working object from one of the replicas.
And if there were objects in an osd's corrupted directory, it could show a
warning in ceph's status so the admin could inspect potentially failing drives.
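Purely to illustrate that suggestion (this is not what FileStore does today;
it hits the assert instead), the quarantine step might look something like the
sketch below, with the directory name and the recovery hook being hypothetical:

import errno
import os
import shutil

def quarantine_on_eio(obj_path, pg_dir):
    """Sketch of the proposed behaviour: if reading an object hits EIO, move
    the corrupt file into a per-PG 'corrupted' directory instead of asserting,
    so the object can be restored from a replica and the admin gets a warning."""
    try:
        with open(obj_path, 'rb') as f:
            while f.read(4 * 1024 * 1024):
                pass
        return False  # readable, nothing to do
    except IOError as e:
        if e.errno != errno.EIO:
            raise
    corrupted = os.path.join(pg_dir, 'corrupted')
    if not os.path.isdir(corrupted):
        os.makedirs(corrupted)
    shutil.move(obj_path, corrupted)  # keep the bad copy around for inspection
    # ...here the OSD would queue the object for recovery from a replica and
    # raise a health warning so the admin can check the drive.
    return True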

kind regards
Ronny Aasen




