On 06.09.2016 14:45, Ronny Aasen wrote:
On 06. sep. 2016 00:58, Brad Hubbard wrote:
On Mon, Sep 05, 2016 at 12:54:40PM +0200, Ronny Aasen wrote:
> Hello,
>
> I have an OSD that regularly dies on I/O, especially during scrubbing.
> Normally I would assume a bad disk and replace it, but then I normally
> see messages in dmesg about the device and its errors. For this OSD
> there are no errors in dmesg at all after a crash like this.
>
> This OSD is a 5-disk software RAID5 array, and it has had broken disks
> in the past that were replaced and the parity recalculated. It runs XFS
> with the journal on an SSD partition.
>
> I can start the OSD again and it works for a while (several days)
> before it crashes again.
> Could one of you look at the log for this OSD and see if there is any
> way to salvage it?
>
> And is there any information I should gather before I scratch the
> filesystem and recreate it? Perhaps there is some valuable insight into
> what's going on?
>
> Kind regards
> Ronny Aasen
>
>     -1> 2016-09-05 12:09:28.185977 7eff0dbb9700  1 -- 10.24.12.22:6806/7970 --> 10.24.12.25:0/2640 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.184760) v2 -- ?+0 0x6a634800 con 0x63888160
>      0> 2016-09-05 12:09:28.186884 7eff03ba5700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7eff03ba5700 time 2016-09-05 12:09:27.988279
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
Error 5 is EIO, or "I/O error", so the OSD is receiving an I/O error when it
attempts to read the file. According to this code [1], if you reproduce the
error with "debug_filestore = 10" you should be able to retrieve the object
ID and find it on disk for inspection and comparison with the other replicas.

[1] https://github.com/ceph/ceph/blob/hammer/src/os/FileStore.cc#L2852

--
Cheers,
Brad
Thanks. I have added debug_filestore = 10 to this OSD and can see a lot more
in the logs. I am going to leave it running until it crashes the next time;
hopefully that will give some more details.
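In case it is useful to anyone following along, this is roughly how the debug
level can be raised; a sketch only, assuming the OSD in question is osd.106
(as in the data path in the log below) and a stock ceph.conf layout:

  # raise FileStore debug logging on the running OSD, no restart needed
  ceph tell osd.106 injectargs '--debug-filestore 10'

  # or make it persistent across restarts by adding it to ceph.conf
  # under the OSD's section and restarting that OSD:
  #   [osd.106]
  #   debug filestore = 10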
Kind regards
Ronny Aasen
After a day's run the OSD crashed again.
   -37> 2016-09-06 18:12:07.690091 7f201ddf0700 10 filestore(/var/lib/ceph/osd/ceph-106) FileStore::read(1.30b_head/1/38a7e30b/rbd_data.545f06238e1f29.0000000000016f21/head) pread error: (5) Input/output error
Trying to read the object manually also gave an I/O error, so I rm'd the
object and let Ceph recreate it. Deep scrubbing should eventually locate all
such issues on this OSD.
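For the record, this is roughly what the inspection looked like, as a sketch;
it assumes the usual FileStore layout under the OSD's current/ directory, and
uses find because the escaped object filename on disk differs from the name in
the log:

  # locate the object file inside the PG directory on the failing OSD
  OBJ=$(find /var/lib/ceph/osd/ceph-106/current/1.30b_head/ \
        -name '*rbd_data.545f06238e1f29.0000000000016f21*')

  # a plain read of the file reproduces the EIO
  dd if="$OBJ" of=/dev/null bs=4M
  # (the same find on an OSD holding a replica gives a copy to md5sum against)

  # after removing the bad copy, a deep scrub of the PG flags the missing
  # object so it can be repaired from a replica
  ceph pg deep-scrub 1.30b
  ceph pg repair 1.30b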
Thanks for the support. :)
Although I do feel it is a bit drastic to crash the OSD on a single corrupt
file. It could have mv'd the file to a "pg_head/corrupted/../../../.."
directory for safekeeping and copied a working object from one of the
replicas. And if there were objects in an OSD's corrupted directory, it could
show a warning in ceph's status so the admin can inspect potentially
failing drives.
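Until something like that exists, the closest manual equivalent is probably
along these lines (a sketch only; stop the OSD before touching anything under
its data directory, and the exact stop/start command depends on the init
system):

  # stop the OSD, quarantine the corrupt object file, start the OSD again
  service ceph stop osd.106
  mkdir -p /root/corrupt-objects/1.30b
  mv /var/lib/ceph/osd/ceph-106/current/1.30b_head/<corrupt object file> \
     /root/corrupt-objects/1.30b/
  service ceph start osd.106

  # then have ceph copy a good object back from one of the replicas
  ceph pg repair 1.30b

Judging from the assert in the log, the crash itself is gated on
filestore_fail_eio, so setting that to false would avoid the abort, but then
the read would just return EIO to the caller, which seems worse than failing
loudly.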
Kind regards
Ronny Aasen