On 06.09.2016 14:45, Ronny Aasen wrote:
On 06. sep. 2016 00:58, Brad Hubbard wrote:
On Mon, Sep 05, 2016 at 12:54:40PM +0200, Ronny Aasen wrote:
> Hello,
>
> I have an OSD that regularly dies on I/O, especially during scrubbing.
> Normally I would assume a bad disk and replace it, but then I normally
> see messages in dmesg about the device and its errors. For this OSD
> there are no errors in dmesg at all after a crash like this.
>
> This OSD is a 5-disk software RAID5 array, and it has had broken disks
> in the past that were replaced and the parity recalculated. It runs XFS
> with the journal on an SSD partition.
>
> I can start the OSD again and it works for a while (several days)
> before it crashes again.
> Could one of you look at the log for this OSD and see if there is any
> way to salvage it?
>
> And is there any information I should gather before I scratch the
> filesystem and recreate it? Perhaps there is some valuable insight into
> what's going on?
>
> Kind regards
> Ronny Aasen
>
>     -1> 2016-09-05 12:09:28.185977 7eff0dbb9700  1 -- 10.24.12.22:6806/7970 --> 10.24.12.25:0/2640 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.184760) v2 -- ?+0 0x6a634800 con 0x63888160
>      0> 2016-09-05 12:09:28.186884 7eff03ba5700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7eff03ba5700 time 2016-09-05 12:09:27.988279
> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
Error 5 is EIO, or "I/O error", so the OSD is receiving an I/O error when it
attempts to read the file. According to this code [1], if you reproduce the
error with "debug_filestore = 10" you should be able to retrieve the object
ID and find it on disk for inspection and comparison with the other replicas.

[1] https://github.com/ceph/ceph/blob/hammer/src/os/FileStore.cc#L2852

--
Cheers,
Brad
Thanks. I have added debug_filestore = 10 to this OSD and can see a lot more
in the logs. I am going to leave it running until it crashes the next time;
hopefully that will give some more details.
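In case it is useful to anyone following along, this is roughly how the debug
level can be raised; a sketch only, assuming the OSD in question is osd.106
(as in the data path in the log below) and a stock ceph.conf layout:

  # raise FileStore debug logging on the running OSD, no restart needed
  ceph tell osd.106 injectargs '--debug-filestore 10'

  # or make it persistent across restarts by adding it to ceph.conf
  # under the OSD's section and restarting that OSD:
  #   [osd.106]
  #   debug filestore = 10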
Kind regards
Ronny Aasen
After a day's run the OSD crashed again.
   -37> 2016-09-06 18:12:07.690091 7f201ddf0700 10 filestore(/var/lib/ceph/osd/ceph-106) FileStore::read(1.30b_head/1/38a7e30b/rbd_data.545f06238e1f29.0000000000016f21/head) pread error: (5) Input/output error
Trying to read the object manually also gave an I/O error, so I rm'd the
object and let Ceph recreate it. Deep scrubbing should eventually locate all
such issues on this OSD.
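For the record, this is roughly what the inspection looked like, as a sketch;
it assumes the usual FileStore layout under the OSD's current/ directory, and
uses find because the escaped object filename on disk differs from the name in
the log:

  # locate the object file inside the PG directory on the failing OSD
  OBJ=$(find /var/lib/ceph/osd/ceph-106/current/1.30b_head/ \
        -name '*rbd_data.545f06238e1f29.0000000000016f21*')

  # a plain read of the file reproduces the EIO
  dd if="$OBJ" of=/dev/null bs=4M
  # (the same find on an OSD holding a replica gives a copy to md5sum against)

  # after removing the bad copy, a deep scrub of the PG flags the missing
  # object so it can be repaired from a replica
  ceph pg deep-scrub 1.30b
  ceph pg repair 1.30b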
Thanks for the support. :)
Although I do feel it is a bit drastic to crash the OSD on a single corrupt
file. It could have mv'd the file to a "pg_head/corrupted/../../../.."
directory for safekeeping and copied a working object from one of the
replicas. And if there were objects in an OSD's corrupted directory, it could
show a warning in ceph's status so the admin can inspect potentially
failing drives.
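Until something like that exists, the closest manual equivalent is probably
along these lines (a sketch only; stop the OSD before touching anything under
its data directory, and the exact stop/start command depends on the init
system):

  # stop the OSD, quarantine the corrupt object file, start the OSD again
  service ceph stop osd.106
  mkdir -p /root/corrupt-objects/1.30b
  mv /var/lib/ceph/osd/ceph-106/current/1.30b_head/<corrupt object file> \
     /root/corrupt-objects/1.30b/
  service ceph start osd.106

  # then have ceph copy a good object back from one of the replicas
  ceph pg repair 1.30b

Judging from the assert in the log, the crash itself is gated on
filestore_fail_eio, so setting that to false would avoid the abort, but then
the read would just return EIO to the caller, which seems worse than failing
loudly.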
Kind regards
Ronny Aasen