[ceph-users] osd down after server failure

2013-10-14 Thread Dominik Mostowiec
Hi,
I had a server failure that started with a single disk failure:
Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.023986] sd 4:2:26:0:
[sdaa] Unhandled error code
Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.023990] sd 4:2:26:0:
[sdaa]  Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.023995] sd 4:2:26:0:
[sdaa] CDB: Read(10): 28 00 00 00 00 d0 00 00 10 00
Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.024005] end_request:
I/O error, dev sdaa, sector 208
Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.024744] XFS (sdaa):
metadata I/O error: block 0xd0 (xfs_trans_read_buf) error 5 buf
count 8192
Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.025879] XFS (sdaa):
xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
Oct 14 03:25:28 s3-10-177-64-6 kernel: [1027260.820288] XFS (sdaa):
metadata I/O error: block 0xd0 (xfs_trans_read_buf) error 5 buf
count 8192
Oct 14 03:25:28 s3-10-177-64-6 kernel: [1027260.821194] XFS (sdaa):
xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
Oct 14 03:25:32 s3-10-177-64-6 kernel: [1027264.667851] XFS (sdaa):
metadata I/O error: block 0xd0 (xfs_trans_read_buf) error 5 buf
count 8192

This made the server unresponsive.

After the server restart, 3 of its 26 OSDs are down.
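They show as down in the osd tree, e.g. roughly:

  ceph osd tree | grep down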
Here is the ceph-osd log after setting debug osd = 10 and restarting (how I set the debug option is sketched after the log):

2013-10-14 06:21:23.141936 7fdeb4872700 -1 osd.47 43203 *** Got signal
Terminated ***
2013-10-14 06:21:23.142141 7fdeb4872700 -1 osd.47 43203  pausing thread pools
2013-10-14 06:21:23.142146 7fdeb4872700 -1 osd.47 43203  flushing io
2013-10-14 06:21:25.406187 7f02690f9780  0
filestore(/vol0/data/osd.47) mount FIEMAP ioctl is supported and
appears to work
2013-10-14 06:21:25.406204 7f02690f9780  0
filestore(/vol0/data/osd.47) mount FIEMAP ioctl is disabled via
'filestore fiemap' config option
2013-10-14 06:21:25.406557 7f02690f9780  0
filestore(/vol0/data/osd.47) mount did NOT detect btrfs
2013-10-14 06:21:25.412617 7f02690f9780  0
filestore(/vol0/data/osd.47) mount syncfs(2) syscall fully supported
(by glibc and kernel)
2013-10-14 06:21:25.412831 7f02690f9780  0
filestore(/vol0/data/osd.47) mount found snaps 
2013-10-14 06:21:25.415798 7f02690f9780  0
filestore(/vol0/data/osd.47) mount: enabling WRITEAHEAD journal mode:
btrfs not detected
2013-10-14 06:21:26.078377 7f02690f9780  2 osd.47 0 mounting
/vol0/data/osd.47 /vol0/data/osd.47/journal
2013-10-14 06:21:26.080872 7f02690f9780  0
filestore(/vol0/data/osd.47) mount FIEMAP ioctl is supported and
appears to work
2013-10-14 06:21:26.080885 7f02690f9780  0
filestore(/vol0/data/osd.47) mount FIEMAP ioctl is disabled via
'filestore fiemap' config option
2013-10-14 06:21:26.081289 7f02690f9780  0
filestore(/vol0/data/osd.47) mount did NOT detect btrfs
2013-10-14 06:21:26.087524 7f02690f9780  0
filestore(/vol0/data/osd.47) mount syncfs(2) syscall fully supported
(by glibc and kernel)
2013-10-14 06:21:26.087582 7f02690f9780  0
filestore(/vol0/data/osd.47) mount found snaps 
2013-10-14 06:21:26.089614 7f02690f9780  0
filestore(/vol0/data/osd.47) mount: enabling WRITEAHEAD journal mode:
btrfs not detected
2013-10-14 06:21:26.726676 7f02690f9780  2 osd.47 0 boot
2013-10-14 06:21:26.726773 7f02690f9780 10 osd.47 0 read_superblock
sb(16773c25-5054-4451-bf9f-efc1f7f21b89 osd.47
63cf7d70-99cb-0ab1-4006-002f e43203 [41261,43203]
lci=[43194,43203])
2013-10-14 06:21:26.726862 7f02690f9780 10 osd.47 0 add_map_bl 43203 82622 bytes
2013-10-14 06:21:26.727184 7f02690f9780 10 osd.47 43203 load_pgs
2013-10-14 06:21:26.727643 7f02690f9780 10 osd.47 43203 load_pgs
ignoring unrecognized meta
2013-10-14 06:21:26.727681 7f02690f9780 10 osd.47 43203 load_pgs
3.df1_TEMP clearing temp
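For reference, I raised the debug level roughly like this (a ceph.conf snippet plus an OSD restart; the exact restart command depends on the distro/init setup):

  # in ceph.conf on the OSD host
  [osd]
      debug osd = 10

  # then restart the affected daemon
  service ceph restart osd.47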

osd.47 is still down, so I marked it out of the cluster:
47      1       osd.47  down    0
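For reference, marking it out was just the standard command, roughly:

  ceph osd out 47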

How can I check what is wrong?

ceph -v
ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)

-- 
Regards
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd down after server failure

2013-10-14 Thread Dominik Mostowiec
Hi,
I have found something.
After the restart, the time on the server was wrong (+2 hours) until NTP fixed it.
I restarted these 3 OSDs - it did not help.
Is it possible that Ceph has banned these OSDs? Or could an OSD have corrupted its filestore after starting with the wrong time?
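Would it help to check the osd map and run one of the daemons in the foreground to see where it stops? Something like this (a sketch, commands from memory):

  # current state of the OSD in the map
  ceph osd dump | grep osd.47

  # run the daemon in the foreground
  ceph-osd -i 47 -f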

--
Regards
Dominik


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com