Re: Ceph version 0.56.1, data loss on power failure
Hi Marcin, Not sure if anyone asked, but are your OSD journals on actual disk or are you using tmpfs? Dino

On Wed, Jan 16, 2013 at 4:53 AM, Wido den Hollander w...@widodh.nl wrote:
On 01/16/2013 11:50 AM, Marcin Szukala wrote:
Hi all, any ideas how I can resolve my issue, or where the problem is? Let me describe the issue:
- Host boots up and maps an RBD image with XFS filesystems
- Host mounts the filesystems from the RBD image
- Host starts to write data to the mounted filesystems
- Host experiences power failure
- Host comes up and maps the RBD image
- Host mounts the filesystems from the RBD image
- All data from all filesystems is lost
- Host is able to use the filesystems with no problems
Filesystem is XFS, no errors on the filesystem.

That simply does not make sense to me. How can all the data be gone and the FS still mount cleanly? Can you try to format the RBD with EXT4 and see if that makes any difference. Could you also try to run a sync prior to pulling the power from the host to see if that makes any difference. Wido

Kernel 3.5.0-19-generic
root@openstack-1:/etc/init# ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at {a=10.3.82.102:6789/0,b=10.3.82.103:6789/0,d=10.3.82.105:6789/0}, election epoch 10, quorum 0,1,2 a,b,d
   osdmap e132: 56 osds: 56 up, 56 in
   pgmap v87165: 13744 pgs: 13744 active+clean; 52727 MB data, 102 GB used, 52028 GB / 52131 GB avail
   mdsmap e1: 0/0/1 up

Regards, Marcin

--
__
Dino Yancey
2GNT.com Admin
Re: Ceph version 0.56.1, data loss on power failure
On 16/01/2013 11:53, Wido den Hollander wrote:
On 01/16/2013 11:50 AM, Marcin Szukala wrote:
[...] Host starts to write data to the mounted filesystems. Host experiences power failure. [...]

You are not doing a sync there, right?

[...] All data from all filesystems is lost. Host is able to use the filesystems with no problems. Filesystem is XFS, no errors on filesystem.

You MAY have hit an XFS issue. Please follow the XFS list, in particular this thread: http://oss.sgi.com/pipermail/xfs/2012-December/023021.html If I remember correctly, this one appeared after the 3.4 kernel, and I think the fix isn't in the current Ubuntu kernel.

Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: Ceph version 0.56.1, data loss on power failure
Hi Dino, the journals are on a dedicated SSD. Regards, Marcin

2013/1/16 Dino Yancey dino2...@gmail.com:
Hi Marcin, Not sure if anyone asked, but are your OSD journals on actual disk or are you using tmpfs? Dino
[remainder of the quoted thread snipped; see the messages above]
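For anyone reproducing this setup: a dedicated journal device is configured per OSD in ceph.conf. A minimal sketch follows; the device path and size are illustrative assumptions, not Marcin's actual values:

    [osd.2]
        ; journal on a partition of the dedicated SSD (path is an example)
        osd journal = /dev/ssd0p2
        ; journal size in MB
        osd journal size = 1000

With journals on a real SSD rather than tmpfs, journaled writes should survive a power cut, which is why Dino asked.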
Re: Ceph version 0.56.1, data loss on power failure
2013/1/16 Yann Dupont yann.dup...@univ-nantes.fr:
On 16/01/2013 11:53, Wido den Hollander wrote: [...]
You are not doing a sync there, right?

Nope, no sync.

You MAY have hit an XFS issue. Please follow the XFS list, in particular this thread: http://oss.sgi.com/pipermail/xfs/2012-December/023021.html [...]

It looks like it; with ext4 I have no issue. Also, if I do a sync the data is not lost. Thank you all for the help. Regards, Marcin
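For anyone who wants to reproduce the comparison, the test cycle amounts to writing data and optionally flushing it before cutting power. A rough sketch, with the pool, image and mountpoint names made up for illustration:

    # map the image and mount the XFS filesystem
    rbd map rbd/testimg
    mount /dev/rbd0 /mnt/rbdtest

    # write some data
    dd if=/dev/zero of=/mnt/rbdtest/file bs=1M count=100

    # flush dirty pages and the fs log; skipping this step is what
    # loses the data on XFS in the affected kernels
    sync

    # now cut power to the host and check the file after reboot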
Re: REMINDER: all argonaut users should upgrade to v0.48.3argonaut
Can we use this doc as a reference for the upgrade? https://github.com/ceph/ceph/blob/eb02eaede53c03579d015ca00a888a48dbab739a/doc/install/upgrading-ceph.rst Thanks. -- Regards, Sébastien Han.

On Tue, Jan 15, 2013 at 10:49 PM, Sage Weil s...@inktank.com wrote: That there are some critical bugs that are fixed in v0.48.3, including one that can lead to data loss in power loss or kernel panic situations. Please upgrade if you have not already done so! sage
Re: [PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Hi Sage, On 01/15/2013 07:55 PM, Sage Weil wrote: Hi Jim- I just realized this didn't make it into our tree. It's now in testing, and will get merged in the next window. D'oh!

That's great news - thanks for the update. -- Jim

sage
OSDs don't start after upgrade from 0.47.2 to 0.56.1
Hi list, I tried to upgrade my Ceph cluster from 0.47.2 (openSUSE build service for SLES 11 SP2) to 0.56.1 (ceph.com/rpm/sles11/). At first I updated only one server (mon.b / osd.2) and restarted ceph on this server. After a short time, /etc/init.d/ceph -a status showed "not running" for most OSDs. At that point I tried stopping ceph on all hosts, but some OSD processes were hanging in diskwait. I updated the others, and after the processes were still not responsive I rebooted the systems. After restarting ceph, the OSDs updated the filesystem but stopped shortly afterwards. hpb020102 had the following log entries:

--
2013-01-16 15:40:25.297036 7fd348387760 0 filestore(/srv/osd.2) mount FIEMAP ioctl is supported and appears to work
2013-01-16 15:40:25.297049 7fd348387760 0 filestore(/srv/osd.2) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-01-16 15:40:25.297392 7fd348387760 0 filestore(/srv/osd.2) mount did NOT detect btrfs
2013-01-16 15:40:25.297402 7fd348387760 0 filestore(/srv/osd.2) mount syncfs(2) syscall not supported
2013-01-16 15:40:25.297405 7fd348387760 0 filestore(/srv/osd.2) mount no syncfs(2), must use sync(2).
2013-01-16 15:40:25.297407 7fd348387760 0 filestore(/srv/osd.2) mount WARNING: multiple ceph-osd daemons on the same host will be slow
2013-01-16 15:40:25.297480 7fd348387760 0 filestore(/srv/osd.2) mount found snaps
2013-01-16 15:40:25.364304 7fd348387760 0 filestore(/srv/osd.2) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-01-16 15:40:25.373353 7fd348387760 1 journal _open /srv/osd.2.journal fd 21: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-16 15:40:25.373431 7fd348387760 1 journal _open /srv/osd.2.journal fd 21: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-16 15:40:25.374388 7fd348387760 1 journal close /srv/osd.2.journal
2013-01-16 15:40:25.430719 7fd348387760 0 filestore(/srv/osd.2) mount FIEMAP ioctl is supported and appears to work
2013-01-16 15:40:25.430731 7fd348387760 0 filestore(/srv/osd.2) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-01-16 15:40:25.431011 7fd348387760 0 filestore(/srv/osd.2) mount did NOT detect btrfs
2013-01-16 15:40:25.431017 7fd348387760 0 filestore(/srv/osd.2) mount syncfs(2) syscall not supported
2013-01-16 15:40:25.431018 7fd348387760 0 filestore(/srv/osd.2) mount no syncfs(2), must use sync(2).
2013-01-16 15:40:25.431019 7fd348387760 0 filestore(/srv/osd.2) mount WARNING: multiple ceph-osd daemons on the same host will be slow
2013-01-16 15:40:25.431041 7fd348387760 0 filestore(/srv/osd.2) mount found snaps
2013-01-16 15:40:25.489620 7fd348387760 0 filestore(/srv/osd.2) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-01-16 15:40:25.494361 7fd348387760 1 journal _open /srv/osd.2.journal fd 29: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-16 15:40:25.494417 7fd348387760 1 journal _open /srv/osd.2.journal fd 29: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-16 15:40:25.494679 7fd348387760 -1 filestore(/srv/osd.2) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-01-16 15:40:25.494694 7fd348387760 -1 osd.2 0 OSD::init() : unable to read osd superblock
2013-01-16 15:40:25.495001 7fd348387760 1 journal close /srv/osd.2.journal
2013-01-16 15:40:25.495665 7fd348387760 -1 ** ERROR: osd init failed: (22) Invalid argument
---

hpb020103-hpb020106 showed:

2013-01-16 15:47:56.886005 7f504e1e9760 0 filestore(/srv/osd.5) mount FIEMAP ioctl is supported and appears to work
2013-01-16 15:47:56.886017 7f504e1e9760 0 filestore(/srv/osd.5) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-01-16 15:47:56.886291 7f504e1e9760 0 filestore(/srv/osd.5) mount did NOT detect btrfs
2013-01-16 15:47:56.886298 7f504e1e9760 0 filestore(/srv/osd.5) mount syncfs(2) syscall not supported
2013-01-16 15:47:56.886300 7f504e1e9760 0 filestore(/srv/osd.5) mount no syncfs(2), must use sync(2).
2013-01-16 15:47:56.886301 7f504e1e9760 0 filestore(/srv/osd.5) mount WARNING: multiple ceph-osd daemons on the same host will be slow
2013-01-16 15:47:56.886351 7f504e1e9760 0 filestore(/srv/osd.5) mount found snaps
2013-01-16 15:47:56.945149 7f504e1e9760 0 filestore(/srv/osd.5) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-01-16 15:47:56.953456 7f504e1e9760 1 journal _open /srv/osd.5.journal fd 21: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-16 15:47:56.953545 7f504e1e9760 1 journal _open /srv/osd.5.journal fd 21: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-16 15:47:56.955011 7f504e1e9760 1 journal close /srv/osd.5.journal
2013-01-16
8 out of 12 OSDs died after expansion on 0.56.1 (void OSD::do_waiters())
Hi, I'm testing a small Ceph cluster with Asus C60M1-1 mainboards. The setup is:
- AMD Fusion C60 CPU
- 8GB DDR3
- 1x Intel 520 120GB SSD (OS + journaling)
- 4x 1TB disk

I had two of these systems running, but yesterday I wanted to add a third one. So I had 8 OSDs (one per disk) running on 0.56.1 and I added one host, bringing the total to 12. The cluster came into a degraded state (about 50%) and it started to recover until it reached somewhere around 48%. In a matter of about 5 minutes all of the original 8 OSDs had crashed with the same backtrace:

-1 2013-01-15 17:20:29.058426 7f95a0fd8700 10 -- [2a00:f10:113:0:6051:e06c:df3:f374]:6803/4913 reaper done
0 2013-01-15 17:20:29.061054 7f959cfd0700 -1 osd/OSD.cc: In function 'void OSD::do_waiters()' thread 7f959cfd0700 time 2013-01-15 17:20:29.057714
osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked())
ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (OSD::do_waiters()+0x2c3) [0x6251f3]
2: (OSD::ms_dispatch(Message*)+0x1c4) [0x62d714]
3: (DispatchQueue::entry()+0x349) [0x8ba289]
4: (DispatchQueue::DispatchThread::entry()+0xd) [0x8137cd]
5: (()+0x7e9a) [0x7f95a95dae9a]
6: (clone()+0x6d) [0x7f95a805ecbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

So osd.0 - osd.7 were down and osd.8 - osd.11 (the new ones) were still running happily. I have to note that during this recovery the load on the first two machines spiked to 10 and the CPUs were 0% idle. This morning I started all the OSDs again with the default log level, since I don't want to stress the CPUs even more. I know the C60 CPU is kind of limited, but it's a test case! The recovery started again and it showed about 90MB/sec (Gbit network) coming into the new node. After about 4 hours the recovery completed successfully:

1736 pgs: 1736 active+clean; 837 GB data, 1671 GB used, 9501 GB / 11172 GB avail

Now, there was no high logging level on the OSDs prior to their crash; I only have the default logs. And nothing happened after I started them again, all 12 are up now. Is this a known one? If not, I'll file a bug in the tracker. Wido
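If it happens again, higher OSD debug levels captured before the crash would make the tracker report much more useful. A sketch of the usual knobs; 20/1 are common debugging values, not production settings:

    [osd]
        debug osd = 20
        debug ms = 1
        debug filestore = 20

The same can usually be injected into a running daemon without a restart, e.g. ceph osd tell 0 injectargs '--debug-osd 20' (exact syntax assumed for this release).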
Re: REMINDER: all argonaut users should upgrade to v0.48.3argonaut
On Wed, 16 Jan 2013, Sébastien Han wrote:
Can we use this doc as a reference for the upgrade? https://github.com/ceph/ceph/blob/eb02eaede53c03579d015ca00a888a48dbab739a/doc/install/upgrading-ceph.rst

Yeah. It's pretty simple in this case (since it's a point release upgrade):
- install the new packages everywhere
- restart daemons (at any rate, in any order)

The most important ones to upgrade in this case are the ceph-osds. sage
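In shell terms, Sage's procedure is roughly the following per node. The package manager and init commands are assumptions (Debian/Ubuntu shown); adapt to your distro:

    # 1) upgrade the packages on every node
    apt-get update && apt-get install --only-upgrade ceph

    # 2) restart the daemons, in any order; the OSDs matter most here
    /etc/init.d/ceph restart osd
    /etc/init.d/ceph restart mon
    /etc/init.d/ceph restart mds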
Re: Ceph version 0.56.1, data loss on power failure
FWIW, my ceph data dirs (for e.g. mons) are all on XFS. I've experienced a lot of corruption on these on power loss to the node -- and in some cases even when power wasn't lost and the box was simply rebooted. This is on Ubuntu 12.04 with the ceph-provided 3.6.3 kernel (as I'm using RBD on these). It's pretty much to the point where I'm thinking of changing them all over to ext4 for these data dirs, as the hassle of rebuilding mons constantly is just not worth the trouble. --Jeff

On Wed, Jan 16, 2013 at 9:32 AM, Marcin Szukala szukala.mar...@gmail.com wrote:
[XFS-vs-ext4 exchange quoted in full; snipped, see the messages above]
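Moving a mon's data dir onto ext4 is mostly a copy job while the daemon is stopped. A rough sketch; the device, mon id and the default data path are assumptions:

    # stop the mon, copy its data dir onto a freshly formatted ext4 partition
    service ceph stop mon.a
    mkfs.ext4 /dev/sdb3
    mount /dev/sdb3 /mnt/newmon
    cp -a /var/lib/ceph/mon/ceph-a/. /mnt/newmon/
    # remount the new partition at the original path (e.g. via fstab),
    # then start the mon again
    service ceph start mon.a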
Re: Ceph slow request unstable issue
Hi, On Wed, 16 Jan 2013, Andrey Korolyov wrote: On Wed, Jan 16, 2013 at 4:58 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

Hi list, we are suffering from OSD or OS down when there is continuing high pressure on the Ceph rack. Basically we are on Ubuntu 12.04 + Ceph 0.56.1, 6 nodes, each node with 20 spindles + 4 SSDs as journals (120 spindles in total). We create a lot of RBD volumes (say 240), map them to 16 different client machines (15 RBD volumes/client) and run dd concurrently on top of each RBD. The issues are:
1. Slow requests. From the list archive it seems solved in 0.56.1, but we still notice such warnings.
2. OSD down or even host down, like the message below. Some OSDs seem to have been blocking there for quite a long time. Suggestions are highly appreciated. Thanks, Xiaoxi

Bad news: I have moved all my Ceph machines' OS back to kernel 3.2.0-23, which Ubuntu 12.04 uses. I ran the dd command (dd if=/dev/zero bs=1M count=6 of=/dev/rbd${i}) on the Ceph clients as a data-preparation test last night. Now I have one machine down (can't be reached by ping), another two machines have all OSD daemons down, while the three left have some daemons down. I have many warnings in the OSD log like this:

no flag points reached
2013-01-15 19:14:22.769898 7f20a2d57700 0 log [WRN] : slow request 52.218106 seconds old, received at 2013-01-15 19:13:30.551718: osd_op(client.10674.1:1002417 rb.0.27a8.6b8b4567.0eba [write 3145728~524288] 2.c61810ee RETRY) currently waiting for sub ops
2013-01-15 19:14:23.770077 7f20a2d57700 0 log [WRN] : 21 slow requests, 6 included below; oldest blocked for 1132.138983 secs
2013-01-15 19:14:23.770086 7f20a2d57700 0 log [WRN] : slow request 53.216404 seconds old, received at 2013-01-15 19:13:30.553616: osd_op(client.10671.1:1066860 rb.0.282c.6b8b4567.1057 [write 2621440~524288] 2.ea7acebc) currently waiting for sub ops
2013-01-15 19:14:23.770096 7f20a2d57700 0 log [WRN] : slow request 51.442032 seconds old, received at 2013-01-15 19:13:32.327988: osd_op(client.10674.1:1002418

Similar info in dmesg we have seen previously:

[21199.036476] INFO: task ceph-osd:7788 blocked for more than 120 seconds.
[21199.037493] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[21199.038841] ceph-osd D 0006 0 7788 1 0x
[21199.038844] 880fefdafcc8 0086 ffe0
[21199.038848] 880fefdaffd8 880fefdaffd8 880fefdaffd8 00013780
[21199.038852] 88081aa58000 880f68f52de0 880f68f52de0 882017556200
[21199.038856] Call Trace:
[21199.038858] [8165a55f] schedule+0x3f/0x60
[21199.038861] [8106b7e5] exit_mm+0x85/0x130
[21199.038864] [8106b9fe] do_exit+0x16e/0x420
[21199.038866] [8109d88f] ? __unqueue_futex+0x3f/0x80
[21199.038869] [8107a19a] ? __dequeue_signal+0x6a/0xb0
[21199.038872] [8106be54] do_group_exit+0x44/0xa0
[21199.038874] [8107ccdc] get_signal_to_deliver+0x21c/0x420
[21199.038877] [81013865] do_signal+0x45/0x130
[21199.038880] [810a091c] ? do_futex+0x7c/0x1b0
[21199.038882] [810a0b5a] ? sys_futex+0x10a/0x1a0
[21199.038885] [81013b15] do_notify_resume+0x65/0x80
[21199.038887] [81664d50] int_signal+0x12/0x17

We have seen this stack trace several times over the past 6 months, but are not sure what the trigger is. In principle, the ceph server-side daemons shouldn't be capable of locking up like this, but clearly something is amiss between what they are doing in userland and how the kernel is tolerating that. Low memory, perhaps? In each case where we tried to track it down, the problem seemed to go away on its own. Is this easily reproducible in your case?

my 0.02$: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11531.html and a kernel panic on two different hosts yesterday during ceph startup (on 3.8-rc3; images from the console are available at http://imgur.com/wIRVn,k0QCS#0) lead to the suggestion that Ceph may have introduced lockup-like behavior not long ago, causing, in my case, an excessive number of context switches on the host, leading to OSD flaps and a panic in the IP-over-IB stack due to the same issue.

For the stack trace my first guess would be a problem with the IB driver that is triggered by memory pressure. Can you characterize what the system utilization (CPU, memory) looks like leading up to the lockup? sage
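To answer the utilization question reproducibly, something like this captures CPU, memory and context-switch rates leading up to a lockup. vmstat is in procps; pidstat comes from the sysstat package (an assumption that it is installed):

    # system-wide: runnable tasks, free memory, context switches, iowait
    vmstat 1 | tee vmstat.log &

    # per-process CPU for the ceph-osd daemons
    pidstat -u -C ceph-osd 1 | tee osd-cpu.log &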
Re: Ceph version 0.56.1, data loss on power failure
On Wed, 16 Jan 2013, Wido den Hollander wrote: On 01/16/2013 11:50 AM, Marcin Szukala wrote:
[failure scenario quoted in full; snipped, see the start of the thread]
That simply does not make sense to me. How can all the data be gone and the FS just mount cleanly. Can you try to format the RBD with EXT4 and see if that makes any difference. Could you also try to run a sync prior to pulling the power from the host to see if that makes any difference.

A few other quick questions:
- What version of qemu and librbd are you using?
- What is the command line that is used to start the VM?

This could be a problem with the qemu and librbd caching configuration. Thanks! sage
Re: REMINDER: all argonaut users should upgrade to v0.48.3argonaut
Thanks Sage! -- Regards, Sébastien Han.

On Wed, Jan 16, 2013 at 5:39 PM, Sage Weil s...@inktank.com wrote: On Wed, 16 Jan 2013, Sébastien Han wrote:
[upgrade question snipped]
Yeah. It's pretty simple in this case (since it's a point release upgrade): install the new packages everywhere, restart daemons (at any rate, in any order). The most important ones to upgrade in this case are the ceph-osds. sage
Re: Ceph version 0.56.1, data loss on power failure
On 16 Jan 2013, at 18:00, Sage Weil s...@inktank.com wrote: On Wed, 16 Jan 2013, Wido den Hollander wrote:
[failure scenario and earlier replies quoted in full; snipped]
A few other quick questions: What version of qemu and librbd are you using? What is the command line that is used to start the VM? This could be a problem with the qemu and librbd caching configuration.

I don't think he uses Qemu. From what I understand he uses kernel RBD, since he uses the words 'map' and 'unmap'.

Wido
Re: flashcache
On 01/16/2013 03:46 PM, Sage Weil wrote: On Wed, 16 Jan 2013, Gandalf Corvotempesta wrote: 2013/1/16 Sage Weil s...@inktank.com:
This sort of configuration effectively bundles the disk and SSD into a single unit, where the failure of either results in the loss of both. From Ceph's perspective, it doesn't matter if the thing it is sitting on is a single disk, an SSD+disk flashcache thing, or a big RAID array. All that changes is the probability of failure.

Ok, it will fail, but this should not be an issue in a cluster like Ceph, right? With or without flashcache or SSD, Ceph should be able to handle disk/node/OSD failures on its own by replicating in real time to multiple servers.

Exactly.

Should I worry about losing data in case of failure? It should rebalance automatically in case of failure with no data loss.

You should not worry, except to the extent that 2 might fail simultaneously, and failures in general are not good things. I would worry that there is a lot of stuff piling onto the SSD and it may become your bottleneck. My guess is that another 1-2 SSDs will be a better 'balance', but only experimentation will really tell us that. Otherwise, those seem to all be good things to put on the SSD!

I can't add more than 2 SSDs, I don't have enough space. I can move the OS to the first 2 spinning disks in software RAID1, if this will improve performance of the SSDs. What about swap? I'm thinking of using no swap at all and starting with 16/32GB RAM.

You could use the first (single) disk for OS and logs. You might not even bother with RAID1, since you will presumably be replicating across hosts. When the OSD disk dies, you can re-run your chef/juju/puppet rule or whatever provisioning tool is at work to reinstall/configure the OS disk. The data on the SSDs and data disks will all be intact. Other options might be network boot or even USB stick boot. sage
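One empirical check on the bottleneck question: watch the SSD's utilization against the spindles while the cluster is under load. iostat is from the sysstat package, and the device names are assumptions:

    # %util pinned near 100 on the SSD while the spindles idle means the
    # shared SSD (journal + flashcache + OS/logs) is the choke point
    iostat -x 1 /dev/sda /dev/sdb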
Re: Ceph slow request unstable issue
On Wed, Jan 16, 2013 at 10:35 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jan 16, 2013 at 8:58 PM, Sage Weil s...@inktank.com wrote:
[the preceding messages in this thread, quoted verbatim; snipped]
Re: [PATCH REPOST 0/2] libceph: embed r_trail struct in ceph_osd_request()
On 01/03/2013 03:34 PM, Alex Elder wrote: This series simplifies some of the osd client message handling by using an initialized ceph_pagelist structure to refer to the trail portion of a ceph_osd_request, rather than using a null pointer to represent "not there". -Alex

[PATCH REPOST 1/2] libceph: always allow trail in osd request
[PATCH REPOST 2/2] libceph: kill op_needs_trail()

These look good. Reviewed-by: Josh Durgin josh.dur...@inktank.com
Re: [PATCH REPOST] rbd: separate layout init
On 01/03/2013 02:55 PM, Alex Elder wrote: Pull a block of code that initializes the layout structure in an osd request into its own function so it can be reused. Signed-off-by: Alex Elder el...@inktank.com ---

Reviewed-by: Josh Durgin josh.dur...@inktank.com

 drivers/block/rbd.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index fd6a708..8e030d1 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -54,6 +54,7 @@

 /* It might be useful to have this defined elsewhere too */

+#define	U32_MAX	((u32) (~0U))
 #define	U64_MAX	((u64) (~0ULL))

 #define RBD_DRV_NAME "rbd"
@@ -1096,6 +1097,16 @@ static void rbd_coll_end_req(struct rbd_request *rbd_req,
 			ret, len);
 }

+static void rbd_layout_init(struct ceph_file_layout *layout, u64 pool_id)
+{
+	memset(layout, 0, sizeof (*layout));
+	layout->fl_stripe_unit = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
+	layout->fl_stripe_count = cpu_to_le32(1);
+	layout->fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
+	rbd_assert(pool_id <= (u64) U32_MAX);
+	layout->fl_pg_pool = cpu_to_le32((u32) pool_id);
+}
+
 /*
  * Send ceph osd request
  */
@@ -1117,7 +1128,6 @@ static int rbd_do_request(struct request *rq,
 			  u64 *ver)
 {
 	struct ceph_osd_request *osd_req;
-	struct ceph_file_layout *layout;
 	int ret;
 	u64 bno;
 	struct timespec mtime = CURRENT_TIME;
@@ -1161,14 +1171,9 @@ static int rbd_do_request(struct request *rq,
 	strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid));
 	osd_req->r_oid_len = strlen(osd_req->r_oid);

-	layout = &osd_req->r_file_layout;
-	memset(layout, 0, sizeof(*layout));
-	layout->fl_stripe_unit = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
-	layout->fl_stripe_count = cpu_to_le32(1);
-	layout->fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
-	layout->fl_pg_pool = cpu_to_le32((int) rbd_dev->spec->pool_id);
-	ret = ceph_calc_raw_layout(osdc, layout, snapid, ofs, len, &bno,
-				   osd_req, ops);
+	rbd_layout_init(&osd_req->r_file_layout, rbd_dev->spec->pool_id);
+	ret = ceph_calc_raw_layout(osdc, &osd_req->r_file_layout,
+				   snapid, ofs, len, &bno, osd_req, ops);

 	rbd_assert(ret == 0);

 	ceph_osdc_build_request(osd_req, ofs, len,
Re: [PATCH REPOST 0/6] libceph: parameter cleanup
On 01/04/2013 06:31 AM, Alex Elder wrote: This series mostly cleans up parameters used by functions in libceph, in the osd client code. -Alex

[PATCH REPOST 1/6] libceph: pass length to ceph_osdc_build_request()
[PATCH REPOST 2/6] libceph: pass length to ceph_calc_file_object_mapping()
[PATCH REPOST 3/6] libceph: drop snapid in ceph_calc_raw_layout()
[PATCH REPOST 4/6] libceph: drop osdc from ceph_calc_raw_layout()
[PATCH REPOST 5/6] libceph: don't set flags in ceph_osdc_alloc_request()
[PATCH REPOST 6/6] libceph: don't set pages or bio in ceph_osdc_alloc_request()

These all look good. Reviewed-by: Josh Durgin josh.dur...@inktank.com
Re: mds: first stab at lookup-by-ino problem/soln description
On Thu, Jan 17, 2013 at 5:52 AM, Gregory Farnum g...@inktank.com wrote:
My biggest concern with this was how it worked on clusters with multiple data pools, and Sage's initial response was to either 1) create an object for each inode that lives in the metadata pool, and holds the backtraces (rather than putting them as attributes on the first object in the file), or 2) use a more sophisticated data structure, perhaps built on Eleanor's b-tree project from last summer (http://ceph.com/community/summer-adventures-with-ceph-building-a-b-tree/)

I had thought that we could just query each data pool for the object, but Sage points out that 100-pool clusters aren't exactly unreasonable and that would take quite a lot of query time. And having the backtraces in the data pools significantly complicates things with our rules about setting layouts on new files. So this is going to need some kind of revision; please suggest alternatives! -Greg

How about using a DHT to map regular files to their parent directories, then use backtraces to find the parent directory's path?

Regards,
Yan, Zheng

On Tue, Jan 15, 2013 at 3:35 PM, Sage Weil s...@inktank.com wrote:
One of the first things we need to fix in the MDS is how we support lookup-by-ino. It's important for fsck, NFS reexport, and (insofar as there are limitations to the current anchor table design) hard links and snapshots. Below is a description of the problem and a rough sketch of my proposed solution. This is the first time I thought about the lookup algorithm in any detail, so I've probably missed something, and the 'ghost entries' bit is what came to mind on the plane. Hopefully we can think of something a bit lighter weight. Anyway, poke holes if anything isn't clear, if you have any better ideas, or if it's time to refine further. This is just a starting point for the conversation.

The problem
-----------

The MDS stores all fs metadata (files, inodes) in a hierarchy, allowing it to distribute responsibility among ceph-mds daemons by partitioning the namespace hierarchically. This is also a huge win for inode prefetching: loading the directory gets you both the names and the inodes in a single IO.

One consequence of this is that we do not have a flat inode table that lets us look up files by inode number. We *can* find directories by ino simply because they are stored in an object named after the ino. However, we can't populate the cache this way because the metadata in cache must be fully attached to the root to avoid various forms of MDS anarchy.

Lookup-by-ino is currently needed for hard links. The first link to a file is deemed the primary link, and that is where the inode is stored. Any additional links are internally remote links, and reference the inode by ino. However, there are other uses for lookup-by-ino, including NFS reexport and fsck.

Anchor table
------------

The anchor table is currently used to locate inodes that have hard links. Inodes in the anchor table are said to be anchored, and can be found by ino alone with no knowledge of their path. Normally, only inodes that have hard links need to be anchored. There are a few other cases, but they are not relevant here.

The anchor table is a flat table of records like:

  ino -> (parent ino, hash(name), refcount)

All parent inos referenced in the table also have records. The refcount includes both other records listing a given ino as parent and the anchor itself (i.e., the inode). To anchor an inode, we insert records for the ino and all ancestors (if they are not already present). An anchor removal means decrementing the ino record. Once a refcount hits 0 it can be removed, and the parent ino's refcount can be decremented. A directory rename involves changing the parent ino value for an existing record, populating the new ancestors into the table (as needed), and decrementing the old parent's refcount.

This all works great if there are a small number of anchors, but it does not scale. The entire table is managed by a single MDS, and is currently kept in memory. We do not want to anchor every inode in the system; that would be impractical. But we want lookup-by-ino for NFS reexport, and something similar/related for fsck.

Current lookup by ino procedure
-------------------------------

::

  lookup_ino(ino)
    send message mds.N -> mds.0 anchor lookup $ino
    get reply message mds.0 -> mds.N
      reply contains record for $ino and all ancestors (an anchor trace)
    parent = deepest ancestor in trace that we have in our cache
    while parent != ino
      child = parent.lookup(hash(name))
      if not found
        restart from the top
      parent = child

Directory backpointers
----------------------

There is partial infrastructure for supporting fsck that is already maintained for directories. Each directory object (the first object for the directory, if there are multiple
Re: mds: first stab at lookup-by-ino problem/soln description
On Wed, Jan 16, 2013 at 3:54 PM, Sam Lang sam.l...@inktank.com wrote: On Wed, Jan 16, 2013 at 3:52 PM, Gregory Farnum g...@inktank.com wrote:
[multiple-data-pool concern quoted in full; snipped, see the message above]

Correct me if I'm wrong, but this seems like it's only an issue in the NFS reexport case, as fsck can walk through the data objects in each pool (in parallel?) and verify back/forward consistency, so we won't have to guess which pool an ino is in. Given that, if we could stuff the pool id in the ino for the file returned through the client interfaces, then we wouldn't have to guess. -sam

I'm not familiar with the interfaces at work there. Do we have a free 32 bits we can steal in order to do that stuffing? (I *think* it would go in the NFS filehandle structure rather than the ino, right?) We would need to also store that information in order to eventually replace the anchor table, but of course that's much easier to deal with.

If we can just do it this way, that still leaves handling files which don't have any data written yet — under our current system, users can apply a data layout to any inode which has not had data written to it yet. Unfortunately that gets hard to deal with if a user touches a bunch of files and then comes back to place them the next day. :/ I suppose un-touched files could have the special property that their lookup data is stored in the metadata pool and it gets moved as soon as they have data — in the typical case files are written right away, and so this wouldn't be any more writes, just a bit more logic.
Re: [PATCH 00/29] Various fixes for MDS
Hi Yan, I reviewed these on the plane last night and they look good. There was one small cleanup I pushed on top of wip-mds (in ceph.git). I'll run this through our (still limited) fs suite and then merge into master. Thanks! sage

On Fri, 4 Jan 2013, Yan, Zheng wrote: From: Yan, Zheng zheng.z@intel.com
This patch series fixes various issues I encountered when running 3 MDSes. I tested this patch series by running fsstress on two clients, using the same test directory. The MDSes and clients could survive the overnight test at times. This patch series is also in: git://github.com/ukernel/ceph.git wip-mds
RE: Ceph slow request unstable issue
On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
Hi Sage, both CPU and memory utilization are very low. CPU is ~20% (with 60% iowait); memory utilization is even lower. I have 32 Sandy Bridge CPU cores (64 with HT), together with 128GB RAM per node.

Hmm!

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com]
Sent: January 17, 2013 0:59
To: Andrey Korolyov
Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
Subject: Re: Ceph slow request unstable issue

Hi, On Wed, 16 Jan 2013, Andrey Korolyov wrote: On Wed, Jan 16, 2013 at 4:58 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
[problem description quoted in full; snipped, see the start of the thread]

There is still an issue with throttling recovery/migration traffic leading to the slow requests; that should be fixed shortly.

[...] I have moved all my Ceph machines' OS back to kernel 3.2.0-23, which Ubuntu 12.04 uses. I ran the dd command (dd if=/dev/zero bs=1M count=6 of=/dev/rbd${i}) on the Ceph clients as a data-preparation test last night. [...]

Oooh, you are running the kernel RBD client on a 3.2 kernel. There have been a long series of fixes since then, but we've only backported as far back as 3.4. Can you try a newer kernel version for the client? Something recent in the 3.4 or 3.7 series, like 3.7.2 or 3.4.25...

Thanks!

[remaining quoted slow-request logs and dmesg output snipped; see earlier in the thread]
Re: [PATCH REPOST 0/4] rbd: explicitly support only one osd op
On 01/04/2013 06:43 AM, Alex Elder wrote: An osd request can be made up of multiple ops, all of which are completed (or not) transactionally. There is partial support for multiple ops in an rbd request in the rbd code, but it's incomplete and not even supported by the osd client or the messenger right now. I see three problems with this partial implementation: it gives a false impression of how things work; it complicates some code in cases where it's not necessary; and it may constrain how one might pursue fully implementing multiple ops in a request in ways that don't fit well with how we want to do things. So this series just simplifies things, making it explicit that there is only one op in a kernel osd client request right now. -Alex

[PATCH REPOST 1/4] rbd: pass num_op with ops array
[PATCH REPOST 2/4] libceph: pass num_op with ops
[PATCH REPOST 3/4] rbd: there is really only one op
[PATCH REPOST 4/4] rbd: assume single op in a request

These look good. Reviewed-by: Josh Durgin josh.dur...@inktank.com
Re: [PATCH REPOST 0/3] rbd: no need for file mapping calculation
On 01/04/2013 06:51 AM, Alex Elder wrote: Currently every osd request submitted by the rbd code undergoes a file mapping operation, which is common with what the ceph file system uses. But some analysis shows that there is no need to do this for rbd, because it already takes care of its own blocking of image data into distinct objects. Removing this simplifies things. I especially think removing this improves things conceptually, removing a complex mapping operation from the I/O path. -Alex

[PATCH REPOST 1/3] rbd: pull in ceph_calc_raw_layout()
[PATCH REPOST 2/3] rbd: open code rbd_calc_raw_layout()
[PATCH REPOST 3/3] rbd: don't bother calculating file mapping

We'll want to use similar methods later for fancier rbd striping with format 2 images, but that'll take more restructuring later anyway. This is fine for now. Reviewed-by: Josh Durgin josh.dur...@inktank.com
Re: [PATCH REPOST] rbd: kill ceph_osd_req_op-flags
On 01/04/2013 06:46 AM, Alex Elder wrote: The flags field of struct ceph_osd_req_op is never used, so just get rid of it. Signed-off-by: Alex Elder el...@inktank.com ---

Reviewed-by: Josh Durgin josh.dur...@inktank.com

 include/linux/ceph/osd_client.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 2b04d05..69287cc 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -157,7 +157,6 @@ struct ceph_osd_client {

 struct ceph_osd_req_op {
 	u16 op;           /* CEPH_OSD_OP_* */
-	u32 flags;        /* CEPH_OSD_FLAG_* */
 	union {
 		struct {
 			u64 offset, length;
Re: [PATCH REPOST] rbd: use a common layout for each device
On 01/04/2013 06:54 AM, Alex Elder wrote: Each osd message includes a layout structure, and for rbd it is always the same (at least for osds in a given pool). Initialize a layout structure when an rbd_dev gets created and just copy that into osd requests for the rbd image. Replace an assertion that was done when initializing the layout structures with code that catches and handles anything that would trigger the assertion as soon as it is identified. This precludes that (bad) condition from ever occurring. Signed-off-by: Alex Elder el...@inktank.com ---

Reviewed-by: Josh Durgin josh.dur...@inktank.com

 drivers/block/rbd.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 072608e..7c35608 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -235,6 +235,8 @@ struct rbd_device {

 	char			*header_name;

+	struct ceph_file_layout	layout;
+
 	struct ceph_osd_event   *watch_event;
 	struct ceph_osd_request *watch_request;

@@ -1091,16 +1093,6 @@ static void rbd_coll_end_req(struct rbd_request *rbd_req,
 			ret, len);
 }

-static void rbd_layout_init(struct ceph_file_layout *layout, u64 pool_id)
-{
-	memset(layout, 0, sizeof (*layout));
-	layout->fl_stripe_unit = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
-	layout->fl_stripe_count = cpu_to_le32(1);
-	layout->fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
-	rbd_assert(pool_id <= (u64) U32_MAX);
-	layout->fl_pg_pool = cpu_to_le32((u32) pool_id);
-}
-
 /*
  * Send ceph osd request
  */
@@ -1165,7 +1157,7 @@ static int rbd_do_request(struct request *rq,
 	strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid));
 	osd_req->r_oid_len = strlen(osd_req->r_oid);

-	rbd_layout_init(&osd_req->r_file_layout, rbd_dev->spec->pool_id);
+	osd_req->r_file_layout = rbd_dev->layout;	/* struct */

 	if (op->op == CEPH_OSD_OP_READ || op->op == CEPH_OSD_OP_WRITE) {
 		op->extent.offset = ofs;
@@ -2295,6 +2287,13 @@ struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 	rbd_dev->spec = spec;
 	rbd_dev->rbd_client = rbdc;

+	/* Initialize the layout used for all rbd requests */
+
+	rbd_dev->layout.fl_stripe_unit = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
+	rbd_dev->layout.fl_stripe_count = cpu_to_le32(1);
+	rbd_dev->layout.fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
+	rbd_dev->layout.fl_pg_pool = cpu_to_le32((u32) spec->pool_id);
+
 	return rbd_dev;
 }

@@ -2549,6 +2548,12 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
 	if (parent_spec->pool_id == CEPH_NOPOOL)
 		goto out;	/* No parent?  No problem. */

+	/* The ceph file layout needs to fit pool id in 32 bits */
+
+	ret = -EIO;
+	if (WARN_ON(parent_spec->pool_id > (u64) U32_MAX))
+		goto out;
+
 	image_id = ceph_extract_encoded_string(&p, end, NULL, GFP_KERNEL);
 	if (IS_ERR(image_id)) {
 		ret = PTR_ERR(image_id);
@@ -3678,6 +3683,13 @@ static ssize_t rbd_add(struct bus_type *bus,
 		goto err_out_client;
 	spec->pool_id = (u64) rc;

+	/* The ceph file layout needs to fit pool id in 32 bits */
+
+	if (WARN_ON(spec->pool_id > (u64) U32_MAX)) {
+		rc = -EIO;
+		goto err_out_client;
+	}
+
 	rbd_dev = rbd_dev_create(rbdc, spec);
 	if (!rbd_dev)
 		goto err_out_client;
Re: [PATCH REPOST] rbd: combine rbd sync watch/unwatch functions
On 01/04/2013 06:55 AM, Alex Elder wrote:
The rbd_req_sync_watch() and rbd_req_sync_unwatch() functions are
nearly identical. Combine them into a single function with a flag
indicating whether a watch is to be initiated or torn down.

Signed-off-by: Alex Elder <el...@inktank.com>
---

Reviewed-by: Josh Durgin <josh.dur...@inktank.com>

 drivers/block/rbd.c | 81 +--
 1 file changed, 27 insertions(+), 54 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 7c35608..c1e5f24 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1429,74 +1429,48 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
 }

 /*
- * Request sync osd watch
+ * Request sync osd watch/unwatch. The value of "start" determines
+ * whether a watch request is being initiated or torn down.
  */
-static int rbd_req_sync_watch(struct rbd_device *rbd_dev)
+static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start)
 {
 	struct ceph_osd_req_op *op;
-	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
+	struct ceph_osd_request **linger_req = NULL;
+	__le64 version = 0;
 	int ret;

 	op = rbd_create_rw_op(CEPH_OSD_OP_WATCH, 0);
 	if (!op)
 		return -ENOMEM;

-	ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0,
-				     (void *)rbd_dev, &rbd_dev->watch_event);
-	if (ret < 0)
-		goto fail;
-
-	op->watch.ver = cpu_to_le64(rbd_dev->header.obj_version);
-	op->watch.cookie = cpu_to_le64(rbd_dev->watch_event->cookie);
-	op->watch.flag = 1;
-
-	ret = rbd_req_sync_op(rbd_dev,
-			      CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
-			      op,
-			      rbd_dev->header_name,
-			      0, 0, NULL,
-			      &rbd_dev->watch_request, NULL);
-
-	if (ret < 0)
-		goto fail_event;
-
-	rbd_destroy_op(op);
-	return 0;
-
-fail_event:
-	ceph_osdc_cancel_event(rbd_dev->watch_event);
-	rbd_dev->watch_event = NULL;
-fail:
-	rbd_destroy_op(op);
-	return ret;
-}
-
-/*
- * Request sync osd unwatch
- */
-static int rbd_req_sync_unwatch(struct rbd_device *rbd_dev)
-{
-	struct ceph_osd_req_op *op;
-	int ret;
+	if (start) {
+		struct ceph_osd_client *osdc;

-	op = rbd_create_rw_op(CEPH_OSD_OP_WATCH, 0);
-	if (!op)
-		return -ENOMEM;
+		osdc = &rbd_dev->rbd_client->client->osdc;
+		ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0, rbd_dev,
+					     &rbd_dev->watch_event);
+		if (ret < 0)
+			goto done;
+		version = cpu_to_le64(rbd_dev->header.obj_version);
+		linger_req = &rbd_dev->watch_request;
+	}

-	op->watch.ver = 0;
+	op->watch.ver = version;
 	op->watch.cookie = cpu_to_le64(rbd_dev->watch_event->cookie);
-	op->watch.flag = 0;
+	op->watch.flag = (u8) start ? 1 : 0;

 	ret = rbd_req_sync_op(rbd_dev,
 			      CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
-			      op,
-			      rbd_dev->header_name,
-			      0, 0, NULL, NULL, NULL);
-
+			      op, rbd_dev->header_name,
+			      0, 0, NULL, linger_req, NULL);
+	if (!start || ret < 0) {
+		ceph_osdc_cancel_event(rbd_dev->watch_event);
+		rbd_dev->watch_event = NULL;
+	}
+done:
 	rbd_destroy_op(op);
-	ceph_osdc_cancel_event(rbd_dev->watch_event);
-	rbd_dev->watch_event = NULL;
+	return ret;
 }

@@ -3031,7 +3005,7 @@ static int rbd_init_watch_dev(struct rbd_device *rbd_dev)
 	int ret, rc;

 	do {
-		ret = rbd_req_sync_watch(rbd_dev);
+		ret = rbd_req_sync_watch(rbd_dev, 1);
 		if (ret == -ERANGE) {
 			rc = rbd_dev_refresh(rbd_dev, NULL);
 			if (rc < 0)
@@ -3750,8 +3724,7 @@ static void rbd_dev_release(struct device *dev)
 						    rbd_dev->watch_request);
 	}
 	if (rbd_dev->watch_event)
-		rbd_req_sync_unwatch(rbd_dev);
-
+		rbd_req_sync_watch(rbd_dev, 0);
 	/* clean up and free blkdev */
 	rbd_free_disk(rbd_dev);
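At the call sites, the combined function then reads like this (a usage sketch taken from the two hunks at the end of the patch):

	/* In rbd_init_watch_dev(): set up the watch. */
	ret = rbd_req_sync_watch(rbd_dev, 1);

	/* In rbd_dev_release(): tear it down again. */
	rbd_req_sync_watch(rbd_dev, 0);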
Re: [PATCH REPOST 6/6] rbd: move remaining osd op setup into rbd_osd_req_op_create()
On 01/04/2013 07:07 AM, Alex Elder wrote:
The two remaining osd ops used by rbd are CEPH_OSD_OP_WATCH and
CEPH_OSD_OP_NOTIFY_ACK. Move the setup of those operations into
rbd_osd_req_op_create(), and get rid of rbd_create_rw_op() and
rbd_destroy_op().

Signed-off-by: Alex Elder <el...@inktank.com>
---
 drivers/block/rbd.c | 68 ---
 1 file changed, 27 insertions(+), 41 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9f41c32..21fbf82 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1027,24 +1027,6 @@ out_err:
 	return NULL;
 }

-static struct ceph_osd_req_op *rbd_create_rw_op(int opcode, u64 ofs, u64 len)
-{
-	struct ceph_osd_req_op *op;
-
-	op = kzalloc(sizeof (*op), GFP_NOIO);
-	if (!op)
-		return NULL;
-
-	op->op = opcode;
-
-	return op;
-}
-
-static void rbd_destroy_op(struct ceph_osd_req_op *op)
-{
-	kfree(op);
-}
-
 struct ceph_osd_req_op *rbd_osd_req_op_create(u16 opcode, ...)
 {
 	struct ceph_osd_req_op *op;
@@ -1087,6 +1069,16 @@ struct ceph_osd_req_op *rbd_osd_req_op_create(u16 opcode, ...)
 		op->cls.indata_len = (u32) size;
 		op->payload_len += size;
 		break;
+	case CEPH_OSD_OP_NOTIFY_ACK:
+	case CEPH_OSD_OP_WATCH:
+		/* rbd_osd_req_op_create(NOTIFY_ACK, cookie, version) */
+		/* rbd_osd_req_op_create(WATCH, cookie, version, flag) */
+		op->watch.cookie = va_arg(args, u64);
+		op->watch.ver = va_arg(args, u64);
+		op->watch.ver = cpu_to_le64(op->watch.ver);	/* XXX */

why the /* XXX */ comment?

+		if (opcode == CEPH_OSD_OP_WATCH && va_arg(args, int))
+			op->watch.flag = (u8) 1;
+		break;
 	default:
 		rbd_warn(NULL, "unsupported opcode %hu\n", opcode);
 		kfree(op);
@@ -1434,14 +1426,10 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
 	struct ceph_osd_req_op *op;
 	int ret;

-	op = rbd_create_rw_op(CEPH_OSD_OP_NOTIFY_ACK, 0, 0);
+	op = rbd_osd_req_op_create(CEPH_OSD_OP_NOTIFY_ACK, notify_id, ver);
 	if (!op)
 		return -ENOMEM;

-	op->watch.ver = cpu_to_le64(ver);
-	op->watch.cookie = notify_id;
-	op->watch.flag = 0;
-
 	ret = rbd_do_request(NULL, rbd_dev, NULL, CEPH_NOSNAP,
 			     rbd_dev->header_name, 0, 0, NULL,
 			     NULL, 0,
@@ -1450,7 +1438,8 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
 			     NULL, 0,
 			     rbd_simple_req_cb, 0, NULL);

-	rbd_destroy_op(op);
+	rbd_osd_req_op_destroy(op);
+
 	return ret;
 }

@@ -1480,14 +1469,9 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
  */
 static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start)
 {
-	struct ceph_osd_req_op *op;
 	struct ceph_osd_request **linger_req = NULL;
-	__le64 version = 0;
-	int ret;
-
-	op = rbd_create_rw_op(CEPH_OSD_OP_WATCH, 0, 0);
-	if (!op)
-		return -ENOMEM;
+	struct ceph_osd_req_op *op;
+	int ret = 0;

 	if (start) {
 		struct ceph_osd_client *osdc;
@@ -1496,26 +1480,28 @@ static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start)
 		ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0, rbd_dev,
 					     &rbd_dev->watch_event);
 		if (ret < 0)
-			goto done;
-		version = cpu_to_le64(rbd_dev->header.obj_version);
+			return ret;
 		linger_req = &rbd_dev->watch_request;
+	} else {
+		rbd_assert(rbd_dev->watch_request != NULL);
 	}

-	op->watch.ver = version;
-	op->watch.cookie = cpu_to_le64(rbd_dev->watch_event->cookie);
-	op->watch.flag = (u8) start ? 1 : 0;
-
-	ret = rbd_req_sync_op(rbd_dev,
+	op = rbd_osd_req_op_create(CEPH_OSD_OP_WATCH,
+				rbd_dev->watch_event->cookie,
+				rbd_dev->header.obj_version, start);
+	if (op)
+		ret = rbd_req_sync_op(rbd_dev,
 			      CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
 			      op, rbd_dev->header_name,
 			      0, 0, NULL, linger_req, NULL);

-	if (!start || ret < 0) {
+	/* Cancel the event if we're tearing down, or on error */
+
+	if (!start || !op || ret < 0) {
 		ceph_osdc_cancel_event(rbd_dev->watch_event);
 		rbd_dev->watch_event = NULL;
 	}
-done:
-	rbd_destroy_op(op);
+	rbd_osd_req_op_destroy(op);

 	return ret;
 }
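As a usage sketch, the two calling conventions documented in the case block work out like this at the call sites (taken from the hunks above):

	/* NOTIFY_ACK: cookie (the notify id) and version, both passed as u64 */
	op = rbd_osd_req_op_create(CEPH_OSD_OP_NOTIFY_ACK, notify_id, ver);

	/* WATCH: cookie and version, plus an int flag (nonzero starts a watch) */
	op = rbd_osd_req_op_create(CEPH_OSD_OP_WATCH,
				   rbd_dev->watch_event->cookie,
				   rbd_dev->header.obj_version, start);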
Re: [PATCH REPOST 0/6] rbd: consolidate osd request setup
On 01/04/2013 07:03 AM, Alex Elder wrote:
This series consolidates and encapsulates the setup of all osd
requests into a single function which takes variable arguments
appropriate for the type of request. The result groups together
common code idioms and I think makes the spots that build these
messages a little easier to read.

					-Alex

[PATCH REPOST 1/6] rbd: don't assign extent info in rbd_do_request()
[PATCH REPOST 2/6] rbd: don't assign extent info in rbd_req_sync_op()
[PATCH REPOST 3/6] rbd: initialize off and len in rbd_create_rw_op()
[PATCH REPOST 4/6] rbd: define generalized osd request op routines
[PATCH REPOST 5/6] rbd: move call osd op setup into rbd_osd_req_op_create()
[PATCH REPOST 6/6] rbd: move remaining osd op setup into rbd_osd_req_op_create()

I'm not sure about the varargs approach. It makes it easy to
accidentally use the wrong parameters. What do you think about
replacing calls to rbd_osd_req_create_op() with helpers for the
various kinds of requests that just call rbd_osd_req_create_op()
themselves, so that the arguments can be checked at compile time?
This will probably be more of an issue with multi-op osd requests
in the future.

Eventually I think all this osd-request-related stuff should go into
libceph, but that's a cleanup for another day.

In any case, the new structure looks good to me.

Reviewed-by: Josh Durgin <josh.dur...@inktank.com>
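Josh's suggestion would look something like this, a sketch with hypothetical helper names (none of these wrappers exist in the series):

	static inline struct ceph_osd_req_op *
	rbd_osd_req_op_create_notify_ack(u64 cookie, u64 ver)
	{
		return rbd_osd_req_op_create(CEPH_OSD_OP_NOTIFY_ACK, cookie, ver);
	}

	static inline struct ceph_osd_req_op *
	rbd_osd_req_op_create_watch(u64 cookie, u64 ver, int flag)
	{
		return rbd_osd_req_op_create(CEPH_OSD_OP_WATCH, cookie, ver, flag);
	}

Each wrapper fixes the argument list, so passing the wrong number or type of arguments becomes a compile-time error instead of silently misparsed varargs.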
Re: [PATCH REPOST] rbd: assign watch request more directly
On 01/04/2013 07:07 AM, Alex Elder wrote:
Both rbd_req_sync_op() and rbd_do_request() have a "linger"
parameter, which is the address of a pointer that should refer to
the osd request structure used to issue a request to an osd. Only
one case ever supplies a non-null linger argument: an
CEPH_OSD_OP_WATCH start. And in that one case it is assigned
&rbd_dev->watch_request.

Within rbd_do_request() (where the assignment ultimately gets made)
we know the rbd_dev and therefore its watch_request field. We also
know whether the op being sent is CEPH_OSD_OP_WATCH start.

Stop opaquely passing down the linger pointer, and instead just
assign the value directly inside rbd_do_request() when it's needed.

This makes it unnecessary for rbd_req_sync_watch() to make
arrangements to hold a value that's not available until a bit later.
This more clearly separates setting up a watch request from
submitting it.

Signed-off-by: Alex Elder <el...@inktank.com>
---

Reviewed-by: Josh Durgin <josh.dur...@inktank.com>

 drivers/block/rbd.c | 20
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 21fbf82..02002b1 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1158,7 +1158,6 @@ static int rbd_do_request(struct request *rq,
 			  int coll_index,
 			  void (*rbd_cb)(struct ceph_osd_request *,
 					 struct ceph_msg *),
-			  struct ceph_osd_request **linger_req,
 			  u64 *ver)
 {
 	struct ceph_osd_client *osdc;
@@ -1210,9 +1209,9 @@ static int rbd_do_request(struct request *rq,
 	ceph_osdc_build_request(osd_req, ofs, &len, 1, op,
 				snapc, snapid, &mtime);

-	if (linger_req) {
+	if (op->op == CEPH_OSD_OP_WATCH && op->watch.flag) {
 		ceph_osdc_set_request_linger(osdc, osd_req);
-		*linger_req = osd_req;
+		rbd_dev->watch_request = osd_req;
 	}

 	ret = ceph_osdc_start_request(osdc, osd_req, false);
@@ -1296,7 +1295,6 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 			   const char *object_name,
 			   u64 ofs, u64 inbound_size,
 			   char *inbound,
-			   struct ceph_osd_request **linger_req,
 			   u64 *ver)
 {
 	int ret;
@@ -1317,7 +1315,7 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 			     op,
 			     NULL, 0,
 			     NULL,
-			     linger_req, ver);
+			     ver);
 	if (ret < 0)
 		goto done;
@@ -1383,7 +1381,7 @@ static int rbd_do_op(struct request *rq,
 			     flags,
 			     op,
 			     coll, coll_index,
-			     rbd_req_cb, 0, NULL);
+			     rbd_req_cb, NULL);
 	if (ret < 0)
 		rbd_coll_end_req_index(rq, coll, coll_index,
 					(s32) ret, seg_len);
@@ -1410,7 +1408,7 @@ static int rbd_req_sync_read(struct rbd_device *rbd_dev,
 		return -ENOMEM;

 	ret = rbd_req_sync_op(rbd_dev, CEPH_OSD_FLAG_READ,
-			      op, object_name, ofs, len, buf, NULL, ver);
+			      op, object_name, ofs, len, buf, ver);
 	rbd_osd_req_op_destroy(op);

 	return ret;
@@ -1436,7 +1434,7 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
 			     CEPH_OSD_FLAG_READ,
 			     op,
 			     NULL, 0,
-			     rbd_simple_req_cb, 0, NULL);
+			     rbd_simple_req_cb, NULL);

 	rbd_osd_req_op_destroy(op);

@@ -1469,7 +1467,6 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
  */
 static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start)
 {
-	struct ceph_osd_request **linger_req = NULL;
 	struct ceph_osd_req_op *op;
 	int ret = 0;

@@ -1481,7 +1478,6 @@ static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start)
 					     &rbd_dev->watch_event);
 		if (ret < 0)
 			return ret;
-		linger_req = &rbd_dev->watch_request;
 	} else {
 		rbd_assert(rbd_dev->watch_request != NULL);
 	}
@@ -1493,7 +1489,7 @@ static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start)
 		ret = rbd_req_sync_op(rbd_dev,
 			      CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
 			      op, rbd_dev->header_name,
-			      0, 0, NULL, linger_req, NULL);
+			      0, 0, NULL, NULL);

 	/* Cancel the event if we're tearing down, or on error */

 @@
Re: [PATCH REPOST 6/6] rbd: move remaining osd op setup into rbd_osd_req_op_create()
On 01/16/2013 10:23 PM, Josh Durgin wrote:
On 01/04/2013 07:07 AM, Alex Elder wrote:
The two remaining osd ops used by rbd are CEPH_OSD_OP_WATCH and
CEPH_OSD_OP_NOTIFY_ACK. Move the setup of those operations into
rbd_osd_req_op_create(), and get rid of rbd_create_rw_op() and
rbd_destroy_op().

[...]

+	case CEPH_OSD_OP_NOTIFY_ACK:
+	case CEPH_OSD_OP_WATCH:
+		/* rbd_osd_req_op_create(NOTIFY_ACK, cookie, version) */
+		/* rbd_osd_req_op_create(WATCH, cookie, version, flag) */
+		op->watch.cookie = va_arg(args, u64);
+		op->watch.ver = va_arg(args, u64);
+		op->watch.ver = cpu_to_le64(op->watch.ver);	/* XXX */

why the /* XXX */ comment?

Because it's the only value here that is converted from cpu byte
order. It was added in this commit:

    a71b891bc7d77a070e723c8c53d1dd73cf931555
    rbd: send header version when notifying

And I think was done without full understanding that it was being
done different from all the others. I think it may be wrong but I
haven't really looked at it yet. Pulling them all into this function
made this difference more obvious.

It was a note to self that I wanted to fix that. I normally try to
resolve anything like that before I post for review but I guess I
forgot. There may be others.
					-Alex
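The byte-order wrinkle Alex describes is the kind of mistake sparse's __bitwise annotations catch mechanically. A minimal illustration (a hypothetical struct, not the rbd code):

	#include <linux/types.h>

	/* Marking an on-wire field __le64 lets sparse ("make C=1") flag
	 * any assignment that skips the byte-order conversion.
	 */
	struct wire_watch_op {
		__le64 ver;		/* little-endian on the wire */
		__le64 cookie;
	};

	static void fill_watch_op(struct wire_watch_op *w, u64 ver, u64 cookie)
	{
		w->ver = cpu_to_le64(ver);	/* converted exactly once, here */
		w->cookie = cpu_to_le64(cookie);
		/* "w->ver = ver;" would draw a sparse warning: u64 vs __le64 */
	}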
Re: flashcache
Hi Mark,

On 16.01.2013 at 22:53, Mark wrote:
With only 2 SSDs for 12 spinning disks, you'll need to make sure the
SSDs are really fast. I use Intel 520s for testing, which are great,
but I wouldn't use them in production.

Why not? I use them for an SSD-only Ceph cluster.

Stefan
RE: Ceph slow request unstable issue
A summary of the cases tested so far (Ceph v0.56.1):

1. RBD: Ubuntu 13.04 + 3.7 kernel; OSD: Ubuntu 13.04 + 3.7 kernel, XFS.
   Result: kernel panic on both RBD and OSD sides.
2. RBD: Ubuntu 13.04 + 3.2 kernel; OSD: Ubuntu 13.04 + 3.2 kernel, XFS.
   Result: kernel panic on RBD (~15 mins).
3. RBD: Ubuntu 13.04 + 3.6.7 kernel (suggested by Ceph.com); OSD: Ubuntu 13.04 + 3.2 kernel, XFS.
   Result: auto-reset on OSD (~30 mins after the test started).
4. RBD: Ubuntu 13.04 + 3.6.7 kernel (suggested by Ceph.com); OSD: Ubuntu 12.04 + 3.2.0-36 kernel (suggested by Ceph.com), XFS.
   Result: auto-reset on OSD (~30 mins after the test started).
5. RBD: Ubuntu 13.04 + 3.6.7 kernel (suggested by Ceph.com); OSD: Ubuntu 13.04 + 3.6.7 (suggested by Sage), XFS.
   Result: seems stable for the last hour, still running now.

Tests 3 & 4 are repeatable.

My test setup:

OSD side: 3 nodes, 60 disks (20 per node, 1 per OSD), 10GbE, 4 * Intel 520 SSDs per node as journal, XFS. Each node uses 2 * Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz + 128GB RAM.

RBD side: 8 nodes; each node has 10GbE, 2 * Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 128GB RAM.

Method: create 240 RBDs, mount them across the 8 nodes (30 RBDs per node), and run dd concurrently on all 240 RBDs. After ~30 minutes, one of the OSD nodes is likely to reset.

Ceph OSD logs, syslog and dmesg from the reset node are available if you need them. (It looks to me like they hold no valuable information except a lot of slow-request warnings in the OSD's log.)

							Xiaoxi

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com]
Sent: 2013年1月17日 10:35
To: Chen, Xiaoxi
Subject: RE: Ceph slow request unstable issue

On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
No, on the OSD node, not the same node. The OSD nodes run a 3.2 kernel while the client nodes run a 3.6 kernel. We did suffer kernel panics on the rbd client nodes, but after upgrading the client kernel to 3.6.6 that seems solved.

Is it easy to try the 3.6 kernel on the osd nodes too?

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com]
Sent: 2013年1月17日 10:17
To: Chen, Xiaoxi
Subject: RE: Ceph slow request unstable issue

On Thu, 17 Jan 2013, Chen, Xiaoxi wrote:
It is easy to reproduce in my setup. Once I have enough high load on it and wait for tens of minutes, I can see such logs. As a forecast, slow requests of more than 30~60s are frequently present in the ceph osd's log.

Just replied to your other email. Do I understand correctly that you are seeing this problem on the *rbd client* nodes? Or also on the OSDs? Are they the same nodes?

sage

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com]
Sent: 2013年1月17日 0:59
To: Andrey Korolyov
Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
Subject: Re: Ceph slow request unstable issue

Hi,

On Wed, 16 Jan 2013, Andrey Korolyov wrote:
On Wed, Jan 16, 2013 at 4:58 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
Hi list,

We are suffering from OSDs or the OS going down when there is continuous high pressure on the Ceph rack. Basically we are on Ubuntu 12.04 + Ceph 0.56.1, 6 nodes, each node with 20 spindles + 4 SSDs as journal (120 spindles in total). We create a lot of RBD volumes (say 240), mount them on 16 different client machines (15 RBD volumes per client), and run dd concurrently on top of each RBD.

The issues are:
1. Slow requests. From the list archive this seems solved in 0.56.1, but we still notice such warnings.
2. OSD down or even host down, like the message below. It seems some OSD has been blocking there for quite a long time.

Suggestions are highly appreciated. Thanks.

							Xiaoxi
_________________________________

Bad news: I have rolled all my Ceph machines' OS back to kernel 3.2.0-23, which Ubuntu 12.04 uses. I ran the dd command (dd if=/dev/zero bs=1M count=6 of=/dev/rbd${i}) on the Ceph clients to prepare test data last night. Now I have one machine down (can't be reached by ping), another two machines have all OSD daemons down, while the remaining three have some daemons down. I have many warnings in the OSD log like this:

no flag points reached
2013-01-15 19:14:22.769898 7f20a2d57700 0 log [WRN] : slow request 52.218106 seconds old, received at 2013-01-15 19:13:30.551718: osd_op(client.10674.1:1002417 rb.0.27a8.6b8b4567.0eba
Re: code coverage and teuthology
On 01/15/2013 06:21 PM, Josh Durgin wrote:
On 01/15/2013 02:10 AM, Loic Dachary wrote:
On 01/14/2013 06:26 PM, Josh Durgin wrote:
Looking at how it's run automatically might help:
https://github.com/ceph/teuthology/blob/master/teuthology/coverage.py#L88

You should also add 'coverage: true' for the ceph task overrides. This way daemons are killed with SIGTERM, and the atexit function that outputs coverage information will run. Then you don't need your patch changing the flavor either.

For each task X, the docstring for teuthology.task.X.task documents example usage and extra options like this.

Hi,

That helped a lot, thanks :-) I think I'm almost there. After running:

./virtualenv/bin/teuthology --archive /tmp/a1 /srv/3node_rgw.yaml
wget -O /tmp/build/tmp.tgz http://gitbuilder.ceph.com/ceph-tarball-precise-x86_64-gcov/sha1/$(cat /tmp/a1/ceph-sha1)/ceph.x86_64.tgz
echo "ceph_build_output_dir: /tmp/build" >> ~/.teuthology.yaml
./virtualenv/bin/teuthology-coverage -v --html-output /tmp/html --lcov-output /tmp/lcov --cov-tools-dir /srv/teuthology/coverage /tmp

I get:

INFO:teuthology.coverage:initializing coverage data...
Retrieving source and .gcno files...
Initializing lcov files...
Deleting all .da files in /tmp/lcov/ceph/src and subdirectories
Done.
Capturing coverage data from /tmp/lcov/ceph/src
Found gcov version: 4.7.2
Scanning /tmp/lcov/ceph/src for .gcno files ...
Found 692 graph files in /tmp/lcov/ceph/src
Processing src/test_libhadoopcephfs_build-AuthMethodList.gcno
geninfo: ERROR: /tmp/lcov/ceph/src/test_libhadoopcephfs_build-AuthMethodList.gcno: reached unexpected end of file

root@ceph:/srv/teuthology# ls -l /tmp/lcov/ceph/src/test_libhadoopcephfs_build-AuthMethodList.gcno
-rw-r--r-- 1 root root 41088 Jan 15 09:49 /tmp/lcov/ceph/src/test_libhadoopcephfs_build-AuthMethodList.gcno

I'm using lcov: LCOV version 1.9

The only problem I can think of is that the machine I'm running lcov on is Debian GNU/Linux Wheezy, trying to analyze coverage for binaries created for Ubuntu Precise. They are both amd64, but .gcno files may have dependencies on the toolchain. Did you ever run into similar problems?

I think I did when I built and ran on debian, and it was fixed with a later version of lcov (I think 1.9-2). I didn't try doing the coverage analysis on a different distribution from where ceph was built and run, though, so that may also cause some issues.

It was indeed a compatibility problem: running lcov on precise works fine. Thanks :-)

Josh
HOWTO: teuthology and code coverage
Hi,

I'm happy to report that running teuthology to get an lcov code coverage report worked for me.

http://dachary.org/wp-uploads/2013/01/teuthology/total/mon/Monitor.cc.gcov.html

It took me a while to figure out the logic (thanks Josh for the help :-). I wrote a HOWTO explaining the steps in detail. It should be straightforward to run on an OpenStack tenant, using virtual machines instead of bare metal.

http://dachary.org/?p=1788

Cheers