Re: OSD crash
On 17.06.2012 23:16, Sage Weil wrote: Hi Stefan, I opened http://tracker.newdream.net/issues/2599 to track this, but the dump strangely does not include the ceph version or commit sha1. What version were you running?

Sorry, that was my build system: it accidentally removed the .git dir while building, so the version string couldn't be compiled in. It was 5efaa8d7799347dfae38333b1fd6e1a87dc76b28.

Stefan
stable 10GBE under 3.4.2 or 3.5.0-rc2
Hi list,

I still have problems with stable network speed under recent kernels. With 3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or 3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea when this happens. It sometimes works for minutes at 9.9 Gbit/s and then suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels.

I'm using various tunings recommended by Intel (Improving Performance) = http://downloadmirror.intel.com/5874/eng/README.txt

Does anybody have a hint for me, or additional settings Intel does not mention?

Stefan
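For reference, a minimal sketch of measuring this kind of point-to-point throughput with iperf (assuming iperf was the measurement tool used; node-b is a placeholder for the peer host):

# on the receiving host
iperf -s
# on the sending host: run for 60 seconds, report once per second
iperf -c node-b -t 60 -i 1
# swap the roles to measure the other direction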
Re: Updating OSD from current stable (0.47-2) to next failed with broken filestore
Hi Sage,

it's fixed now in the 'next' branch. We're using XFS for data storage. Thanks for fixing this.

Simon

On 17.06.12 23:22, Sage Weil wrote: On Sun, 17 Jun 2012, Sage Weil wrote: Hi Simon, We've opened http://tracker.newdream.net/issues/2598 to track this. Actually, having looked at the code, I'm pretty sure I see the problem. I pushed a fix to the 'next' branch. Can you try the latest and see if it resolves the problem? (Also, out of curiosity, what file system are you running underneath the ceph-osd?) Thanks! sage

On Sat, 16 Jun 2012, Simon Frerichs | Fremaks GmbH wrote: Hi, I tried updating one of our osds from stable 0.47-2 to the latest next branch; it started updating the filestore and failed. After that neither the next branch osd nor the stable osd would start with this filestore anymore. Is there something wrong with the filestore update?

Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134135 7ffed3e35780 0 filestore(/data/osd11) mount FIEMAP ioctl is supported and appears to work
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134163 7ffed3e35780 0 filestore(/data/osd11) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134476 7ffed3e35780 0 filestore(/data/osd11) mount did NOT detect btrfs
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134485 7ffed3e35780 0 filestore(/data/osd11) mount syncfs(2) syscall not support by glibc
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134513 7ffed3e35780 0 filestore(/data/osd11) mount no syncfs(2), must use sync(2).
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134514 7ffed3e35780 0 filestore(/data/osd11) mount WARNING: multiple ceph-osd daemons on the same host will be slow
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134551 7ffed3e35780 -1 filestore(/data/osd11) FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is set, DO NOT USE THIS OPTION IF YOU DO NOT KNOW WHAT IT DOES. More details can be found on the wiki.
Jun 16 14:10:03 fcstore01 ceph-osd: 2012-06-16 14:10:03.134585 7ffed3e35780 0 filestore(/data/osd11) mount found snaps
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.531974 7ffed3e35780 0 filestore(/data/osd11) mount: enabling WRITEAHEAD journal mode: btrfs not detected
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.543721 7ffed3e35780 1 journal _open /dev/sdb1 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 0
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.588059 7ffed3e35780 1 journal _open /dev/sdb1 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 0
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.588905 7ffed3e35780 -1 FileStore is old at version 2. Updating...
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.588914 7ffed3e35780 -1 Removing tmp pgs
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.594362 7ffed3e35780 -1 Getting collections
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.594369 7ffed3e35780 -1 597 to process.
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.595195 7ffed3e35780 -1 0/597 processed
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.595213 7ffed3e35780 -1 Updating collection omap current version is 0
Jun 16 14:10:12 fcstore01 ceph-osd: 2012-06-16 14:10:12.662274 7ffed3e35780 -1 os/FlatIndex.cc: In function 'virtual int FlatIndex::collection_list_partial(const hobject_t&, int, int, snapid_t, std::vector<hobject_t>*, hobject_t*)' thread 7ffed3e35780 time 2012-06-16 14:10:12.637479
os/FlatIndex.cc: 386: FAILED assert(0)

 ceph version 0.47.2-500-g1e899d0 (commit:1e899d08e61bbba0af6f3600b6bc9a5fc9e5c2e9)
 1: /usr/local/bin/ceph-osd() [0x6b337d]
 2: (FileStore::collection_list_partial(coll_t, hobject_t, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x9c) [0x67b24c]
 3: (OSD::convert_collection(ObjectStore*, coll_t)+0x529) [0x5b90e9]
 4: (OSD::do_convertfs(ObjectStore*)+0x46f) [0x5b9b9f]
 5: (OSD::convertfs(std::string const&, std::string const&)+0x47) [0x5ba127]
 6: (main()+0x967) [0x531d07]
 7: (__libc_start_main()+0xfd) [0x7ffed1d8aead]
 8: /usr/local/bin/ceph-osd() [0x5357b9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Simon
Re: stable 10GBE under 3.4.2 or 3.5.0-rc2
Maybe try to play with net.ipv4.tcp_congestion_control? The available algorithms are:

• BIC - The default on Gentoo
• Reno - The classic TCP protocol. Most OSes use this.
• highspeed - HighSpeed TCP: Sally Floyd's suggested algorithm
• htcp - Hamilton TCP
• hybla - For Satellite Links
• scalable - Scalable TCP
• vegas - Vegas TCP
• westwood - Optimized for lossy networks

http://lwn.net/Articles/128681/ The High-speed TCP algorithm is optimized for very fat pipes - 10G Ethernet and such. When things are congested, it behaves much like the Reno algorithm. When the congestion window is being increased, however, the high-speed algorithm makes use of a table to pick large increment values. This approach lets the congestion window get very large (i.e. tens of thousands of segments) quickly, and to stay large, without requiring that the network function for long periods of time without a single dropped packet.

----- Original Message -----
From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: ceph-devel@vger.kernel.org
Sent: Monday, 18 June 2012 09:40:47
Subject: stable 10GBE under 3.4.2 or 3.5.0-rc2

Hi list, I still have problems with stable network speed under recent kernels. With 3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or 3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea when this happens. It sometimes works for minutes at 9.9 Gbit/s and then suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels. I'm using various tunings recommended by Intel (Improving Performance) = http://downloadmirror.intel.com/5874/eng/README.txt Does anybody have a hint for me, or additional settings Intel does not mention? Stefan

--
Alexandre Derumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
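A quick sketch of switching the congestion control algorithm as suggested above (htcp is only an example choice; on some kernels the corresponding module has to be loaded first):

# list the algorithms the running kernel offers
sysctl net.ipv4.tcp_available_congestion_control
# show the current default
sysctl net.ipv4.tcp_congestion_control
# try another algorithm, e.g. htcp
sysctl -w net.ipv4.tcp_congestion_control=htcp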
Re: all rbd users: set 'filestore fiemap = false'
Hi Sage,

On 06/18/2012 06:02 AM, Sage Weil wrote: If you are using RBD, and want to avoid potential image corruption, add filestore fiemap = false to the [osd] section of your ceph.conf and restart your OSDs.

As far as it goes, this heals some trouble, but I honestly don't understand...

We've tracked down the source of some corruption to racy/buggy FIEMAP ioctl behavior. The RBD client (when caching is disabled--the default) uses a 'sparse read' operation that the OSD implements by doing an fsync on the object file, mapping which extents are allocated, and sending only that data over the wire. We have observed incorrect/changing FIEMAP on both btrfs:
fsync
fiemap returns mapping
time passes, no modifications to file
fiemap returns different mapping

... that even an initial start of a VM leads to corruption of the read data? I get something like:

--- 8< ---
Loading, please wait
/sbin/init: relocation error: ... not defined in file libc.so.6...
[ 0.81...] Kernel panic - not syncing: Attempted to kill init!
--- 8< ---

The host kernel is now 3.4.1 + qemu-1.0.1, but it shows failures with other kernel/qemu versions, too. Keeping fingers crossed for Josh, though ;-) Give me a shout if I can do some debugging.

Regards, Oliver.

Josh is still tracking down which kernels and file systems are affected; fortunately it is relatively easy to reproduce with the test_librbd_fsx tool. In the meantime, the (mis)feature can be safely disabled. It will default to off in 0.48. It is unclear whether it's really much of a performance win anyway. Thanks! sage

--
Oliver Francke
filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh
Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
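To spell out the workaround Sage describes, the change is a single line in the OSD section of ceph.conf followed by an OSD restart; a minimal sketch (the restart command depends on how the daemons are managed):

[osd]
filestore fiemap = false

# then restart the OSDs, e.g. on installations using the stock init script:
service ceph restart osd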
Re: stable 10GBE under 3.4.2 or 3.5.0-rc2
Did you use the same driver version with the different kernel versions? Maybe try to deactivate tso, gso, ... with ethtool?

----- Original Message -----
From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org
Sent: Monday, 18 June 2012 11:11:16
Subject: Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

On 18.06.2012 10:11, Alexandre DERUMIER wrote: maybe try to play with net.ipv4.tcp_congestion_control ?

The default on my machines, and even under RHEL6, is cubic. But I've now also tried reno, bic and highspeed, and it doesn't change anything. Everything is fine under 3.0.32 and pretty bad under 3.4.X or 3.5-rc2.

Stefan
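A small sketch of turning those offloads off with ethtool, as suggested (eth0 is a placeholder for the 10GbE interface):

# show the current offload settings
ethtool -k eth0
# disable TCP segmentation offload, generic segmentation offload and GRO
ethtool -K eth0 tso off gso off gro off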
Re: stable 10GBE under 3.4.2 or 3.5.0-rc2
On 18.06.2012 11:21, Alexandre DERUMIER wrote: Did you use the same driver version with the different kernel versions?

Yes, driver: ixgbe, version: 3.9.17-NAPI.

Maybe try to deactivate tso, gso, ... with ethtool?

Already tried that, no change; and remember, it works fine with 3.0.32.

Stefan
Re: stable 10GBE under 3.4.2 or 3.5.0-rc2
Also, I see a new feature in 3.3: byte_queue_limits.

1.5. Bufferbloat fighting: Byte queue limits https://lwn.net/Articles/454390/ Bufferbloat is a term used to describe the latency and throughput problems caused by excessive buffering through the several elements of a network connection. Some tools are being developed to help to alleviate these problems, and this feature is one of them. Byte queue limits are a configurable limit of packet data that can be put in the transmission queue of a network device. As a result one can tune things such that high priority packets get serviced with a reasonable amount of latency whilst not subjecting the hardware queue to emptying when data is available to send. Configuration of the queue limits is in the tx-n sysfs directory for the queue under the byte_queue_limits directory.

I see a bug report about spikes: http://www.mail-archive.com/e1000-devel@lists.sourceforge.net/msg05538.html

Maybe you can try to play with the values in /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*

----- Original Message -----
From: Alexandre DERUMIER aderum...@odiso.com
To: Stefan Priebe - Profihost AG s.pri...@profihost.ag
Cc: ceph-devel@vger.kernel.org
Sent: Monday, 18 June 2012 11:21:09
Subject: Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

Did you use the same driver version with the different kernel versions? Maybe try to deactivate tso, gso, ... with ethtool?

----- Original Message -----
From: Stefan Priebe - Profihost AG s.pri...@profihost.ag
To: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org
Sent: Monday, 18 June 2012 11:11:16
Subject: Re: stable 10GBE under 3.4.2 or 3.5.0-rc2

On 18.06.2012 10:11, Alexandre DERUMIER wrote: maybe try to play with net.ipv4.tcp_congestion_control ?

The default on my machines, and even under RHEL6, is cubic. But I've now also tried reno, bic and highspeed, and it doesn't change anything. Everything is fine under 3.0.32 and pretty bad under 3.4.X or 3.5-rc2.

Stefan

--
Alexandre Derumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
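A sketch of poking at those byte queue limits (eth0 and tx-0 are placeholders; which tunables exist depends on the kernel build):

# show the current BQL state for the first transmit queue
grep . /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
# cap how much data may sit in the device's transmit queue, e.g. 256 KB
echo 262144 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max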
Re: sync Ceph packaging efforts for Debian/Ubuntu
Hi Sage/Laszlo Laszlo - thanks for sending the original email - I'd like to get everything as closely in-sync as possible between the three packaging sources as well. On 16/06/12 22:50, Sage Weil wrote: I've take a closer look at these patches, and have a few questions. - The URL change and nss patches I've applied; they are in the ceph.git 'debian' branch. Great! - Has the leveldb patch been sent upstream? Once it is committed to the upstream git, we can update ceph to use it; that's nicer than carrying the patch. However, I thought you needed to link against the existing libleveldb1 package... which means we shouldn't do anything on our side, right? I can't see any evidence that this has been sent upstream; ideally we would be building against libleveldb1 rather than using the embedded copy - I'm not familiar with the reason that this has not happened already (if there is one). This package would also need to be reviewed for inclusion in main if that was the case. - I'm not sure how useful it is to break mount.ceph and cephfs into a separate ceph-fs-common package, but we can do it. Same goes for a separate package for ceph-mds. That was originally motivated by ubuntu not wanting the mds in main, but in the end only the libraries went in, so it's a moot point. I'd rather hear from them what their intentions are for 12.10 before complicating things... ceph-fs-common is in Ubuntu main; so I think the original motivation still stands IMHO. For the Ubuntu quantal cycle we still have the same primary objective as we had during 12.04; namely ensuring that Ceph RBD can be used as a block store for qemu-kvm which ties nicely into the Ubuntu OpenStack story through Cinder; In addition we will be looking at Ceph RADOS as a backend for Glance (see [0] for more details). The MIR for Ceph occurred quite late in the 12.04 cycle so we had to trim the scope to actually get it done; We will be looking at libfcgi and google-perftools this cycle for main inclusion to re-enable the components that are currently disabled in the Ubuntu packaging. - That same patch also switched all the Architecture: lines back to linux-any. Was that intentional? I just changed them from that last week. I think linux-any is correct - the change you have made would exclude the PPC architecture in Ubuntu and Debian. [...] Ben, James, can you please share in some sentences why ceph-fuse is dropped in Ubuntu? Do you need it Sage? If it's feasible, you may drop that as well. There is an outstanding question on the 12.04 MIR as to whether this package could still be built but not promoted to main - I'll follow up with the MIR reviewer as to whether that's possible as I don't think it requires any additional build dependencies. [...] I hope that explains the Ubuntu position on Ceph and what plans we have this development cycle. I expect Clint will chip in if I have missed anything. Cheers James [0] https://blueprints.launchpad.net/ubuntu/+spec/servercloud-q-ceph-object-integration -- James Page Ubuntu Core Developer Debian Maintainer james.p...@ubuntu.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: extend rbd ls to show size and free space?
Hi,

On 06/18/2012 02:08 PM, Stefan Priebe - Profihost AG wrote: Hello list, are there any plans to extend rbd ls in a way that it shows image size and free space of the pool? Stefan

You want something like:

$ rbd ls
NAME     SIZE
alpha     50G
beta     400G
charlie  150G

That is possible, but if you want to see the allocation of an image that will be harder, since RBD doesn't know which objects have been written to and which haven't. There is also no such thing as free pool space; you have to look at the free cluster space. But again, if your cluster has X TB of free space and you have multiple pools, your usage will depend on the amount of written data and the replication level. If you want to know the usage of the rbd pool I suggest using rados df.

The RBD tool could however be modified to show the image size if you give another flag:

$ rbd --extended ls

Wido
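Until such a flag exists, a sketch of how the same information can be pieced together with the existing tools (the image name is a placeholder):

# list the images in the rbd pool
rbd ls
# show the provisioned size and other metadata of one image
rbd info alpha
# show per-pool usage plus overall cluster space
rados df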
Re: extend rbd ls to show size and free space?
On 18.06.2012 14:47, Wido den Hollander wrote: Hi, Hello list, are there any plans to extend rbd ls in a way that it shows image size and free space of the pool? You want something like: $ rbd ls NAME SIZE alpha 50G beta 400G charlie 150G

Yes.

That is possible, but if you want to see the allocation of an image that will be harder, since RBD doesn't know which objects have been written to and which haven't.

Sure, I didn't mean that.

If you want to know the usage of the rbd pool I suggest using rados df.

Ah, OK.

The RBD tool could however be modified to show the image size if you give another flag. $ rbd --extended ls

That would be great.

Stefan
Re: extend rbd ls to show size and free space?
On 06/18/2012 03:03 PM, Stefan Priebe - Profihost AG wrote: On 18.06.2012 14:47, Wido den Hollander wrote: Hi, Hello list, are there any plans to extend rbd ls in a way that it shows image size and free space of the pool? You want something like: $ rbd ls NAME SIZE alpha 50G beta 400G charlie 150G Yes. That is possible, but if you want to see the allocation of an image that will be harder, since RBD doesn't know which objects have been written to and which haven't. Sure, I didn't mean that. If you want to know the usage of the rbd pool I suggest using rados df. Ah, OK. The RBD tool could however be modified to show the image size if you give another flag. $ rbd --extended ls That would be great.

I created an issue in the tracker for it: http://tracker.newdream.net/issues/2601

Stefan
Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?
On 6/18/12 7:34 AM, Alexandre DERUMIER wrote:

Hi, I'm doing tests with rados bench, and I see constant writes to the osd disks. Is that the normal behaviour? With write-ahead, should writes occur every 20-30 seconds?

Cluster is 3 nodes (ubuntu precise - glibc 2.14 - ceph 0.47.2), each node with 1 journal on tmpfs (8GB), 1 osd (xfs) on a SAS disk, and a 1 gigabit link. An 8GB journal can easily handle 20s of writes (1 gigabit link).

[osd]
osd data = /srv/osd.$id
osd journal = /tmpfs/osd.$id.journal
osd journal size = 8000
journal dio = false
filestore journal parallel = false
filestore journal writeahead = true
filestore fiemap = false

I have done tests with different kernels (3.0, 3.2, 3.4), different filesystems (xfs, btrfs, ext4), and the journal mode forced to writeahead. Benches were done with rados bench and fio. I always see constant writes from the first second of the bench. Any idea?

Hi Alex,

Sorry I got behind at looking at your output last week. I've created a seekwatcher movie of your blktrace results here: http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg

The results match up well with your iostat output. Peaks and valleys in the writes every couple of seconds. Low numbers of seeks, so probably not limited by the filestore (a quick osd tell X bench might confirm that).

I'm wondering if you increase filestore max sync interval to something bigger (default is 5s) if you'd see somewhat different behavior. Maybe try something like 30s and see what happens?

Mark
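A sketch of the two suggestions above (the OSD id and the 30s value are only examples):

# ask osd.0 to run its built-in backend write benchmark;
# the result shows up in the cluster log / 'ceph -w'
ceph osd tell 0 bench

# and in ceph.conf, raise the filestore sync interval:
[osd]
filestore max sync interval = 30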
Re: stable 10GBE under 3.4.2 or 3.5.0-rc2
On 6/18/12 2:40 AM, Stefan Priebe - Profihost AG wrote: Hi list, I still have problems with stable network speed under recent kernels. With 3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or 3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea when this happens. It sometimes works for minutes at 9.9 Gbit/s and then suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels. I'm using various tunings recommended by Intel (Improving Performance) = http://downloadmirror.intel.com/5874/eng/README.txt

Hi Stefan,

Did you ever get a chance to talk to Jason Wang about the commit that was causing the problems? It might be a good idea to report all of this upstream and see what they have to say.

Mark
Re: stable 10GBE under 3.4.2 or 3.5.0-rc2
On 18.06.2012 15:56, Mark Nelson wrote: On 6/18/12 2:40 AM, Stefan Priebe - Profihost AG wrote: Hi list, I still have problems with stable network speed under recent kernels. With 3.0.32 I get a stable 9.90 Gbit/s in both directions. With 3.4.2 or 3.5.0-rc2 it sometimes drops down to around 1 Gbit/s. Sadly I have no idea when this happens. It sometimes works for minutes at 9.9 Gbit/s and then suddenly runs at only 3-4 Gbit/s or even 1 Gbit/s on recent kernels. I'm using various tunings recommended by Intel (Improving Performance) = http://downloadmirror.intel.com/5874/eng/README.txt

Hi Stefan, Did you ever get a chance to talk to Jason Wang about the commit that was causing the problems? It might be a good idea to report all of this upstream and see what they have to say.

Hi Mark,

Yes, I talked to him and he told me that his change was partially removed in more recent kernel versions. Right now I'm in discussion with Eric Dumazet on the netdev kernel mailing list, and he has given me some good advice which seems to work well. I'm still testing and will share the results here.

Stefan
Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?
Hi Mark,

Sorry I got behind at looking at your output last week. I've created a seekwatcher movie of your blktrace results here: http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg

How do you create a seekwatcher movie from blktrace? (I'd like to create them myself, seems good for debugging.)

The results match up well with your iostat output. Peaks and valleys in the writes every couple of seconds. Low numbers of seeks, so probably not limited by the filestore (a quick osd tell X bench might confirm that).

Yet, I'm pretty sure that the limitation is not hardware. (Each osd is a 15k drive handling around 10MB/s during the test, so I think it should be ok ^_^.) How do you use osd tell X bench?

I'm wondering if you increase filestore max sync interval to something bigger (default is 5s) if you'd see somewhat different behavior. Maybe try something like 30s and see what happens?

I have done a test with 30s; that doesn't change anything. I have also tried with filestore min sync interval = 29 + filestore max sync interval = 30.

----- Original Message -----
From: Mark Nelson mark.nel...@inktank.com
To: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org
Sent: Monday, 18 June 2012 15:29:58
Subject: Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?

On 6/18/12 7:34 AM, Alexandre DERUMIER wrote: Hi, I'm doing tests with rados bench, and I see constant writes to the osd disks. Is that the normal behaviour? With write-ahead, should writes occur every 20-30 seconds? Cluster is 3 nodes (ubuntu precise - glibc 2.14 - ceph 0.47.2), each node with 1 journal on tmpfs (8GB), 1 osd (xfs) on a SAS disk, and a 1 gigabit link. An 8GB journal can easily handle 20s of writes (1 gigabit link). [osd] osd data = /srv/osd.$id osd journal = /tmpfs/osd.$id.journal osd journal size = 8000 journal dio = false filestore journal parallel = false filestore journal writeahead = true filestore fiemap = false I have done tests with different kernels (3.0, 3.2, 3.4), different filesystems (xfs, btrfs, ext4), and the journal mode forced to writeahead. Benches were done with rados bench and fio. I always see constant writes from the first second of the bench. Any idea?

Hi Alex, Sorry I got behind at looking at your output last week. I've created a seekwatcher movie of your blktrace results here: http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg The results match up well with your iostat output. Peaks and valleys in the writes every couple of seconds. Low numbers of seeks, so probably not limited by the filestore (a quick osd tell X bench might confirm that). I'm wondering if you increase filestore max sync interval to something bigger (default is 5s) if you'd see somewhat different behavior. Maybe try something like 30s and see what happens? Mark

--
Alexandre Derumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?
On 6/18/12 9:04 AM, Alexandre DERUMIER wrote:

Hi Mark,

Sorry I got behind at looking at your output last week. I've created a seekwatcher movie of your blktrace results here: http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg

How do you create a seekwatcher movie from blktrace? (I'd like to create them myself, seems good for debugging.)

You'll need to download seekwatcher from Chris Mason's website. Get the newest unstable version. To make movies you'll need mencoder. (It also needs numpy and matplotlib.) There is a small bug in the code where /dev/null should be changed to /dev/null 2>&1. If you have trouble let me know and I can send you a fixed version of the script.

The results match up well with your iostat output. Peaks and valleys in the writes every couple of seconds. Low numbers of seeks, so probably not limited by the filestore (a quick osd tell X bench might confirm that).

Yet, I'm pretty sure that the limitation is not hardware. (Each osd is a 15k drive handling around 10MB/s during the test, so I think it should be ok ^_^.) How do you use osd tell X bench?

Yeah, I just wanted to make sure that the constant writes weren't because the filestore was falling behind. You may want to take a look at some of the information that is provided by the admin socket for the OSD while the test is running. dump_ops_in_flight, perf schema, and perf dump are all useful. Try: ceph --admin-daemon <socket> help. The osd admin sockets should be available in /var/run/ceph.

I'm wondering if you increase filestore max sync interval to something bigger (default is 5s) if you'd see somewhat different behavior. Maybe try something like 30s and see what happens?

I have done a test with 30s; that doesn't change anything. I have also tried with filestore min sync interval = 29 + filestore max sync interval = 30.

Nuts. Do you still see the little peaks/valleys every couple seconds?

----- Original Message -----
From: Mark Nelson mark.nel...@inktank.com
To: Alexandre DERUMIER aderum...@odiso.com
Cc: ceph-devel@vger.kernel.org
Sent: Monday, 18 June 2012 15:29:58
Subject: Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?

On 6/18/12 7:34 AM, Alexandre DERUMIER wrote: Hi, I'm doing tests with rados bench, and I see constant writes to the osd disks. Is that the normal behaviour? With write-ahead, should writes occur every 20-30 seconds? Cluster is 3 nodes (ubuntu precise - glibc 2.14 - ceph 0.47.2), each node with 1 journal on tmpfs (8GB), 1 osd (xfs) on a SAS disk, and a 1 gigabit link. An 8GB journal can easily handle 20s of writes (1 gigabit link). [osd] osd data = /srv/osd.$id osd journal = /tmpfs/osd.$id.journal osd journal size = 8000 journal dio = false filestore journal parallel = false filestore journal writeahead = true filestore fiemap = false I have done tests with different kernels (3.0, 3.2, 3.4), different filesystems (xfs, btrfs, ext4), and the journal mode forced to writeahead. Benches were done with rados bench and fio. I always see constant writes from the first second of the bench. Any idea?

Hi Alex, Sorry I got behind at looking at your output last week. I've created a seekwatcher movie of your blktrace results here: http://nhm.ceph.com/movies/mailinglist-tests/alex-test-3.4.mpg The results match up well with your iostat output. Peaks and valleys in the writes every couple of seconds. Low numbers of seeks, so probably not limited by the filestore (a quick osd tell X bench might confirm that).
I'm wondering if you increase filestore max sync interval to something bigger (default is 5s) if you'd see somewhat different behavior. Maybe try something like 30s and see what happens? Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
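For anyone wanting to reproduce such a movie, a rough sketch of the blktrace/seekwatcher workflow Mark describes (device and file names are placeholders; seekwatcher needs mencoder, numpy and matplotlib as noted above):

# capture a trace of the OSD data disk while the benchmark runs
blktrace -d /dev/sdb -o osd-trace
# afterwards, render a graph ...
seekwatcher -t osd-trace -o osd-trace.png
# ... or a movie
seekwatcher -t osd-trace -o osd-trace.mpg --movie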
Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?
Yeah, I just wanted to make sure that the constant writes weren't because the filestore was falling behind. You may want to take a look at some of the information that is provided by the admin socket for the OSD while the test is running. dump_ops_in_flight, perf schema, and perf dump are all useful. don't know which values to check in these big json reponses ;) But I have try with more osd, so write are splitted on more disks and and write are smaller, and the behaviour is same root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok dump_ops_in_flight { num_ops: 1, ops: [ { description: osd_op(client.4179.0:83 kvmtest1_1006560_object82 [write 0~4194304] 3.9f5c55af), received_at: 2012-06-18 16:41:17.995167, age: 0.406678, flag_point: waiting for sub ops, client_info: { client: client.4179, tid: 83}}]} root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok perfcounters_dump {filestore:{journal_queue_max_ops:500,journal_queue_ops:0,journal_ops:2198,journal_queue_max_bytes:104857600,journal_queue_bytes:0,journal_bytes:1012769525,journal_latency:{avgcount:2198,sum:3.13569},op_queue_max_ops:500,op_queue_ops:0,ops:2198,op_queue_max_bytes:104857600,op_queue_bytes:0,bytes:1012757330,apply_latency:{avgcount:2198,sum:290.27},committing:0,commitcycle:59,commitcycle_interval:{avgcount:59,sum:300.04},commitcycle_latency:{avgcount:59,sum:4.76299},journal_full:0},osd:{opq:0,op_wip:0,op:127,op_in_bytes:532692449,op_out_bytes:0,op_latency:{avgcount:127,sum:49.2627},op_r:0,op_r_out_bytes:0,op_r_latency:{avgcount:0,sum:0},op_w:127,op_w_in_bytes:532692449,op_w_rlat:{avgcount:127,sum:0},op_w_latency:{avgcount:127,sum:49.2627},op_rw:0,op_rw_in_bytes:0,op_rw_out_bytes:0,op_rw_rlat:{avgcount:0,sum:0},op_rw_latency:{avgcount:0,sum:0},subop:114,subop_in_bytes:478212311,subop_latency:{avgcount:114,sum:8.82174},subop_w:0,subop_w_in_bytes:478212311,subop_w_latency:{avgcount:114,sum:8.82174},subop_pull:0,subop_pull_latency:{avgcount:0,sum:0},subop_push:0,subop_push_in_bytes:0,subop_push_latency:{avgcount:0,sum:0},pull:0,push:0,push_out_bytes:0,recovery_ops:0,loadavg:0.47,buffer_bytes:0,numpg:423,numpg_primary:259,numpg_replica:164,numpg_stray:0,heartbeat_to_peers:10,heartbeat_from_peers:0,map_messages:34,map_message_epochs:44,map_message_epoch_dups:24},throttle-filestore_bytes:{val:0,max:104857600,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:1012769525,put:1503,put_sum:1012769525,wait:{avgcount:0,sum:0}},throttle-filestore_ops:{val:0,max:500,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:2198,put:1503,put_sum:2198,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-client:{val:4194469,max:104857600,get:243,get_sum:536987810,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:242,put_sum:532793341,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-cluster:{val:0,max:104857600,get:1480,get_sum:482051948,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1480,put_sum:482051948,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbclient:{val:0,max:104857600,get:1077,get_sum:50619,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1077,put_sum:50619,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbserver:{val:0,max:104857600,get:972,get_sum:45684,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:972,put_sum:45684,wait:{avgcount:0,sum:0}},throttle-osd_client_bytes:{val:4194469,max:524288000,get:128,get_sum:536892019,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_
sum:0,put:254,put_sum:532697550,wait:{avgcount:0,sum:0}}} root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok perfcounters_schema
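One way to make those admin-socket dumps easier to read is to pretty-print the JSON; a small sketch using the socket path from the thread:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perfcounters_dump | python -mjson.tool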
Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?
forget to send iostat -x 1 trace (osd's are on sdb,sbc,sdd,sde,sdf) Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,0055,000,00 31,00 0,00 1468,0094,71 0,216,770,006,77 5,16 16,00 sdb 0,00 0,000,00 74,00 0,00 20516,00 554,49 2,74 38,780,00 38,78 3,51 26,00 sdc 0,00 0,000,00 57,00 0,00 15520,00 544,56 1,77 28,600,00 28,60 3,68 21,00 sdd 0,00 0,000,00 16,00 0,00 4108,00 513,50 0,52 32,500,00 32,50 4,38 7,00 sde 0,00 0,000,00 15,00 0,00 4104,00 547,20 0,48 32,000,00 32,00 4,00 6,00 sdf 0,00 0,000,00 46,00 0,00 12316,00 535,48 1,42 30,870,00 30,87 3,70 17,00 avg-cpu: %user %nice %system %iowait %steal %idle 1,550,007,091,160,00 90,21 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,0037,000,00 20,00 0,00 236,0023,60 0,126,000,006,00 5,00 10,00 sdb 0,00 0,000,00 41,00 0,00 10780,00 525,85 1,03 21,460,00 21,46 3,66 15,00 sdc 0,00 0,000,00 78,00 0,00 21416,00 549,13 3,20 42,820,00 42,82 3,08 24,00 sdd 0,0018,000,00 121,00 0,00 24859,00 410,89 3,00 24,790,00 24,79 3,06 37,00 sde 0,00 0,000,000,00 0,00 0,00 0,00 0,000,000,000,00 0,00 0,00 sdf 0,0015,000,00 75,00 0,00 12521,00 333,89 2,12 28,270,00 28,27 3,47 26,00 avg-cpu: %user %nice %system %iowait %steal %idle 2,510,006,521,380,00 89,59 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,0030,000,00 19,00 0,00 204,0021,47 0,105,260,005,26 5,26 10,00 sdb 0,0023,000,00 105,00 0,00 18281,50 348,22 3,92 38,670,00 38,67 3,33 35,00 sdc 0,00 0,000,00 31,00 0,00 8212,00 529,81 0,89 28,710,00 28,71 3,87 12,00 sdd 0,00 0,000,00 45,00 0,00 12312,00 547,20 1,35 30,000,00 30,00 3,78 17,00 sde 0,0017,000,00 42,00 0,00 4308,00 205,14 1,14 27,140,00 27,14 3,33 14,00 sdf 0,00 0,000,00 45,00 0,00 12312,00 547,20 1,33 29,560,00 29,56 3,78 17,00 avg-cpu: %user %nice %system %iowait %steal %idle 2,280,004,310,000,00 93,41 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,000,000,00 0,00 0,00 0,00 0,000,000,000,00 0,00 0,00 sdb 0,00 0,000,00 29,00 0,00 8204,00 565,79 0,89 31,030,00 31,03 3,45 10,00 sdc 0,0021,000,00 85,00 0,00 12627,50 297,12 2,66 31,290,00 31,29 2,94 25,00 sdd 0,00 0,000,00 16,00 0,00 4108,00 513,50 0,45 28,120,00 28,12 4,38 7,00 sde 0,00 0,000,00 75,00 0,00 20520,00 547,20 2,32 30,930,00 30,93 3,47 26,00 sdf 0,00 0,000,00 17,00 0,00 4112,00 483,76 0,39 22,940,00 22,94 2,94 5,00 avg-cpu: %user %nice %system %iowait %steal %idle 1,920,008,971,540,00 87,56 Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,0051,000,00 32,00 0,00 1432,0089,50 0,217,190,007,19 5,00 16,00 sdb 0,00 0,000,00 60,00 0,00 16416,00 547,20 1,59 26,500,00 26,50 3,33 20,00 sdc 0,00 0,000,00 48,00 0,00 12324,00 513,50 1,41 23,960,00 23,96 3,54 17,00 sdd 0,00 0,000,00 31,00 0,00 8212,00 529,81 0,79 25,480,00 25,48 3,23 10,00 sde 0,00 0,000,00 66,00 0,00 17704,00 536,48 2,96 40,760,00 40,76 3,79 25,00 sdf 0,00 0,000,00 46,00 0,00 12316,00 535,48 1,33 28,910,00 28,91 3,91 18,00 avg-cpu: %user %nice %system %iowait %steal %idle 2,290,005,221,660,00 90,83 Device: rrqm/s wrqm/s r/s w/s
Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?
Forget to say: The blktrace of the osd was done with 15 osd on 3 nodes. So the peak and valley could come from rbd block distribution. I have done same test with 1 osd by node with 3 nodes, I have around 60MB/S by disk (with same behaviour) So this is not a bottleneck. I'm going to do some blktrace and seekwatcher move with 1 osd by node. - Mail original - De: Alexandre DERUMIER aderum...@odiso.com À: Mark Nelson mark.nel...@inktank.com Cc: ceph-devel@vger.kernel.org Envoyé: Lundi 18 Juin 2012 16:50:57 Objet: Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ? forget to send iostat -x 1 trace (osd's are on sdb,sbc,sdd,sde,sdf) Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 55,00 0,00 31,00 0,00 1468,00 94,71 0,21 6,77 0,00 6,77 5,16 16,00 sdb 0,00 0,00 0,00 74,00 0,00 20516,00 554,49 2,74 38,78 0,00 38,78 3,51 26,00 sdc 0,00 0,00 0,00 57,00 0,00 15520,00 544,56 1,77 28,60 0,00 28,60 3,68 21,00 sdd 0,00 0,00 0,00 16,00 0,00 4108,00 513,50 0,52 32,50 0,00 32,50 4,38 7,00 sde 0,00 0,00 0,00 15,00 0,00 4104,00 547,20 0,48 32,00 0,00 32,00 4,00 6,00 sdf 0,00 0,00 0,00 46,00 0,00 12316,00 535,48 1,42 30,87 0,00 30,87 3,70 17,00 avg-cpu: %user %nice %system %iowait %steal %idle 1,55 0,00 7,09 1,16 0,00 90,21 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 37,00 0,00 20,00 0,00 236,00 23,60 0,12 6,00 0,00 6,00 5,00 10,00 sdb 0,00 0,00 0,00 41,00 0,00 10780,00 525,85 1,03 21,46 0,00 21,46 3,66 15,00 sdc 0,00 0,00 0,00 78,00 0,00 21416,00 549,13 3,20 42,82 0,00 42,82 3,08 24,00 sdd 0,00 18,00 0,00 121,00 0,00 24859,00 410,89 3,00 24,79 0,00 24,79 3,06 37,00 sde 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 sdf 0,00 15,00 0,00 75,00 0,00 12521,00 333,89 2,12 28,27 0,00 28,27 3,47 26,00 avg-cpu: %user %nice %system %iowait %steal %idle 2,51 0,00 6,52 1,38 0,00 89,59 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 30,00 0,00 19,00 0,00 204,00 21,47 0,10 5,26 0,00 5,26 5,26 10,00 sdb 0,00 23,00 0,00 105,00 0,00 18281,50 348,22 3,92 38,67 0,00 38,67 3,33 35,00 sdc 0,00 0,00 0,00 31,00 0,00 8212,00 529,81 0,89 28,71 0,00 28,71 3,87 12,00 sdd 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,35 30,00 0,00 30,00 3,78 17,00 sde 0,00 17,00 0,00 42,00 0,00 4308,00 205,14 1,14 27,14 0,00 27,14 3,33 14,00 sdf 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,33 29,56 0,00 29,56 3,78 17,00 avg-cpu: %user %nice %system %iowait %steal %idle 2,28 0,00 4,31 0,00 0,00 93,41 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 sdb 0,00 0,00 0,00 29,00 0,00 8204,00 565,79 0,89 31,03 0,00 31,03 3,45 10,00 sdc 0,00 21,00 0,00 85,00 0,00 12627,50 297,12 2,66 31,29 0,00 31,29 2,94 25,00 sdd 0,00 0,00 0,00 16,00 0,00 4108,00 513,50 0,45 28,12 0,00 28,12 4,38 7,00 sde 0,00 0,00 0,00 75,00 0,00 20520,00 547,20 2,32 30,93 0,00 30,93 3,47 26,00 sdf 0,00 0,00 0,00 17,00 0,00 4112,00 483,76 0,39 22,94 0,00 22,94 2,94 5,00 avg-cpu: %user %nice %system %iowait %steal %idle 1,92 0,00 8,97 1,54 0,00 87,56 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 51,00 0,00 32,00 0,00 1432,00 89,50 0,21 7,19 0,00 7,19 5,00 16,00 sdb 0,00 0,00 0,00 60,00 0,00 16416,00 547,20 1,59 26,50 0,00 26,50 3,33 20,00 sdc 0,00 0,00 0,00 48,00 0,00 12324,00 513,50 1,41 23,96 0,00 23,96 3,54 
17,00 sdd 0,00 0,00 0,00 31,00 0,00 8212,00 529,81 0,79 25,48 0,00 25,48 3,23 10,00 sde 0,00 0,00 0,00 66,00 0,00 17704,00 536,48 2,96 40,76 0,00 40,76 3,79 25,00 sdf 0,00 0,00 0,00 46,00 0,00 12316,00 535,48 1,33 28,91 0,00 28,91 3,91 18,00 avg-cpu: %user %nice %system %iowait %steal %idle 2,29 0,00 5,22 1,66 0,00 90,83 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 51,00 0,00 30,00 0,00 1460,00 97,33 0,15 5,00 0,00 5,00 4,67 14,00 sdb 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,31 29,11 0,00 29,11 3,78 17,00 sdc 0,00 0,00 0,00 29,00 0,00 8204,00 565,79 0,62 30,34 0,00 30,34 3,45 10,00 sdd 0,00 0,00 0,00 33,00 0,00 8220,00 498,18 1,13 30,30 0,00 30,30 4,24 14,00 sde 0,00 0,00 0,00 40,00 0,00 11028,00 551,40 0,91 29,50 0,00 29,50 3,50 14,00 sdf 0,00 0,00 0,00 64,00 0,00 16432,00 513,50 1,69 26,41 0,00 26,41 3,91 25,00 avg-cpu: %user %nice %system %iowait %steal %idle 1,93 0,00 6,05 1,93 0,00 90,09 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 34,00 0,00 19,00 0,00 220,00 23,16 0,11 5,79 0,00 5,79 5,79 11,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 sdc 0,00 0,00 0,00 45,00 0,00 12312,00 547,20 1,13 25,11 0,00 25,11 3,33 15,00 sdd 0,00 25,00 0,00 110,00 0,00 20841,00 378,93 3,39
Re: iostat shows constant writes to osd disk with writeahead journal, normal behaviour?
On 6/18/12 9:47 AM, Alexandre DERUMIER wrote: Yeah, I just wanted to make sure that the constant writes weren't because the filestore was falling behind. You may want to take a look at some of the information that is provided by the admin socket for the OSD while the test is running. dump_ops_in_flight, perf schema, and perf dump are all useful. don't know which values to check in these big json reponses ;) But I have try with more osd, so write are splitted on more disks and and write are smaller, and the behaviour is same No worries, there is a lot of data there! root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok dump_ops_in_flight { num_ops: 1, ops: [ { description: osd_op(client.4179.0:83 kvmtest1_1006560_object82 [write 0~4194304] 3.9f5c55af), received_at: 2012-06-18 16:41:17.995167, age: 0.406678, flag_point: waiting for sub ops, client_info: { client: client.4179, tid: 83}}]} root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok perfcounters_dump {filestore:{journal_queue_max_ops:500,journal_queue_ops:0,journal_ops:2198,journal_queue_max_bytes:104857600,journal_queue_bytes:0,journal_bytes:1012769525,journal_latency:{avgcount:2198,sum:3.13569},op_queue_max_ops:500,op_queue_ops:0,ops:2198,op_queue_max_bytes:104857600,op_queue_bytes:0,bytes:1012757330,apply_latency:{avgcount:2198,sum:290.27},committing:0,commitcycle:59,commitcycle_interval:{avgcount:59,sum:300.04},commitcycle_latency:{avgcount:59,sum:4.76299},journal_full:0},osd:{opq:0,op_wip:0,op:127,op_in_bytes:532692449,op_out_bytes:0,op_latency:{avgcount:127,sum:49.2627},op_r:0,op_r_out_bytes:0,op_r_latency:{avgcount:0,sum:0},op_w:127,op_w_in_bytes:532692449,op_w_rlat:{avgcount:127,sum:0},op_w_latency:{avgcount:127,sum:49.2627},op_rw:0,op_rw_in_bytes:0,op_rw_out_bytes:0,op_rw_rlat:{avgcount:0,sum:0},op_rw_latency:{avgcount:0,sum:0},subop:114,subop_in_byte s:478212311,subop_latency:{avgcount:114,sum:8.82174},subop_w:0,subop_w_in_bytes:478212311,subop_w_latency:{avgcount:114,sum:8.82174},subop_pull:0,subop_pull_latency:{avgcount:0,sum:0},subop_push:0,subop_push_in_bytes:0,subop_push_latency:{avgcount:0,sum:0},pull:0,push:0,push_out_bytes:0,recovery_ops:0,loadavg:0.47,buffer_bytes:0,numpg:423,numpg_primary:259,numpg_replica:164,numpg_stray:0,heartbeat_to_peers:10,heartbeat_from_peers:0,map_messages:34,map_message_epochs:44,map_message_epoch_dups:24},throttle-filestore_bytes:{val:0,max:104857600,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:1012769525,put:1503,put_sum:1012769525,wait:{avgcount:0,sum:0}},throttle-filestore_ops:{val:0,max:500,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:2198,take_sum:2198,put:1503,put_sum:2198,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_t 
hrottler-client:{val:4194469,max:104857600,get:243,get_sum:536987810,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:242,put_sum:532793341,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-cluster:{val:0,max:104857600,get:1480,get_sum:482051948,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1480,put_sum:482051948,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbclient:{val:0,max:104857600,get:1077,get_sum:50619,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1077,put_sum:50619,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbserver:{val:0,max:104857600,get:972,get_sum:45684,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:972,put_sum:45684,wait:{avgcount:0,sum:0}},throttle-osd_client_bytes:{val:4194469,max:524288000,get:128,get_sum:536892019,get_or_fail_fail:0,get_or_fail_su ccess:0,take:0,take_sum:0,put:254,put_sum:532697550,wait:{avgcount:0,sum:0}}} root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok perfcounters_schema {filestore:{journal_queue_max_ops:{type:2},journal_queue_ops:{type:2},journal_ops:{type:10},journal_queue_max_bytes:{type:2},journal_queue_bytes:{type:2},journal_bytes:{type:10},journal_latency:{type:5},op_queue_max_ops:{type:2},op_queue_ops:{type:2},ops:{type:10},op_queue_max_bytes:{type:2},op_queue_bytes:{type:2},bytes:{type:10},apply_latency:{type:5},committing:{type:2},commitcycle:{type:10},commitcycle_interval:{type:5},commitcycle_latency:{type:5},journal_full:{type:10}},osd:{opq:{type:2},op_wip:{type:2},op:{type:10},op_in_bytes:{type:10},op_out_bytes:{type:10},op_latency:{type:5},op_r:{type:10},op_r_out_bytes:{type:10},op_r_latency:{type:5},op_w:{type:10},op_w_in_bytes:{type:10},op_w_rlat:{type:5},op_w_latency:{type:5},op_rw:{type:10},op_rw_in_bytes:{type:10},op_rw_out_bytes:{type:10},op_rw_rlat:{type:5},op_rw_latency:{type:5},
[PATCH 0/6] ceph: a few more messenger cleanups
Here are a few more messenger cleanup patches. [PATCH 1/6] libceph: encapsulate out message data setup [PATCH 2/6] libceph: encapsulate advancing msg page These two encapsulate some code involved in sending message data into separate functions. [PATCH 3/6] libceph: don't mark footer complete before it is This moves the setting of the FOOTER_COMPLETE flag so that it doesn't get done until the footer really is complete. [PATCH 4/6] libceph: move init_bio_*() functions up This simply moves two functions, preparing for the next patch. [PATCH 5/6] libceph: move init of bio_iter This makes a message's bio_iter field get initialized when the rest of the message is initialized, rather than conditionally every time any attempt is made to send message data. [PATCH 6/6] libceph: don't use bio_iter as a flag Because bio_iter is now initialized in the right place we no longer need to use its value as a flag to determine whether it needs initialization. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: all rbd users: set 'filestore fiemap = false'
On Mon, 18 Jun 2012, Christoph Hellwig wrote: On Sun, Jun 17, 2012 at 09:02:15PM -0700, Sage Weil wrote: that data over the wire. We have observed incorrect/changing FIEMAP on both btrfs: both btrfs and?

Whoops, it was XFS. :/

Btw, btrfs had SEEK_HOLE/SEEK_DATA, which is a lot more useful for this kind of operation, and xfs has added support for it as well now.

Yeah, started looking at that last night. (This code predates SEEK_HOLE.)

sage
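For anyone wanting to see what FIEMAP reports for a given object file, a small sketch using filefrag, which queries FIEMAP under the hood (the path is a placeholder):

# print the extent mapping the filesystem returns for one file
filefrag -v /path/to/osd/object-file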
[PATCH 2/6] libceph: encapsulate advancing msg page
In write_partial_msg_pages(), once all the data from a page has been sent we advance to the next one. Put the code that takes care of this into its own function. While modifying write_partial_msg_pages(), make its local variable in_trail be Boolean, and use the local variable msg (which is just the connection's current out_msg pointer) consistently. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 58 +-- 1 file changed, 34 insertions(+), 24 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -915,6 +915,33 @@ static void iter_bio_next(struct bio **b } #endif +static void out_msg_pos_next(struct ceph_connection *con, struct page *page, + size_t len, size_t sent, bool in_trail) +{ + struct ceph_msg *msg = con-out_msg; + + BUG_ON(!msg); + BUG_ON(!sent); + + con-out_msg_pos.data_pos += sent; + con-out_msg_pos.page_pos += sent; + if (sent == len) { + con-out_msg_pos.page_pos = 0; + con-out_msg_pos.page++; + con-out_msg_pos.did_page_crc = false; + if (in_trail) + list_move_tail(page-lru, + msg-trail-head); + else if (msg-pagelist) + list_move_tail(page-lru, + msg-pagelist-head); +#ifdef CONFIG_BLOCK + else if (msg-bio) + iter_bio_next(msg-bio_iter, msg-bio_seg); +#endif + } +} + /* * Write as much message data payload as we can. If we finish, queue * up the footer. @@ -930,11 +957,11 @@ static int write_partial_msg_pages(struc bool do_datacrc = !con-msgr-nocrc; int ret; int total_max_write; - int in_trail = 0; + bool in_trail = false; size_t trail_len = (msg-trail ? msg-trail-length : 0); dout(write_partial_msg_pages %p msg %p page %d/%d offset %d\n, -con, con-out_msg, con-out_msg_pos.page, con-out_msg-nr_pages, +con, msg, con-out_msg_pos.page, msg-nr_pages, con-out_msg_pos.page_pos); #ifdef CONFIG_BLOCK @@ -958,13 +985,12 @@ static int write_partial_msg_pages(struc /* have we reached the trail part of the data? */ if (con-out_msg_pos.data_pos = data_len - trail_len) { - in_trail = 1; + in_trail = true; total_max_write = data_len - con-out_msg_pos.data_pos; page = list_first_entry(msg-trail-head, struct page, lru); - max_write = PAGE_SIZE; } else if (msg-pages) { page = msg-pages[con-out_msg_pos.page]; } else if (msg-pagelist) { @@ -988,14 +1014,14 @@ static int write_partial_msg_pages(struc if (do_datacrc !con-out_msg_pos.did_page_crc) { void *base; u32 crc; - u32 tmpcrc = le32_to_cpu(con-out_msg-footer.data_crc); + u32 tmpcrc = le32_to_cpu(msg-footer.data_crc); char *kaddr; kaddr = kmap(page); BUG_ON(kaddr == NULL); base = kaddr + con-out_msg_pos.page_pos + bio_offset; crc = crc32c(tmpcrc, base, len); - con-out_msg-footer.data_crc = cpu_to_le32(crc); + msg-footer.data_crc = cpu_to_le32(crc); con-out_msg_pos.did_page_crc = true; } ret = ceph_tcp_sendpage(con-sock, page, @@ -1008,30 +1034,14 @@ static int write_partial_msg_pages(struc if (ret = 0) goto out; - con-out_msg_pos.data_pos += ret; - con-out_msg_pos.page_pos += ret; - if (ret == len) { - con-out_msg_pos.page_pos = 0; - con-out_msg_pos.page++; - con-out_msg_pos.did_page_crc = false; - if (in_trail) - list_move_tail(page-lru, - msg-trail-head); - else if (msg-pagelist) - list_move_tail(page-lru, - msg-pagelist-head); -#ifdef CONFIG_BLOCK - else if (msg-bio) - iter_bio_next(msg-bio_iter, msg-bio_seg); -#endif - } + out_msg_pos_next(con, page, len, (size_t) ret, in_trail); } dout(write_partial_msg_pages %p msg %p done\n, con, msg); /* prepare and queue up footer, too */
[PATCH 4/6] libceph: move init_bio_*() functions up
Move init_bio_iter() and iter_bio_next() up in their source file so the'll be defined before they're needed. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 50 +- 1 file changed, 25 insertions(+), 25 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -590,6 +590,31 @@ static void con_out_kvec_add(struct ceph con-out_kvec_bytes += size; } +#ifdef CONFIG_BLOCK +static void init_bio_iter(struct bio *bio, struct bio **iter, int *seg) +{ + if (!bio) { + *iter = NULL; + *seg = 0; + return; + } + *iter = bio; + *seg = bio-bi_idx; +} + +static void iter_bio_next(struct bio **bio_iter, int *seg) +{ + if (*bio_iter == NULL) + return; + + BUG_ON(*seg = (*bio_iter)-bi_vcnt); + + (*seg)++; + if (*seg == (*bio_iter)-bi_vcnt) + init_bio_iter((*bio_iter)-bi_next, bio_iter, seg); +} +#endif + static void prepare_write_message_data(struct ceph_connection *con) { struct ceph_msg *msg = con-out_msg; @@ -892,31 +917,6 @@ out: return ret; /* done! */ } -#ifdef CONFIG_BLOCK -static void init_bio_iter(struct bio *bio, struct bio **iter, int *seg) -{ - if (!bio) { - *iter = NULL; - *seg = 0; - return; - } - *iter = bio; - *seg = bio-bi_idx; -} - -static void iter_bio_next(struct bio **bio_iter, int *seg) -{ - if (*bio_iter == NULL) - return; - - BUG_ON(*seg = (*bio_iter)-bi_vcnt); - - (*seg)++; - if (*seg == (*bio_iter)-bi_vcnt) - init_bio_iter((*bio_iter)-bi_next, bio_iter, seg); -} -#endif - static void out_msg_pos_next(struct ceph_connection *con, struct page *page, size_t len, size_t sent, bool in_trail) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] libceph: move init of bio_iter
If a message has a non-null bio pointer, its bio_iter field is initialized in write_partial_msg_pages() if this has not been done already. This is really a one-time setup operation for sending a message's (bio) data, so move that initialization code into prepare_write_message_data() which serves that purpose. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |9 - 1 file changed, 4 insertions(+), 5 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -627,6 +627,10 @@ static void prepare_write_message_data(s con-out_msg_pos.page_pos = msg-page_alignment; else con-out_msg_pos.page_pos = 0; +#ifdef CONFIG_BLOCK + if (msg-bio !msg-bio_iter) + init_bio_iter(msg-bio, msg-bio_iter, msg-bio_seg); +#endif con-out_msg_pos.data_pos = 0; con-out_msg_pos.did_page_crc = false; con-out_more = 1; /* data + footer will follow */ @@ -966,11 +970,6 @@ static int write_partial_msg_pages(struc con, msg, con-out_msg_pos.page, msg-nr_pages, con-out_msg_pos.page_pos); -#ifdef CONFIG_BLOCK - if (msg-bio !msg-bio_iter) - init_bio_iter(msg-bio, msg-bio_iter, msg-bio_seg); -#endif - while (data_len con-out_msg_pos.data_pos) { struct page *page = NULL; int max_write = PAGE_SIZE; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] libceph: don't use bio_iter as a flag
Recently a bug was fixed in which the bio_iter field in a ceph message was not being properly re-initialized when a message got re-transmitted: commit 43643528cce60ca184fe8197efa8e8da7c89a037 Author: Yan, Zheng zheng.z@intel.com rbd: Clear ceph_msg-bio_iter for retransmitted message We are now only initializing the bio_iter field when we are about to start to write message data (in prepare_write_message_data()), rather than every time we are attempting to write any portion of the message data (in write_partial_msg_pages()). This means we no longer need to use the msg-bio_iter field as a flag. So just don't do that any more. Trust prepare_write_message_data() to ensure msg-bio_iter is properly initialized, every time we are about to begin writing (or re-writing) a message's bio data. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |6 +- 1 file changed, 1 insertion(+), 5 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -628,7 +628,7 @@ static void prepare_write_message_data(s else con-out_msg_pos.page_pos = 0; #ifdef CONFIG_BLOCK - if (msg-bio !msg-bio_iter) + if (msg-bio) init_bio_iter(msg-bio, msg-bio_iter, msg-bio_seg); #endif con-out_msg_pos.data_pos = 0; @@ -696,10 +696,6 @@ static void prepare_write_message(struct m-hdr.seq = cpu_to_le64(++con-out_seq); m-needs_out_seq = false; } -#ifdef CONFIG_BLOCK - else - m-bio_iter = NULL; -#endif dout(prepare_write_message %p seq %lld type %d len %d+%d+%d %d pgs\n, m, con-out_seq, le16_to_cpu(m-hdr.type), -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
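So after this patch the bio setup in prepare_write_message_data() presumably reduces to the unconditional form below, and the old "else m->bio_iter = NULL" reset in prepare_write_message() can be dropped (sketch, not authoritative patch text):

#ifdef CONFIG_BLOCK
    /* (re)initialize the iterator every time we prepare to write data,
     * including retransmits, so bio_iter no longer doubles as a flag */
    if (msg->bio)
        init_bio_iter(msg->bio, &msg->bio_iter, &msg->bio_seg);
#endif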
Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?
Hrm, look at your journal_queue_max_ops, journal_queue_max_bytes, op_queue_max_ops, and op_queue_max_bytes. Looks like you are set at 500 ops and a maximum of 100MB. With 1GigE you'd be able to max out the data in the journal really fast. Try tweaking these up and see what happens. test was made with 15 osd, each osd with 1GB journal. (so 1Gits = 100MB/S *3 with replication = 300MB /15 osd = 20MB/S (that around what I see with iostat) with 1GB journal , it should handle around 50s. I have redone a test, with 1 osd on 3 nodes with 8GB journal (write around 60-80MB/S on each osd) journal_queue_max_bytes show again 100MB journal_queue_max_ops = 500 but journal_ops = 6500 journal_queue_ops = 0 journal_queue_bytes = 0 (I have done perfcounters_dump each second and journal_queue_ops,journal_queue_bytes are always 0) op_queue_max_bytes:100MB op_queue_max_ops:500 (what are op_ counters ? osd counter ?) Should'nt be queues values as low as possible ? (0 queue = 0 bottleneck) ? root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok perfcounters_dump {filestore:{journal_queue_max_ops:500,journal_queue_ops:0,journal_ops:6554,journal_queue_max_bytes:104857600,journal_queue_bytes:0,journal_bytes:6624795873,journal_latency:{avgcount:6554,sum:11.5094},op_queue_max_ops:500,op_queue_ops:0,ops:6554,op_queue_max_bytes:104857600,op_queue_bytes:0,bytes:6624755213,apply_latency:{avgcount:6554,sum:4462.6},committing:0,commitcycle:143,commitcycle_interval:{avgcount:143,sum:736.741},commitcycle_latency:{avgcount:143,sum:17.2976},journal_full:0},osd:{opq:0,op_wip:1,op:838,op_in_bytes:3514930636,op_out_bytes:0,op_latency:{avgcount:838,sum:201.494},op_r:0,op_r_out_bytes:0,op_r_latency:{avgcount:0,sum:0},op_w:838,op_w_in_bytes:3514930636,op_w_rlat:{avgcount:838,sum:0},op_w_latency:{avgcount:838,sum:201.494},op_rw:0,op_rw_in_bytes:0,op_rw_out_bytes:0,op_rw_rlat:{avgcount:0,sum:0},op_rw_latency:{avgcount:0,sum:0},subop:739,subop_in_bytes:3099988795,subop_latency:{avgcount:739,sum:45.7711},subop_w:0,subop_w_in_bytes:3099988795,subop_w_latency:{avgcount:739,sum:45.7711},subop_pull:0,subop_pull_latency:{avgcount:0,sum:0},subop_push:0,subop_push_in_bytes:0,subop_push_latency:{avgcount:0,sum:0},pull:0,push:0,push_out_bytes:0,recovery_ops:0,loadavg:0.56,buffer_bytes:0,numpg:1387,numpg_primary:701,numpg_replica:686,numpg_stray:0,heartbeat_to_peers:2,heartbeat_from_peers:0,map_messages:18,map_message_epochs:37,map_message_epoch_dups:31},throttle-filestore_bytes:{val:0,max:104857600,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:6554,take_sum:6624795873,put:6078,put_sum:6624795873,wait:{avgcount:0,sum:0}},throttle-filestore_ops:{val:0,max:500,get:0,get_sum:0,get_or_fail_fail:0,get_or_fail_success:0,take:6554,take_sum:6554,put:6078,put_sum:6554,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-client:{val:0,max:104857600,get:1076,get_sum:3523503185,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1076,put_sum:3523503185,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-cluster:{val:0,max:104857600,get:5006,get_sum:3103900299,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:5006,put_sum:3103900299,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbclient:{val:0,max:104857600,get:478,get_sum:22466,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:478,put_sum:22466,wait:{avgcount:0,sum:0}},throttle-msgr_dispatch_throttler-hbserver:{val:0,max:104857600,get:484,get_sum:22748,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:484,put_s
um:22748,wait:{avgcount:0,sum:0}},throttle-osd_client_bytes:{val:0,max:524288000,get:840,get_sum:3523353965,get_or_fail_fail:0,get_or_fail_success:0,take:0,take_sum:0,put:1679,put_sum:3523353965,wait:{avgcount:0,sum:0}}} - Mail original - De: Mark Nelson mark.nel...@inktank.com À: Alexandre DERUMIER aderum...@odiso.com Cc: ceph-devel@vger.kernel.org Envoyé: Lundi 18 Juin 2012 17:16:17 Objet: Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ? On 6/18/12 9:47 AM, Alexandre DERUMIER wrote: Yeah, I just wanted to make sure that the constant writes weren't because the filestore was falling behind. You may want to take a look at some of the information that is provided by the admin socket for the OSD while the test is running. dump_ops_in_flight, perf schema, and perf dump are all useful. don't know which values to check in these big json reponses ;) But I have try with more osd, so write are splitted on more disks and and write are smaller, and the behaviour is same No worries, there is a lot of data there! root@cephtest1:/var/run/ceph# ceph --admin-daemon ceph-osd.0.asok dump_ops_in_flight { num_ops: 1, ops: [ { description: osd_op(client.4179.0:83 kvmtest1_1006560_object82 [write 0~4194304] 3.9f5c55af), received_at: 2012-06-18 16:41:17.995167, age: 0.406678, flag_point: waiting for sub ops,
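For watching just the queue counters while a test runs, something like the following works (assuming the same admin socket path as above; python -m json.tool is only used for pretty-printing):

$ watch -n1 "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perfcounters_dump | python -m json.tool | grep -E 'journal_queue_(ops|bytes)|op_queue_(ops|bytes)'"

As far as I can tell, the op_* counters inside the filestore section are the filestore's own apply/op queue (journal to data disk), separate from the op counters in the osd section, which count client operations.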
Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?
On Mon, Jun 18, 2012 at 5:34 AM, Alexandre DERUMIER aderum...@odiso.com wrote: I'm doing tests with rados bench, and I see constant writes to the osd disks. Is that the normal behaviour? With write-ahead, shouldn't writes occur only every 20-30 seconds? Is the osd data filesystem perhaps doing atime updates? noatime is your friend. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ?
noatime and nodiratime are already enabled cat /etc/fstab /dev/sdb /srv/osd.0 xfs noatime,nodiratime 0 0 (drive was formatted simply with mkfs.xfs /dev/sdb) - Mail original - De: Tommi Virtanen t...@inktank.com À: Alexandre DERUMIER aderum...@odiso.com Cc: ceph-devel@vger.kernel.org Envoyé: Lundi 18 Juin 2012 18:01:52 Objet: Re: iostat show constants write to osd disk with writeahead journal, normal behaviour ? On Mon, Jun 18, 2012 at 5:34 AM, Alexandre DERUMIER aderum...@odiso.com wrote: I'm doing test with rados bench, and I see constant writes to osd disks. Is it the normal behaviour ? with write-ahead should write occur each 20-30 seconde ? Is the osd data filesystem perhaps doing atime updates? noatime is your friend. -- -- Alexandre D erumier Ingénieur Système Fixe : 03 20 68 88 90 Fax : 03 20 68 90 81 45 Bvd du Général Leclerc 59100 Roubaix - France 12 rue Marivaux 75002 Paris - France -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
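If the expectation was that the data disk only gets flushed every 20-30 seconds, the cadence is governed by the filestore sync interval options rather than the journal size; purely as an illustration of the knobs (not a recommendation), something like

[osd]
    filestore min sync interval = 10
    filestore max sync interval = 30

would batch commits into larger, less frequent bursts, at the cost of a bigger journal backlog to replay after a crash.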
Re: RBD layering design draft
On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil s...@inktank.com wrote: Is 'preserve' and 'unpreserve' the verbiage we want to use here? Not sure I have a better suggestion, but preserve is unusual. protect/unprotect? The flag protects the image snapshot from being deleted. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
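If the protect/unprotect naming were adopted, the commands sketched in the draft would presumably become:

$ rbd protect pool/image@snap
$ rbd unprotect pool/image@snap
rbd: Cannot unprotect: still in use by pool2/image2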
Re: sync Ceph packaging efforts for Debian/Ubuntu
On Mon, 18 Jun 2012, James Page wrote: Hi Sage/Laszlo Laszlo - thanks for sending the original email - I'd like to get everything as closely in-sync as possible between the three packaging sources as well. On 16/06/12 22:50, Sage Weil wrote: I've take a closer look at these patches, and have a few questions. - The URL change and nss patches I've applied; they are in the ceph.git 'debian' branch. Great! - Has the leveldb patch been sent upstream? Once it is committed to the upstream git, we can update ceph to use it; that's nicer than carrying the patch. However, I thought you needed to link against the existing libleveldb1 package... which means we shouldn't do anything on our side, right? I can't see any evidence that this has been sent upstream; ideally we would be building against libleveldb1 rather than using the embedded copy - I'm not familiar with the reason that this has not happened already (if there is one). This package would also need to be reviewed for inclusion in main if that was the case. We bundled it for expediency, that's all. I just send the patch off to the leveldb mailing list (in case that hadn't happened yet); we'll see if they apply it. - I'm not sure how useful it is to break mount.ceph and cephfs into a separate ceph-fs-common package, but we can do it. Same goes for a separate package for ceph-mds. That was originally motivated by ubuntu not wanting the mds in main, but in the end only the libraries went in, so it's a moot point. I'd rather hear from them what their intentions are for 12.10 before complicating things... ceph-fs-common is in Ubuntu main; so I think the original motivation still stands IMHO. Okay, split that part. For the Ubuntu quantal cycle we still have the same primary objective as we had during 12.04; namely ensuring that Ceph RBD can be used as a block store for qemu-kvm which ties nicely into the Ubuntu OpenStack story through Cinder; In addition we will be looking at Ceph RADOS as a backend for Glance (see [0] for more details). I'm reading this to meant hat you still want the mds separated out; did that too. The MIR for Ceph occurred quite late in the 12.04 cycle so we had to trim the scope to actually get it done; We will be looking at libfcgi and google-perftools this cycle for main inclusion to re-enable the components that are currently disabled in the Ubuntu packaging. Including those (and libleveldb1) would be ideal. - That same patch also switched all the Architecture: lines back to linux-any. Was that intentional? I just changed them from that last week. I think linux-any is correct - the change you have made would exclude the PPC architecture in Ubuntu and Debian. [...] Ben, James, can you please share in some sentences why ceph-fuse is dropped in Ubuntu? Do you need it Sage? If it's feasible, you may drop that as well. There is an outstanding question on the 12.04 MIR as to whether this package could still be built but not promoted to main - I'll follow up with the MIR reviewer as to whether that's possible as I don't think it requires any additional build dependencies. [...] I hope that explains the Ubuntu position on Ceph and what plans we have this development cycle. Okay, keep us posted! I pushed a new 'debian' branch with those changes; please take a look and let me know if it loks okay. Thanks- sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RBD layering design draft
On Fri, Jun 15, 2012 at 1:48 PM, Josh Durgin josh.dur...@inktank.com wrote: $ rbd unpreserve pool/image@snap Error unpreserving: child images rely on this image UX nit: this should also say what image it found. rbd: Cannot unpreserve: Still in use by pool2/image2 $ rbd list_children pool/image@snap pool2/child1 pool2/child2 How about just rbd children? Especially the underscore makes me unhappy. $ rbd copyup pool2/child1 Does copyup make sense to everyone? Every time you say it, my brain needs to flip the image inside the other way around -- I naturally imagine a tree with the parent at the top, and children and grandchildren down from it, but then I can't call that operation copyup without wrecking my mental image. I also can't seem to google good evidence that the term would be in widespread use in the enterprisey block storage world, outside of the unionfs world.. What do people call the un-dedupping, un-thinning of copy-on-write thin provisioning? unshare? In addition to knowing which parent a given image has, we want to be able to tell if a preserved image still has children. This is accomplished with a new per-pool object, `rbd_children`, which maps (parent pool, parent id, parent snapshot id) to a list of child image ids. So the omap value is a list, and you need to support atomic add/remove on the list members? Are you thinking of using an rbd class method that does read-modify-write for that? My instincts would have gone for (parent_pool, parent_id, parent_snapshot_id, child_id) - None, to get atomic operations for free. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
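To make the two layouts concrete (hypothetical encoding, purely to illustrate the trade-off): the draft stores one omap value per parent snapshot holding the whole child list, while the per-child-key variant makes add/remove a plain omap key insert or delete:

draft:        key = (parent_pool, parent_id, parent_snap_id)            value = [child1_id, child2_id, ...]
alternative:  key = (parent_pool, parent_id, parent_snap_id, child1_id) value = (empty)
              key = (parent_pool, parent_id, parent_snap_id, child2_id) value = (empty)

With the first layout, adding or removing a child means a read-modify-write of the list (hence the class method); with the second, it is a single omap operation and listing the children of a snapshot is a prefix scan.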
Re: RBD layering design draft
On 06/18/2012 10:00 AM, Tommi Virtanen wrote: On Fri, Jun 15, 2012 at 1:48 PM, Josh Durginjosh.dur...@inktank.com wrote: $ rbd unpreserve pool/image@snap Error unpreserving: child images rely on this image UX nit: this should also say what image it found. rbd: Cannot unpreserve: Still in use by pool2/image2 Agreed. $ rbd list_children pool/image@snap pool2/child1 pool2/child2 How about just rbd children? Especially the underscore makes me unhappy. Yeah, that sounds better. $ rbd copyup pool2/child1 Does copyup make sense to everyone? Every time you say it, my brain needs to flip the image inside the other way around -- I naturally imagine a tree with the parent at the top, and children and grandchildren down from it, but then I can't call that operation copyup without wrecking my mental image. I also can't seem to google good evidence that the term would be in widespread use in the enterprisey block storage world, outside of the unionfs world.. What do people call the un-dedupping, un-thinning of copy-on-write thin provisioning? unshare? I'm not sure what best term is, but there's probably something better than copyup. In addition to knowing which parent a given image has, we want to be able to tell if a preserved image still has children. This is accomplished with a new per-pool object, `rbd_children`, which maps (parent pool, parent id, parent snapshot id) to a list of child image ids. So the omap value is a list, and you need to support atomic add/remove on the list members? Are you thinking of using an rbd class method that does read-modify-write for that? My instincts would have gone for (parent_pool, parent_id, parent_snapshot_id, child_id) - None, to get atomic operations for free. The reason for making it a class method is more about hiding the implementation from clients. It could be the mapping you describe in an omap. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD hotplugging Chef cookbook (chef-1)
On Thu, Jun 14, 2012 at 6:32 AM, Danny Kukawka danny.kuka...@bisect.de wrote: And where can I find this branch? I've checked the git repo at: https://github.com/ceph/ceph-cookbooks But couldn't find any branch called ceph-1. Use master of ceph-cookbooks.git, and where those instructions said ceph-1, put in master (= use master of ceph.git). The instructions from that email are being distilled into proper documentation at http://ceph.com/docs/master/install/chef/ http://ceph.com/docs/master/config-cluster/chef/ -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ceph performance on Ubuntu Oneiric vs Ubuntu Precise
Hi Guys,

I've been tracking down some performance issues over the past month with our internal test nodes and believe I have narrowed it down to something related to Ubuntu Oneiric. Tests done on nodes running Ubuntu Precise are significantly faster. One of the major differences between the releases is the support for syncfs in libc. Theoretically this shouldn't have a big effect on btrfs, so I'm not totally sure that this is the culprit. Having said that, previous tests showed good SSD performance on Oneiric, leading me to believe the lower latency mitigates the effect. Some of the spinning disk seekwatcher results for Oneiric are quite strange, with long periods of inactivity on the OSD data disks.

I wanted to post these results for those of you who have had performance problems in the past. If you are continuing to have issues, you may want to try testing on Precise and see if you notice any changes. It is possible that all of this could be specific to our internal testing nodes, so I wouldn't mind hearing if other people have seen similar behavior.

These tests were done using rados bench with 16 concurrent requests. There are two nodes that each have a single 7200rpm OSD data disk and journal on a second 7200rpm disk. Replication is set at the default level (2). Kernel is 3.4 in all cases. Here's a run down (numbers are MB/s):

4KB Requests:           BTRFS     EXT4      XFS
Ceph 0.46/Oneiric:      0.073     0.694     0.723
Ceph 0.46/Precise:      2.15      2.031     1.546
Ceph 0.47.2/Oneiric:    1.072     0.836     0.749
Ceph 0.47.2/Precise:    2.566     2.579     1.498

128KB Requests:         BTRFS     EXT4      XFS
Ceph 0.46/Oneiric:      11.874    20.066    12.641
Ceph 0.46/Precise:      49.304    39.736    38.982
Ceph 0.47.2/Oneiric:    13.81     19.05     12.739
Ceph 0.47.2/Precise:    47.943    49.655    36.764

4MB Requests:           BTRFS     EXT4      XFS
Ceph 0.46/Oneiric:      110.202   26.58     15.445
Ceph 0.46/Precise:      135.975   128.759   106.426
Ceph 0.47.2/Oneiric:    91.337    46.277    23.897
Ceph 0.47.2/Precise:    136.906   134.955   106.545

I've posted seekwatcher results for all of the tests:

Ceph 0.46/Oneiric:      http://nhm.ceph.com/movies/sprint/test2
Ceph 0.46/Precise:      http://nhm.ceph.com/movies/sprint/test3
Ceph 0.47.2/Oneiric:    http://nhm.ceph.com/movies/sprint/test4
Ceph 0.47.2/Precise:    http://nhm.ceph.com/movies/sprint/test5

Mark
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
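For reference, the runs above presumably correspond to rados bench invocations along these lines (pool name is a placeholder; -b sets the write size, which defaults to 4MB):

$ rados -p data bench 60 write -t 16 -b 4096      # 4KB requests
$ rados -p data bench 60 write -t 16 -b 131072    # 128KB requests
$ rados -p data bench 60 write -t 16              # 4MB requests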
Re: RBD layering design draft
On Mon, 18 Jun 2012, Josh Durgin wrote: $ rbd copyup pool2/child1 Does copyup make sense to everyone? Every time you say it, my brain needs to flip the image inside the other way around -- I naturally imagine a tree with the parent at the top, and children and grandchildren down from it, but then I can't call that operation copyup without wrecking my mental image. I also can't seem to google good evidence that the term would be in widespread use in the enterprisey block storage world, outside of the unionfs world.. What do people call the un-dedupping, un-thinning of copy-on-write thin provisioning? unshare? I'm not sure what best term is, but there's probably something better than copyup. flatten? My mental model is stuck on the layering analogy, where the child is a copy-on-write layer on top of a read-only parent. Someday we may want to support the ability to add a parent to an existing image and do a sort of dedup, so having an opposite for whatever term we pick would be a bonus. sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise
Do I correctly assume that these nodes hosted only the OSDs, and the monitors were on a separate node? On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelson mark.nel...@inktank.com wrote: Hi Guys, I've been tracking down some performance issues over the past month with our internal test nodes and believe I have narrowed it down to something related to Ubuntu Oneiric. Tests done on nodes running Ubuntu Precise are significantly faster. One of the major differences between the releases is the support for syncfs in libc. Theoretically this shouldn't have a big effect on btrfs so I'm not totally sure that this is the culprit. Having said that, previous tests showed good SSD performance on Oneiric leading me to believe the lower latency mitigates the effect. Some of spinning disk seekwatcher results for Oneiric are quite strange with long periods of inactivity on the OSD data disks. I wanted to post these results for those of you who have had performance problems in the past. If you are continuing to have issues, you may want to try testing on precise and see if you notice any changes. It is possible that all of this could be specific to our internal testing nodes, so I wouldn't mind hearing if other people have seen similar behavior. These tests were done using rados bench with 16 concurrent requests. There are two nodes that each have a single 7200rpm OSD data disk and journal on a second 7200rpm disk. Replication is set at the default level (2). Kernel is 3.4 in all cases. Here's a run down (Numbers are MB/s) 4KB Requests BTRFS EXT4 XFS Ceph 0.46/Oneiric: 0.073 0.694 0.723 Ceph 0.46/Precise: 2.15 2.031 1.546 Ceph 0.47.2/Oneiric: 1.072 0.836 0.749 Ceph 0.47.2/Precise: 2.566 2.579 1.498 128KB Requests: BTRFS EXT4 XFS Ceph 0.46/Oneiric: 11.874 20.066 12.641 Ceph 0.46/Precise: 49.304 39.736 38.982 Ceph 0.47.2/Oneiric: 13.81 19.05 12.739 Ceph 0.47.2/Precise: 47.943 49.655 36.764 4MB Requests: BTRFS EXT4 XFS Ceph 0.46/Oneiric: 110.202 26.58 15.445 Ceph 0.46/Precise: 135.975 128.759 106.426 Ceph 0.47.2/Oneiric: 91.337 46.277 23.897 Ceph 0.47.2/Precise: 136.906 134.955 106.545 I've posted seekwatcher results for all of the tests: Ceph 0.46/Oneiric: http://nhm.ceph.com/movies/sprint/test2 Ceph 0.46/Precise: http://nhm.ceph.com/movies/sprint/test3 Ceph 0.47.2/Oneiric: http://nhm.ceph.com/movies/sprint/test4 Ceph 0.47.2/Precise: http://nhm.ceph.com/movies/sprint/test5 Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RBD layering design draft
Locking is a separate mechanism we're already working on, which will lock images so that they can't accidentally be mounted at more than one location. :) -Greg On Sun, Jun 17, 2012 at 6:42 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, what's up with locked, unlocked, unlocking? -martin Am 16.06.2012 17:11, schrieb Sage Weil: On Fri, 15 Jun 2012, Yehuda Sadeh wrote: On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil s...@inktank.com wrote: Looks good! Couple small things: $ rbd unpreserve pool/image@snap Is 'preserve' and 'unpreserve' the verbiage we want to use here? Not sure I have a better suggestion, but preserve is unusual. freeze, thaw/unfreeze? Freeze/thaw usually mean something like quiesce I/O or read-only, usually temporarily. What we actually mean is you can't delete this. Maybe pin/unpin? preserve/unpreserve may be fine, too! sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise
Hi Greg, Yep, 3 monitors each on their own node. Mark On 06/18/2012 01:04 PM, Gregory Farnum wrote: Do I correctly assume that these nodes hosted only the OSDs, and the monitors were on a separate node? On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelsonmark.nel...@inktank.com wrote: Hi Guys, I've been tracking down some performance issues over the past month with our internal test nodes and believe I have narrowed it down to something related to Ubuntu Oneiric. Tests done on nodes running Ubuntu Precise are significantly faster. One of the major differences between the releases is the support for syncfs in libc. Theoretically this shouldn't have a big effect on btrfs so I'm not totally sure that this is the culprit. Having said that, previous tests showed good SSD performance on Oneiric leading me to believe the lower latency mitigates the effect. Some of spinning disk seekwatcher results for Oneiric are quite strange with long periods of inactivity on the OSD data disks. I wanted to post these results for those of you who have had performance problems in the past. If you are continuing to have issues, you may want to try testing on precise and see if you notice any changes. It is possible that all of this could be specific to our internal testing nodes, so I wouldn't mind hearing if other people have seen similar behavior. These tests were done using rados bench with 16 concurrent requests. There are two nodes that each have a single 7200rpm OSD data disk and journal on a second 7200rpm disk. Replication is set at the default level (2). Kernel is 3.4 in all cases. Here's a run down (Numbers are MB/s) 4KB Requests BTRFS EXT4XFS Ceph 0.46/Oneiric: 0.073 0.694 0.723 Ceph 0.46/Precise: 2.152.031 1.546 Ceph 0.47.2/Oneiric:1.072 0.836 0.749 Ceph 0.47.2/Precise:2.566 2.579 1.498 128KB Requests: BTRFS EXT4XFS Ceph 0.46/Oneiric: 11.874 20.066 12.641 Ceph 0.46/Precise: 49.304 39.736 38.982 Ceph 0.47.2/Oneiric:13.81 19.05 12.739 Ceph 0.47.2/Precise:47.943 49.655 36.764 4MB Requests: BTRFS EXT4XFS Ceph 0.46/Oneiric: 110.202 26.58 15.445 Ceph 0.46/Precise: 135.975 128.759 106.426 Ceph 0.47.2/Oneiric:91.337 46.277 23.897 Ceph 0.47.2/Precise:136.906 134.955 106.545 I've posted seekwatcher results for all of the tests: Ceph 0.46/Oneiric: http://nhm.ceph.com/movies/sprint/test2 Ceph 0.46/Precise: http://nhm.ceph.com/movies/sprint/test3 Ceph 0.47.2/Oneiric:http://nhm.ceph.com/movies/sprint/test4 Ceph 0.47.2/Precise:http://nhm.ceph.com/movies/sprint/test5 Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise
On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelson mark.nel...@inktank.com wrote: I've been tracking down some performance issues over the past month with our internal test nodes and believe I have narrowed it down to something related to Ubuntu Oneiric. Tests done on nodes running Ubuntu Precise are significantly faster. Did you use Ubuntu kernels or our own builds? Different/same across runs? -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph performance on Ubuntu Oneiric vs Ubuntu Precise
On 06/18/2012 01:12 PM, Tommi Virtanen wrote: On Mon, Jun 18, 2012 at 10:56 AM, Mark Nelsonmark.nel...@inktank.com wrote: I've been tracking down some performance issues over the past month with our internal test nodes and believe I have narrowed it down to something related to Ubuntu Oneiric. Tests done on nodes running Ubuntu Precise are significantly faster. Did you use Ubuntu kernels or our own builds? Different/same across runs? Each set of nodes is using our kernel from gitbuilder: http://gitbuilder.ceph.com/kernel-deb-oneiric-x86_64-basic/ref/v3.4/ and http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/v3.4/ respectively. I should note that this problem was also seen on kernel 3.3 from gitbuilder with oneiric, though I do not have comparative numbers available. Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Heavy speed difference between rbd and custom pool
Hello list, i'm getting these rbd bench values for pool rbd. They're high and constant. - RBD pool # rados -p rbd bench 30 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 274 258 1031.77 1032 0.043758 0.0602236 2 16 549 533 1065.82 1100 0.072168 0.0590944 3 16 825 8091078.5 1104 0.040162 0.058682 4 16 1103 1087 1086.84 1112 0.052508 0.0584277 5 16 1385 1369 1095.04 1128 0.060233 0.0581288 6 16 1654 1638 1091.85 1076 0.050697 0.0583385 7 16 1939 1923 1098.71 1140 0.063716 0.057964 8 16 2219 2203 1101.35 1120 0.055435 0.0579105 9 16 2497 2481 1102.52 1112 0.060413 0.0578282 10 16 2773 2757 1102.66 1104 0.051134 0.0578561 11 16 3049 3033 1102.77 1104 0.057742 0.0578803 12 16 3326 3310 1103.19 1108 0.053769 0.0578627 13 16 3604 3588 1103.86 1112 0.064574 0.0578453 14 16 3883 3867 1104.72 1116 0.056524 0.0578018 15 16 4162 4146 1105.46 1116 0.054581 0.0577626 16 16 4440 4424 1105.86 1112 0.079015 0.057758 17 16 4725 4709 1107.86 1140 0.043511 0.0576647 18 16 5007 4991 1108.97 1128 0.053005 0.0576147 19 16 5292 52761110.6 1140 0.069004 0.057538 2012-06-18 23:36:19.124472min lat: 0.028568 max lat: 0.201941 avg lat: 0.0574953 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 20 16 5574 5558 .46 1128 0.048482 0.0574953 21 16 5861 5845 1113.18 1148 0.051923 0.0574146 22 16 6147 6131 1114.58 1144 0.04461 0.0573461 23 16 6438 6422 1116.72 1164 0.050383 0.0572406 24 16 6724 6708 1117.85 1144 0.067827 0.0571864 25 16 7008 6992 1118.57 1136 0.049128 0.057147 26 16 7296 7280 1119.85 1152 0.050331 0.0570879 27 16 7573 75571119.4 1108 0.052711 0.0571132 28 16 7858 7842 1120.13 1140 0.056369 0.0570764 29 16 8143 8127 1120.81 1140 0.046558 0.0570438 30 16 8431 8415 1121.85 1152 0.049958 0.0569942 Total time run: 30.045481 Total writes made: 8431 Write size: 4194304 Bandwidth (MB/sec): 1122.432 Stddev Bandwidth: 26.0451 Max bandwidth (MB/sec): 1164 Min bandwidth (MB/sec): 1032 Average Latency:0.0570069 Stddev Latency: 0.0128039 Max latency:0.235536 Min latency:0.028568 - I created then a custom pool called kvmpool. ~# ceph osd pool create kvmpool pool 'kvmpool' created But with this one i get slow and jumping values: kvmpool ~# rados -p kvmpool bench 30 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 231 215 859.863 860 0.204867 0.069195 2 16 393 377 753.899 648 0.049444 0.0811933 3 16 535 519 691.908 568 0.232365 0.0899074 4 16 634 618 617.913 396 0.032758 0.0963399 5 16 806 790 631.913 688 0.075811 0.099529 6 16 948 932 621.249 568 0.156988 0.10179 7 16 1086 1070 611.348 552 0.036177 0.102064 8 16 1206 1190 594.922 480 0.028491 0.105235 9 16 1336 1320 586.589 520 0.041009 0.108735 10 16 1512 1496598.32 704 0.258165 0.105086 11 16 1666 1650 599.921 616 0.040967 0.106146 12 15 1825 1810 603.255 640 0.198851 0.105463 13 16 1925 1909 587.309 396 0.042577 0.108449 14 16 2135 2119 605.352 840 0.035767 0.105219 15 16 2272 2256 601.523 548 0.246136 0.105357 16 16 2426 2410 602.424 616 0.19881 0.105692 17 16 2529 2513591.22 412 0.031322 0.105463 18 16 2696 2680595.48 668 0.028081 0.106749
Re: Possible deadlock condition
Here is, perhaps, a more useful traceback from a different run of tests that we just ran into: Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.680815] INFO: task flush-254:0:29582 blocked for more than 120 seconds. Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681040] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681458] flush-254:0 D 880bd9ca2fc0 0 29582 2 0x Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681740] 88006e51d160 0046 0002 88061b362040 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682173] 88006e51d160 000120c0 000120c0 000120c0 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682659] 88006e51dfd8 000120c0 000120c0 88006e51dfd8 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683088] Call Trace: Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683302] [81520132] schedule+0x5a/0x5c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683514] [815203e7] schedule_timeout+0x36/0xe3 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683784] [8101e0b2] ? physflat_send_IPI_mask+0xe/0x10 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683999] [8101a237] ? native_smp_send_reschedule+0x46/0x48 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684219] [811e0071] ? list_move_tail+0x27/0x2c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684432] [81520d13] __down_common+0x90/0xd4 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684708] [811e1120] ? _xfs_buf_find+0x17f/0x210 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684925] [81520dca] __down+0x1d/0x1f Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685139] [8105db4e] down+0x2d/0x3d Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685350] [811e0f68] xfs_buf_lock+0x76/0xaf Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685565] [811e1120] _xfs_buf_find+0x17f/0x210 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685836] [811e13b6] xfs_buf_get+0x2a/0x177 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686052] [811e19f6] xfs_buf_read+0x1f/0xca Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686270] [8122a0b7] xfs_trans_read_buf+0x205/0x308 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686490] [81205e01] xfs_btree_read_buf_block.clone.22+0x4f/0xa7 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687015] [8122a3ee] ? xfs_trans_log_buf+0xb2/0xc1 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687232] [81205edd] xfs_btree_lookup_get_block+0x84/0xac Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687449] [81208e83] xfs_btree_lookup+0x12b/0x3dc Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687721] [811f6bb2] ? xfs_alloc_vextent+0x447/0x469 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687939] [811fd171] xfs_bmbt_lookup_eq+0x1f/0x21 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688156] [811ffa88] xfs_bmap_add_extent_delay_real+0x5b5/0xfec Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688378] [810f155b] ? kmem_cache_alloc+0x87/0xf3 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688650] [81204c40] ? xfs_bmbt_init_cursor+0x3f/0x107 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688867] [81201160] xfs_bmapi_allocate+0x1f6/0x23a Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689084] [812185bd] ? 
xfs_iext_bno_to_irec+0x95/0xb9 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689301] [81203414] xfs_bmapi_write+0x32d/0x5a2 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689519] [811e99e4] xfs_iomap_write_allocate+0x1a5/0x29f Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689797] [811df12a] xfs_map_blocks+0x13e/0x1dd Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690016] [811dfbff] xfs_vm_writepage+0x24e/0x410 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690233] [810bde1e] __writepage+0x17/0x30 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690446] [810be6ed] write_cache_pages+0x276/0x3c8 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690693] [810bde07] ? set_page_dirty+0x60/0x60 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690908] [810be884] generic_writepages+0x45/0x5c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691123] [811defcb] xfs_vm_writepages+0x4d/0x54 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691337] [810bf832] do_writepages+0x21/0x2a Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691552] [811218f5] writeback_single_inode+0x12a/0x2cc Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691800] [81121d92] writeback_sb_inodes+0x174/0x215 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692016] [81122185] __writeback_inodes_wb+0x78/0xb9 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.692231] [811224b5]
Re: Heavy speed difference between rbd and custom pool
On 06/18/2012 04:39 PM, Stefan Priebe wrote: Hello list, i'm getting these rbd bench values for pool rbd. They're high and constant. - RBD pool # rados -p rbd bench 30 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 274 258 1031.77 1032 0.043758 0.0602236 2 16 549 533 1065.82 1100 0.072168 0.0590944 3 16 825 809 1078.5 1104 0.040162 0.058682 4 16 1103 1087 1086.84 1112 0.052508 0.0584277 5 16 1385 1369 1095.04 1128 0.060233 0.0581288 6 16 1654 1638 1091.85 1076 0.050697 0.0583385 7 16 1939 1923 1098.71 1140 0.063716 0.057964 8 16 2219 2203 1101.35 1120 0.055435 0.0579105 9 16 2497 2481 1102.52 1112 0.060413 0.0578282 10 16 2773 2757 1102.66 1104 0.051134 0.0578561 11 16 3049 3033 1102.77 1104 0.057742 0.0578803 12 16 3326 3310 1103.19 1108 0.053769 0.0578627 13 16 3604 3588 1103.86 1112 0.064574 0.0578453 14 16 3883 3867 1104.72 1116 0.056524 0.0578018 15 16 4162 4146 1105.46 1116 0.054581 0.0577626 16 16 4440 4424 1105.86 1112 0.079015 0.057758 17 16 4725 4709 1107.86 1140 0.043511 0.0576647 18 16 5007 4991 1108.97 1128 0.053005 0.0576147 19 16 5292 5276 1110.6 1140 0.069004 0.057538 2012-06-18 23:36:19.124472min lat: 0.028568 max lat: 0.201941 avg lat: 0.0574953 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 20 16 5574 5558 .46 1128 0.048482 0.0574953 21 16 5861 5845 1113.18 1148 0.051923 0.0574146 22 16 6147 6131 1114.58 1144 0.04461 0.0573461 23 16 6438 6422 1116.72 1164 0.050383 0.0572406 24 16 6724 6708 1117.85 1144 0.067827 0.0571864 25 16 7008 6992 1118.57 1136 0.049128 0.057147 26 16 7296 7280 1119.85 1152 0.050331 0.0570879 27 16 7573 7557 1119.4 1108 0.052711 0.0571132 28 16 7858 7842 1120.13 1140 0.056369 0.0570764 29 16 8143 8127 1120.81 1140 0.046558 0.0570438 30 16 8431 8415 1121.85 1152 0.049958 0.0569942 Total time run: 30.045481 Total writes made: 8431 Write size: 4194304 Bandwidth (MB/sec): 1122.432 Stddev Bandwidth: 26.0451 Max bandwidth (MB/sec): 1164 Min bandwidth (MB/sec): 1032 Average Latency: 0.0570069 Stddev Latency: 0.0128039 Max latency: 0.235536 Min latency: 0.028568 - I created then a custom pool called kvmpool. ~# ceph osd pool create kvmpool pool 'kvmpool' created But with this one i get slow and jumping values: kvmpool ~# rados -p kvmpool bench 30 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. 
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 231 215 859.863 860 0.204867 0.069195 2 16 393 377 753.899 648 0.049444 0.0811933 3 16 535 519 691.908 568 0.232365 0.0899074 4 16 634 618 617.913 396 0.032758 0.0963399 5 16 806 790 631.913 688 0.075811 0.099529 6 16 948 932 621.249 568 0.156988 0.10179 7 16 1086 1070 611.348 552 0.036177 0.102064 8 16 1206 1190 594.922 480 0.028491 0.105235 9 16 1336 1320 586.589 520 0.041009 0.108735 10 16 1512 1496 598.32 704 0.258165 0.105086 11 16 1666 1650 599.921 616 0.040967 0.106146 12 15 1825 1810 603.255 640 0.198851 0.105463 13 16 1925 1909 587.309 396 0.042577 0.108449 14 16 2135 2119 605.352 840 0.035767 0.105219 15 16 2272 2256 601.523 548 0.246136 0.105357 16 16 2426 2410 602.424 616 0.19881 0.105692 17 16 2529 2513 591.22 412 0.031322 0.105463 18 16 2696 2680 595.48 668 0.028081 0.106749 19 16 2878 2862 602.449 728 0.044929 0.105856 2012-06-18 23:38:45.566094min lat: 0.023295 max lat: 0.763797 avg lat: 0.105597 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 20 16 3041 3025 604.921 652 0.036028 0.105597 21 16 3182 3166 602.964 564 0.035072 0.104915 22 16 3349 605.916 668 0.030493 0.105304 23 16 3512 3496 607.917 652 0.030523 0.10479 24 16 3668 3652 608.584 624 0.232933 0.10475 25 16 3821 3805 608.717 612 0.029881 0.104513 26 16 3963 3947 607.148 568 0.050244 0.10531 27 16 4112 4096 606.733 596 0.259069 0.105008 28 16 4261 4245 606.347 596 0.211877 0.105215 29 16 4437 4421 609.712 704 0.02802 0.104613 30 16 4566 4550 606.586 516 0.047076 0.105111 Total time run: 30.062141 Total writes made: 4566 Write size: 4194304 Bandwidth (MB/sec): 607.542 Stddev Bandwidth: 109.112 Max bandwidth (MB/sec): 860 Min bandwidth (MB/sec): 396 Average Latency: 0.10532 Stddev Latency: 0.108369 Max latency: 0.763797 Min latency: 0.023295 Why do these pools differ? Where is the difference? Stefan Are the number of placement groups the same for each pool? try running ceph osd dump -o - | grep pool and looking for the pg_num value. Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Heavy speed difference between rbd and custom pool
Yes, this is almost certainly the problem. When you create the pool, you can specify a pg count; the default is 8, which is quite low. The count can't currently be adjusted after pool-creation time (we're working on an enhancement for that). http://ceph.com/docs/master/control/ shows ceph osd pool create POOL [pg_num [pgp_num]] You'll want to set pg_num the same for similar pools in order to get for similar pool performance. I note also that you can get that field directlty: $ ceph osd pool get rbd pg_num PG_NUM: 448 I have a 'nova' pool that was created with pool create: $ ceph osd pool get nova pg_num PG_NUM: 8 On 06/18/2012 03:23 PM, Mark Nelson wrote: On 06/18/2012 04:39 PM, Stefan Priebe wrote: Hello list, i'm getting these rbd bench values for pool rbd. They're high and constant. - RBD pool # rados -p rbd bench 30 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 274 258 1031.77 1032 0.043758 0.0602236 2 16 549 533 1065.82 1100 0.072168 0.0590944 3 16 825 809 1078.5 1104 0.040162 0.058682 4 16 1103 1087 1086.84 1112 0.052508 0.0584277 5 16 1385 1369 1095.04 1128 0.060233 0.0581288 6 16 1654 1638 1091.85 1076 0.050697 0.0583385 7 16 1939 1923 1098.71 1140 0.063716 0.057964 8 16 2219 2203 1101.35 1120 0.055435 0.0579105 9 16 2497 2481 1102.52 1112 0.060413 0.0578282 10 16 2773 2757 1102.66 1104 0.051134 0.0578561 11 16 3049 3033 1102.77 1104 0.057742 0.0578803 12 16 3326 3310 1103.19 1108 0.053769 0.0578627 13 16 3604 3588 1103.86 1112 0.064574 0.0578453 14 16 3883 3867 1104.72 1116 0.056524 0.0578018 15 16 4162 4146 1105.46 1116 0.054581 0.0577626 16 16 4440 4424 1105.86 1112 0.079015 0.057758 17 16 4725 4709 1107.86 1140 0.043511 0.0576647 18 16 5007 4991 1108.97 1128 0.053005 0.0576147 19 16 5292 5276 1110.6 1140 0.069004 0.057538 2012-06-18 23:36:19.124472min lat: 0.028568 max lat: 0.201941 avg lat: 0.0574953 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 20 16 5574 5558 .46 1128 0.048482 0.0574953 21 16 5861 5845 1113.18 1148 0.051923 0.0574146 22 16 6147 6131 1114.58 1144 0.04461 0.0573461 23 16 6438 6422 1116.72 1164 0.050383 0.0572406 24 16 6724 6708 1117.85 1144 0.067827 0.0571864 25 16 7008 6992 1118.57 1136 0.049128 0.057147 26 16 7296 7280 1119.85 1152 0.050331 0.0570879 27 16 7573 7557 1119.4 1108 0.052711 0.0571132 28 16 7858 7842 1120.13 1140 0.056369 0.0570764 29 16 8143 8127 1120.81 1140 0.046558 0.0570438 30 16 8431 8415 1121.85 1152 0.049958 0.0569942 Total time run: 30.045481 Total writes made: 8431 Write size: 4194304 Bandwidth (MB/sec): 1122.432 Stddev Bandwidth: 26.0451 Max bandwidth (MB/sec): 1164 Min bandwidth (MB/sec): 1032 Average Latency: 0.0570069 Stddev Latency: 0.0128039 Max latency: 0.235536 Min latency: 0.028568 - I created then a custom pool called kvmpool. ~# ceph osd pool create kvmpool pool 'kvmpool' created But with this one i get slow and jumping values: kvmpool ~# rados -p kvmpool bench 30 write -t 16 Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds. 
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 231 215 859.863 860 0.204867 0.069195 2 16 393 377 753.899 648 0.049444 0.0811933 3 16 535 519 691.908 568 0.232365 0.0899074 4 16 634 618 617.913 396 0.032758 0.0963399 5 16 806 790 631.913 688 0.075811 0.099529 6 16 948 932 621.249 568 0.156988 0.10179 7 16 1086 1070 611.348 552 0.036177 0.102064 8 16 1206 1190 594.922 480 0.028491 0.105235 9 16 1336 1320 586.589 520 0.041009 0.108735 10 16 1512 1496 598.32 704 0.258165 0.105086 11 16 1666 1650 599.921 616 0.040967 0.106146 12 15 1825 1810 603.255 640 0.198851 0.105463 13 16 1925 1909 587.309 396 0.042577 0.108449 14 16 2135 2119 605.352 840 0.035767 0.105219 15 16 2272 2256 601.523 548 0.246136 0.105357 16 16 2426 2410 602.424 616 0.19881 0.105692 17 16 2529 2513 591.22 412 0.031322 0.105463 18 16 2696 2680 595.48 668 0.028081 0.106749 19 16 2878 2862 602.449 728 0.044929 0.105856 2012-06-18 23:38:45.566094min lat: 0.023295 max lat: 0.763797 avg lat: 0.105597 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 20 16 3041 3025 604.921 652 0.036028 0.105597 21 16 3182 3166 602.964 564 0.035072 0.104915 22 16 3349 605.916 668 0.030493 0.105304 23 16 3512 3496 607.917 652 0.030523 0.10479 24 16 3668 3652 608.584 624 0.232933 0.10475 25 16 3821 3805 608.717 612 0.029881 0.104513 26 16 3963 3947 607.148 568 0.050244 0.10531 27 16 4112 4096 606.733 596 0.259069 0.105008 28 16 4261 4245 606.347 596 0.211877 0.105215 29 16 4437 4421 609.712 704 0.02802 0.104613 30 16 4566 4550 606.586 516 0.047076 0.105111 Total time run: 30.062141 Total writes made: 4566 Write size: 4194304 Bandwidth (MB/sec): 607.542 Stddev Bandwidth: 109.112 Max bandwidth (MB/sec): 860 Min bandwidth (MB/sec): 396 Average
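Following on from that, a sketch of how the second pool could have been created to match the default rbd pool from the test above (448 is just the pg_num from Dan's example; the right value depends on the number of OSDs):

$ ceph osd pool create kvmpool 448 448
$ ceph osd pool get kvmpool pg_num
PG_NUM: 448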
Re: Possible deadlock condition
Does the xfs on the OSD have plenty of free space left, or could this be an allocation deadlock? On 06/18/2012 03:17 PM, Mandell Degerness wrote: Here is, perhaps, a more useful traceback from a different run of tests that we just ran into: Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.680815] INFO: task flush-254:0:29582 blocked for more than 120 seconds. Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681040] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681458] flush-254:0 D 880bd9ca2fc0 0 29582 2 0x Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681740] 88006e51d160 0046 0002 88061b362040 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682173] 88006e51d160 000120c0 000120c0 000120c0 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682659] 88006e51dfd8 000120c0 000120c0 88006e51dfd8 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683088] Call Trace: Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683302] [81520132] schedule+0x5a/0x5c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683514] [815203e7] schedule_timeout+0x36/0xe3 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683784] [8101e0b2] ? physflat_send_IPI_mask+0xe/0x10 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683999] [8101a237] ? native_smp_send_reschedule+0x46/0x48 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684219] [811e0071] ? list_move_tail+0x27/0x2c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684432] [81520d13] __down_common+0x90/0xd4 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684708] [811e1120] ? _xfs_buf_find+0x17f/0x210 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684925] [81520dca] __down+0x1d/0x1f Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685139] [8105db4e] down+0x2d/0x3d Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685350] [811e0f68] xfs_buf_lock+0x76/0xaf Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685565] [811e1120] _xfs_buf_find+0x17f/0x210 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685836] [811e13b6] xfs_buf_get+0x2a/0x177 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686052] [811e19f6] xfs_buf_read+0x1f/0xca Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686270] [8122a0b7] xfs_trans_read_buf+0x205/0x308 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686490] [81205e01] xfs_btree_read_buf_block.clone.22+0x4f/0xa7 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687015] [8122a3ee] ? xfs_trans_log_buf+0xb2/0xc1 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687232] [81205edd] xfs_btree_lookup_get_block+0x84/0xac Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687449] [81208e83] xfs_btree_lookup+0x12b/0x3dc Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687721] [811f6bb2] ? xfs_alloc_vextent+0x447/0x469 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687939] [811fd171] xfs_bmbt_lookup_eq+0x1f/0x21 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688156] [811ffa88] xfs_bmap_add_extent_delay_real+0x5b5/0xfec Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688378] [810f155b] ? kmem_cache_alloc+0x87/0xf3 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688650] [81204c40] ? xfs_bmbt_init_cursor+0x3f/0x107 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688867] [81201160] xfs_bmapi_allocate+0x1f6/0x23a Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689084] [812185bd] ? 
xfs_iext_bno_to_irec+0x95/0xb9 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689301] [81203414] xfs_bmapi_write+0x32d/0x5a2 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689519] [811e99e4] xfs_iomap_write_allocate+0x1a5/0x29f Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689797] [811df12a] xfs_map_blocks+0x13e/0x1dd Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690016] [811dfbff] xfs_vm_writepage+0x24e/0x410 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690233] [810bde1e] __writepage+0x17/0x30 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690446] [810be6ed] write_cache_pages+0x276/0x3c8 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690693] [810bde07] ? set_page_dirty+0x60/0x60 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690908] [810be884] generic_writepages+0x45/0x5c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691123] [811defcb] xfs_vm_writepages+0x4d/0x54 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691337] [810bf832] do_writepages+0x21/0x2a Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691552] [811218f5] writeback_single_inode+0x12a/0x2cc Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691800] [81121d92] writeback_sb_inodes+0x174/0x215 Jun 18 17:58:51 node-172-29-0-15 kernel:
Re: RBD layering design draft
On 06/18/2012 11:01 AM, Sage Weil wrote: On Mon, 18 Jun 2012, Josh Durgin wrote: $ rbd copyup pool2/child1 Does copyup make sense to everyone? Every time you say it, my brain needs to flip the image inside the other way around -- I naturally imagine a tree with the parent at the top, and children and grandchildren down from it, but then I can't call that operation copyup without wrecking my mental image. I also can't seem to google good evidence that the term would be in widespread use in the enterprisey block storage world, outside of the unionfs world.. What do people call the un-dedupping, un-thinning of copy-on-write thin provisioning? unshare? I'm not sure what best term is, but there's probably something better than copyup. flatten? My mental model is stuck on the layering analogy, where the child is a copy-on-write layer on top of a read-only parent. Someday we may want to support the ability to add a parent to an existing image and do a sort of dedup, so having an opposite for whatever term we pick would be a bonus. disown and adopt? :) (actually I started as a joke, but really I kinda like that; fits with the parent-child name) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Possible deadlock condition
None of the OSDs seem to be more than 82% full. I didn't think we were running quite that close to the margin, but it is still far from actually full. On Mon, Jun 18, 2012 at 3:57 PM, Dan Mick dan.m...@inktank.com wrote: Does the xfs on the OSD have plenty of free space left, or could this be an allocation deadlock? On 06/18/2012 03:17 PM, Mandell Degerness wrote: Here is, perhaps, a more useful traceback from a different run of tests that we just ran into: Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.680815] INFO: task flush-254:0:29582 blocked for more than 120 seconds. Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681040] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681458] flush-254:0 D 880bd9ca2fc0 0 29582 2 0x Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.681740] 88006e51d160 0046 0002 88061b362040 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682173] 88006e51d160 000120c0 000120c0 000120c0 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.682659] 88006e51dfd8 000120c0 000120c0 88006e51dfd8 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683088] Call Trace: Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683302] [81520132] schedule+0x5a/0x5c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683514] [815203e7] schedule_timeout+0x36/0xe3 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683784] [8101e0b2] ? physflat_send_IPI_mask+0xe/0x10 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.683999] [8101a237] ? native_smp_send_reschedule+0x46/0x48 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684219] [811e0071] ? list_move_tail+0x27/0x2c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684432] [81520d13] __down_common+0x90/0xd4 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684708] [811e1120] ? _xfs_buf_find+0x17f/0x210 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.684925] [81520dca] __down+0x1d/0x1f Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685139] [8105db4e] down+0x2d/0x3d Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685350] [811e0f68] xfs_buf_lock+0x76/0xaf Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685565] [811e1120] _xfs_buf_find+0x17f/0x210 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.685836] [811e13b6] xfs_buf_get+0x2a/0x177 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686052] [811e19f6] xfs_buf_read+0x1f/0xca Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686270] [8122a0b7] xfs_trans_read_buf+0x205/0x308 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.686490] [81205e01] xfs_btree_read_buf_block.clone.22+0x4f/0xa7 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687015] [8122a3ee] ? xfs_trans_log_buf+0xb2/0xc1 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687232] [81205edd] xfs_btree_lookup_get_block+0x84/0xac Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687449] [81208e83] xfs_btree_lookup+0x12b/0x3dc Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687721] [811f6bb2] ? xfs_alloc_vextent+0x447/0x469 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.687939] [811fd171] xfs_bmbt_lookup_eq+0x1f/0x21 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688156] [811ffa88] xfs_bmap_add_extent_delay_real+0x5b5/0xfec Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688378] [810f155b] ? kmem_cache_alloc+0x87/0xf3 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688650] [81204c40] ? 
xfs_bmbt_init_cursor+0x3f/0x107 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.688867] [81201160] xfs_bmapi_allocate+0x1f6/0x23a Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689084] [812185bd] ? xfs_iext_bno_to_irec+0x95/0xb9 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689301] [81203414] xfs_bmapi_write+0x32d/0x5a2 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689519] [811e99e4] xfs_iomap_write_allocate+0x1a5/0x29f Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.689797] [811df12a] xfs_map_blocks+0x13e/0x1dd Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690016] [811dfbff] xfs_vm_writepage+0x24e/0x410 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690233] [810bde1e] __writepage+0x17/0x30 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690446] [810be6ed] write_cache_pages+0x276/0x3c8 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690693] [810bde07] ? set_page_dirty+0x60/0x60 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.690908] [810be884] generic_writepages+0x45/0x5c Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691123] [811defcb] xfs_vm_writepages+0x4d/0x54 Jun 18 17:58:51 node-172-29-0-15 kernel: [242522.691337]
Re: RBD layering design draft
On 06/18/2012 09:25 AM, Tommi Virtanen wrote: On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil s...@inktank.com wrote: Are 'preserve' and 'unpreserve' the verbiage we want to use here? Not sure I have a better suggestion, but preserve is unusual. protect/unprotect? The flag protects the image snapshot from being deleted. unremovable/removable? undeletable/deletable?
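To make the naming discussion concrete, here is a sketch of the CLI flow the flag would enable, using the protect/unprotect wording; the commands and image names are illustrative, not the final interface from the design draft:

  # snapshot a base image and guard the snapshot against deletion
  rbd snap create mypool/base-image@golden
  rbd snap protect mypool/base-image@golden
  rbd clone mypool/base-image@golden mypool/vm-disk-1
  # removing a protected snapshot is refused until it is unprotected,
  # and unprotecting should only succeed once no clones depend on it
  rbd snap rm mypool/base-image@golden
  rbd snap unprotect mypool/base-image@golden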
Re: Possible deadlock condition
I don't know enough to know if there's a connection, but I do note this prior thread that sounds kinda similar: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6574

On 06/18/2012 04:08 PM, Mandell Degerness wrote:
None of the OSDs seem to be more than 82% full. I didn't think we were running quite that close to the margin, but it is still far from actually full.

On Mon, Jun 18, 2012 at 3:57 PM, Dan Mick dan.m...@inktank.com wrote:
Does the xfs on the OSD have plenty of free space left, or could this be an allocation deadlock?
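A rough sketch of how one might check the free-space/allocation angle raised here, assuming XFS-backed OSD data directories mounted under /data/osd* (the paths and device name are hypothetical):

  # capacity and inode headroom on each OSD filesystem
  df -h /data/osd*
  df -i /data/osd*
  # XFS free-space fragmentation: a filesystem can be well under 82% full
  # and still struggle to find contiguous extents for allocation
  xfs_db -r -c "freesp -s" /dev/sdX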
RE: Performance benchmark of rbd
Hi, Mark and all: I think you may have missed this mail before, so I am sending it again.
==
I forgot to mention one thing: I create the rbd on the same machine and test it there, which means the network latency may be lower than in the normal case.

1. I use ext4 as the backend filesystem, mounted with the following attributes: data=writeback,noatime,nodiratime,user_xattr
2. I use the default replication number; I think it is 2, right?
3. On my platform, I have 192GB memory.
4. Sorry, the column names were left-right reversed. Here is the correct table:

   Object size   Seq-write   Seq-read
   32 KB         23 MB/s     690 MB/s
   512 KB        26 MB/s     960 MB/s
   4 MB          27 MB/s     1290 MB/s
   32 MB         36 MB/s     1435 MB/s

5. If I put all the journal data on an SSD device (Intel 520), the sequential write performance reaches 135MB/s instead of the original 27MB/s (object size = 4MB). The other results are no different, including random write. I am curious why the SSD device doesn't help random-write performance.
6. For the random read/write, the data I provided before was correct, but I can give you the details. Is it higher than what you expected?

   rand-write-4k:   bw 3,524   iops 881
   rand-write-16k:  bw 9,032   iops 564
   mix-4k (50/50):  r:bw 2,925  r:iops 731  w:bw 2,924  w:iops 731
   mix-8k (50/50):  r:bw 4,509  r:iops 563  w:bw 4,509  w:iops 563
   mix-16k (50/50): r:bw 8,366  r:iops 522  w:bw 8,345  w:iops 521

7. Here is the HW RAID cache policy we use now: Write Policy = Write Back with BBU, Read Policy = ReadAhead. If you are interested in how the HW RAID helps performance, I can help a little, since we also want to know the best configuration for our platform. Any test you want to see? Furthermore, is there any suggestion for our platform that could improve the performance? Thanks!

-----Original Message-----
From: Mark Nelson [mailto:mark.nel...@inktank.com]
Sent: Wednesday, June 13, 2012 8:30 PM
To: Eric YH Chen/WYHQ/Wiwynn
Cc: ceph-devel@vger.kernel.org
Subject: Re: Performance benchmark of rbd

Hi Eric! On 6/13/12 5:06 AM, eric_yh_c...@wiwynn.com wrote:
Hi, all: I am doing some benchmark of rbd. The platform is a NAS storage system.
CPU: Intel E5640 2.67GHz
Memory: 192 GB
Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12, 7200 rpm (H1~H12)
RAID Card: LSI 9260-4i
OS: Ubuntu 12.04 with Kernel 3.2.0-24
Network: 1 Gb/s
We create 12 OSDs on H1~H12 with the journals put on H0.

Just to make sure I understand, you have a single node with 12 OSDs and 3 mons, and all 12 OSDs are using the H0 disk for their journals? What filesystem are you using for the OSDs? How much replication?

We also create 3 MONs in the cluster. In brief, we set up an all-in-one ceph cluster with 3 monitors and 12 OSDs. The benchmark tool we used is fio 2.0.3. We had 7 basic test cases:
1) sequential write with bs=64k
2) sequential read with bs=64k
3) random write with bs=4k
4) random write with bs=16k
5) mixed read/write with bs=4k
6) mixed read/write with bs=8k
7) mixed read/write with bs=16k
We create several rbd images with different object sizes for the benchmark:
1. size = 20G, object size = 32KB
2. size = 20G, object size = 512KB
3. size = 20G, object size = 4MB
4. size = 20G, object size = 32MB

Given how much memory you have, you may want to increase the amount of data you are writing during each test to rule out caching.

We have some conclusions after the benchmark. a. We can get better performance for sequential read/write when the object size is bigger.

   Object size   Seq-read    Seq-write
   32 KB         23 MB/s     690 MB/s
   512 KB        26 MB/s     960 MB/s
   4 MB          27 MB/s     1290 MB/s
   32 MB         36 MB/s     1435 MB/s

Which test are these results from? I'm suspicious that the write numbers are so high.
Figure that even with a local client and 1X replication, your journals and data partitions are each writing out a copy of the data. You don't have enough disk in that box to sustain 1.4GB/s to both even under perfectly ideal conditions. Given that it sounds like you are using a single 7200rpm disk for 12 journals, I would expect far lower numbers...

b. There is no obvious influence on random read/write when the object size is different. All the results are within a range of not more than 10%.

   rand-write-4K   rand-write-16K   mix-4K      mix-8k   mix-16k
   881 iops        564 iops         1462 iops
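For readers who want to reproduce this kind of test, here is a rough sketch of the 64k sequential-write case with fio against a kernel-mapped rbd image; the pool/image name, device path, queue depth, and run length are illustrative assumptions, not the exact job the original poster used:

  # map the image and run the sequential write case from the thread
  rbd map rbd/test-image                      # exposes e.g. /dev/rbd0
  fio --name=seq-write --filename=/dev/rbd0 --rw=write --bs=64k \
      --ioengine=libaio --direct=1 --iodepth=16 --runtime=120 --size=20g

Writing more data per test than the host's 192 GB of RAM can absorb helps rule out caching effects, which is the concern raised above.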
Re: Heavy speed difference between rbd and custom pool
Hi Stefan, the recommendation is 30-50 PGs per OSD if I remember correctly.

----- Original Mail -----
From: Mark Nelson mark.nel...@inktank.com
To: Stefan Priebe s.pri...@profihost.ag
Cc: ceph-devel@vger.kernel.org
Sent: Tuesday, 19 June 2012 00:23:49
Subject: Re: Heavy speed difference between rbd and custom pool

On 06/18/2012 04:39 PM, Stefan Priebe wrote:
Hello list, i'm getting these rbd bench values for pool rbd. They're high and constant.

- RBD pool
# rados -p rbd bench 30 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 274 258 1031.77 1032 0.043758 0.0602236
2 16 549 533 1065.82 1100 0.072168 0.0590944
3 16 825 809 1078.5 1104 0.040162 0.058682
4 16 1103 1087 1086.84 1112 0.052508 0.0584277
5 16 1385 1369 1095.04 1128 0.060233 0.0581288
6 16 1654 1638 1091.85 1076 0.050697 0.0583385
7 16 1939 1923 1098.71 1140 0.063716 0.057964
8 16 2219 2203 1101.35 1120 0.055435 0.0579105
9 16 2497 2481 1102.52 1112 0.060413 0.0578282
10 16 2773 2757 1102.66 1104 0.051134 0.0578561
11 16 3049 3033 1102.77 1104 0.057742 0.0578803
12 16 3326 3310 1103.19 1108 0.053769 0.0578627
13 16 3604 3588 1103.86 1112 0.064574 0.0578453
14 16 3883 3867 1104.72 1116 0.056524 0.0578018
15 16 4162 4146 1105.46 1116 0.054581 0.0577626
16 16 4440 4424 1105.86 1112 0.079015 0.057758
17 16 4725 4709 1107.86 1140 0.043511 0.0576647
18 16 5007 4991 1108.97 1128 0.053005 0.0576147
19 16 5292 5276 1110.6 1140 0.069004 0.057538
2012-06-18 23:36:19.124472 min lat: 0.028568 max lat: 0.201941 avg lat: 0.0574953
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 5574 5558 1111.46 1128 0.048482 0.0574953
21 16 5861 5845 1113.18 1148 0.051923 0.0574146
22 16 6147 6131 1114.58 1144 0.04461 0.0573461
23 16 6438 6422 1116.72 1164 0.050383 0.0572406
24 16 6724 6708 1117.85 1144 0.067827 0.0571864
25 16 7008 6992 1118.57 1136 0.049128 0.057147
26 16 7296 7280 1119.85 1152 0.050331 0.0570879
27 16 7573 7557 1119.4 1108 0.052711 0.0571132
28 16 7858 7842 1120.13 1140 0.056369 0.0570764
29 16 8143 8127 1120.81 1140 0.046558 0.0570438
30 16 8431 8415 1121.85 1152 0.049958 0.0569942
Total time run: 30.045481
Total writes made: 8431
Write size: 4194304
Bandwidth (MB/sec): 1122.432
Stddev Bandwidth: 26.0451
Max bandwidth (MB/sec): 1164
Min bandwidth (MB/sec): 1032
Average Latency: 0.0570069
Stddev Latency: 0.0128039
Max latency: 0.235536
Min latency: 0.028568
-

I then created a custom pool called kvmpool.
~# ceph osd pool create kvmpool
pool 'kvmpool' created
But with this one I get slow and jumping values:
kvmpool ~# rados -p kvmpool bench 30 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 30 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 231 215 859.863 860 0.204867 0.069195
2 16 393 377 753.899 648 0.049444 0.0811933
3 16 535 519 691.908 568 0.232365 0.0899074
4 16 634 618 617.913 396 0.032758 0.0963399
5 16 806 790 631.913 688 0.075811 0.099529
6 16 948 932 621.249 568 0.156988 0.10179
7 16 1086 1070 611.348 552 0.036177 0.102064
8 16 1206 1190 594.922 480 0.028491 0.105235
9 16 1336 1320 586.589 520 0.041009 0.108735
10 16 1512 1496 598.32 704 0.258165 0.105086
11 16 1666 1650 599.921 616 0.040967 0.106146
12 15 1825 1810 603.255 640 0.198851 0.105463
13 16 1925 1909 587.309 396 0.042577 0.108449
14 16 2135 2119 605.352 840 0.035767 0.105219
15 16 2272 2256 601.523 548 0.246136 0.105357
16 16 2426 2410 602.424 616 0.19881 0.105692
17 16 2529 2513 591.22 412 0.031322 0.105463
18 16 2696 2680 595.48 668 0.028081 0.106749
19 16 2878 2862 602.449 728 0.044929 0.105856
2012-06-18 23:38:45.566094 min lat: 0.023295 max lat: 0.763797 avg lat: 0.105597
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 3041 3025 604.921 652 0.036028 0.105597
21 16 3182 3166 602.964 564 0.035072 0.104915
22 16 3349 3333 605.916 668 0.030493 0.105304
23 16 3512 3496 607.917 652 0.030523 0.10479
24 16 3668 3652 608.584 624 0.232933 0.10475
25 16 3821 3805 608.717 612 0.029881 0.104513
26 16 3963 3947 607.148 568 0.050244 0.10531
27 16 4112 4096 606.733 596 0.259069 0.105008
28 16 4261 4245 606.347 596 0.211877 0.105215
29 16 4437 4421 609.712 704 0.02802 0.104613
30 16 4566 4550 606.586 516 0.047076 0.105111
Total time run: 30.062141
Total writes made: 4566
Write size: 4194304
Bandwidth (MB/sec): 607.542
Stddev Bandwidth: 109.112
Max bandwidth (MB/sec): 860
Min bandwidth (MB/sec): 396
Average Latency: 0.10532
Stddev Latency: 0.108369
Max latency: 0.763797
Min latency: 0.023295
Why do these pools differ? Where is the
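The thread is cut off here, but the reply at the top points at PG count as a likely factor: a pool created without an explicit pg_num gets a small default (historically 8), far below the 30-50 PGs per OSD guideline, so writes concentrate on a few OSDs. A sketch of how one might create the pool with a more sensible PG count; the OSD count and resulting pg_num are hypothetical, not taken from the thread:

  # ~50 PGs per OSD on a hypothetical 24-OSD cluster => ~1200, rounded to 1024
  ceph osd pool create kvmpool 1024
  ceph osd pool get kvmpool pg_num        # verify the setting
  rados -p kvmpool bench 30 write -t 16   # re-run the same benchmark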