Re: Ceph performance improvement
Not sure what version of glibc Wheezy has, but try to make sure you have one that supports syncfs (you'll also need a semi-new kernel, 3.0+ should be fine). Hi, glibc from Wheezy doesn't have syncfs support. - Original message - From: Mark Nelson mark.nel...@inktank.com To: Denis Fondras c...@ledeuns.net Cc: ceph-devel@vger.kernel.org Sent: Wednesday 22 August 2012 14:35:28 Subject: Re: Ceph performance improvement On 08/22/2012 03:54 AM, Denis Fondras wrote: Hello all, Hello! David had some good comments in his reply, so I'll just add in a couple of extra thoughts... I'm currently testing Ceph. So far it seems that HA and recovery are very good. The only point that prevents me from using it at datacenter scale is performance. First of all, here is my setup : - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 - 4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 Not sure what version of glibc Wheezy has, but try to make sure you have one that supports syncfs (you'll also need a semi-new kernel, 3.0+ should be fine). (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal and 4x 3TB drives (Western Digital WD30EZRX). Everything but the boot partition is BTRFS-formatted and 4K-aligned. - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). Both servers are linked over a 1Gb Ethernet switch (iperf shows about 960Mb/s). Here is my ceph.conf :
--cut-here--
[global]
auth supported = cephx
keyring = /etc/ceph/keyring
journal dio = true
osd op threads = 24
osd disk threads = 24
filestore op threads = 6
filestore queue max ops = 24
osd client message size cap = 1400
ms dispatch throttle bytes = 1750
default values are quite a bit lower for most of these. You may want to play with them and see if it has an effect.
[mon]
mon data = /home/mon.$id
keyring = /etc/ceph/keyring.$name
[mon.a]
host = ceph-osd-0
mon addr = 192.168.0.132:6789
[mds]
keyring = /etc/ceph/keyring.$name
[mds.a]
host = ceph-osd-0
[osd]
osd data = /home/osd.$id
osd journal = /home/osd.$id.journal
osd journal size = 1000
keyring = /etc/ceph/keyring.$name
[osd.0]
host = ceph-osd-0
btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
btrfs options = rw,noatime
Just FYI, we are trying to get away from btrfs devs.
--cut-here--
Here are some figures :
* Test with dd on the OSD server (on drive /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s
Good job using a data file that is much bigger than main memory! That looks pretty accurate for a 7200rpm spinning disk. For dd benchmarks, you should probably throw in conv=fdatasync at the end though.
= iostat (on the OSD server) : avg-cpu: %user %nice %system %iowait %steal %idle 0,00 0,00 0,52 41,99 0,00 57,48 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sdf 247,00 0,00 125520,00 0 125520 * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD server (on drive /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) : # time tar xzf src.tar.gz real 0m9.669s user 0m8.405s sys 0m4.736s # time rm -rf * real 0m3.647s user 0m0.036s sys 0m3.552s = iostat (on the OSD server) : avg-cpu: %user %nice %system %iowait %steal %idle 10,83 0,00 28,72 16,62 0,00 43,83 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sdf 1369,00 0,00 9300,00 0 9300 * Test with dd from the client using RBD : # dd if=/dev/zero of=testdd bs=4k count=4M 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s RBD caching should definitely be enabled for a test like this. I'd be surprised if you got 42MB/s without it though... = iostat (on the OSD server) : avg-cpu: %user %nice %system %iowait %steal %idle 4,57 0,00 30,46 27,66 0,00 37,31 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 317,00 0,00 57400,00 0 57400 sdf 237,00 0,00 88336,00 0 88336 * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the client using RBD : # time tar xzf src.tar.gz real 0m26.955s user 0m9.233s sys 0m11.425s # time rm -rf * real 0m8.545s user 0m0.128s sys 0m8.297s = iostat (on the OSD server) : avg-cpu: %user %nice %system %iowait %steal %idle 4,59 0,00 24,74 30,61 0,00 40,05 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda 239,00 0,00 54772,00 0 54772 sdf 441,00 0,00 50836,00 0 50836 * Test with dd from the client using CephFS : # dd if=/dev/zero of=testdd bs=4k count=4M 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s = iostat (on the OSD server) : avg-cpu: %user %nice %system %iowait
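For comparison, the variant Mark suggests only differs by the final conv= flag, and enabling the librbd write-back cache is a one-line ceph.conf change. Both are shown below as a sketch for this 0.49-era setup; the file name and sizes are illustrative, and the cache option only affects librbd/QEMU clients, not the kernel rbd driver:
  # same benchmark, but dd only reports the rate after the data has been flushed to disk
  dd if=/dev/zero of=testdd bs=4k count=4M conv=fdatasync

  # hypothetical client-side addition to /etc/ceph/ceph.conf to turn on RBD caching
  [client]
      rbd cache = true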
Ideal hardware spec?
Hi All, Yes I'm asking the impossible question: what is the best hardware config. I'm looking at (possibly) using ceph as backing store for images and volumes on OpenStack as well as exposing at least the object store for direct use. The openstack cluster exists and is currently in the early stages of use by researchers here, approx 1500 vCPU (counting hyperthreads; actually 768 physical cores) and 3T of RAM across 64 physical nodes. On the object store side it would be a new resource for us and hard to say what people would do with it, except that it would be many different things and the use profile would be constantly changing (which is true of all our existing storage). In this sense, even though it's a private cloud, the somewhat unpredictable usage profile gives it some characteristics of a small public cloud. Size-wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes to end up with 20-30T of 3x replicated storage (call me paranoid). So the monitor specs seem relatively easy to come up with. For the OSDs it looks like http://ceph.com/docs/master/install/hardware-recommendations suggests 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage node). On-list discussions seem to frequently include an SSD for journaling (which is similar to what we do for our current ZFS-backed NFS storage). I'm hoping to wrap the hardware in a grant and am willing to experiment a bit with different software configurations to tune it up when/if I get the hardware in. So my immediate concern is a hardware spec that will have a reasonable processor:memory:disk ratio and opinions (or better data) on the utility of SSD. First, is the documented core-to-disk ratio still current best practice? Given a platform with more drive slots, could 8 cores handle more disks? Would that need/like more memory? Have SSDs been shown to speed performance with this architecture? If so, given the 8-drive-slot example with seven OSDs presented in the docs, how likely is it to work well to use a high performance SSD for the OS image and also cut journal/log partitions out of it for the remaining seven 2-3T near-line SAS drives? Thanks, -Jon
Re: Ideal hardware spec?
Hi, On 08/22/2012 03:55 PM, Jonathan Proulx wrote: Hi All, Yes I'm asking the impossible question: what is the best hardware config. I'm looking at (possibly) using ceph as backing store for images and volumes on OpenStack as well as exposing at least the object store for direct use. The openstack cluster exists and is currently in the early stages of use by researchers here, approx 1500 vCPU (counting hyperthreads; actually 768 physical cores) and 3T of RAM across 64 physical nodes. On the object store side it would be a new resource for us and hard to say what people would do with it, except that it would be many different things and the use profile would be constantly changing (which is true of all our existing storage). In this sense, even though it's a private cloud, the somewhat unpredictable usage profile gives it some characteristics of a small public cloud. Size-wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes to end up with 20-30T of 3x replicated storage (call me paranoid). I prefer 3x replication as well. I've seen the wrong OSDs die on me too often. So the monitor specs seem relatively easy to come up with. For the OSDs it looks like http://ceph.com/docs/master/install/hardware-recommendations suggests 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage node). On-list discussions seem to frequently include an SSD for journaling (which is similar to what we do for our current ZFS-backed NFS storage). I'm hoping to wrap the hardware in a grant and am willing to experiment a bit with different software configurations to tune it up when/if I get the hardware in. So my immediate concern is a hardware spec that will have a reasonable processor:memory:disk ratio and opinions (or better data) on the utility of SSD. First, is the documented core-to-disk ratio still current best practice? Given a platform with more drive slots, could 8 cores handle more disks? Would that need/like more memory? I'd still suggest about 2GB of RAM per OSD. The more RAM you have in the OSD machines, the more the kernel can buffer, which will always be a performance gain. You should however ask yourself whether you want a lot of OSDs per server, rather than going for smaller machines with fewer disks. For example:
- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD
Or:
- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs
Both will give you the same amount of storage, but the impact of losing one physical machine will be larger with the 2U machine. If you take 1TB disks you'd lose 8TB of storage; that is a lot of recovery to be done. Since btrfs (assuming you are going to use that) is still in development, it can't be ruled out that your machine goes down due to a kernel panic or other problems. My personal preference is having multiple small(er) machines rather than a couple of large machines. Have SSDs been shown to speed performance with this architecture? I've seen an improvement in performance indeed. Make sure however that you have a recent version of glibc with syncfs support. If so, given the 8-drive-slot example with seven OSDs presented in the docs, how likely is it to work well to use a high performance SSD for the OS image and also cut journal/log partitions out of it for the remaining seven 2-3T near-line SAS drives? You should make sure your SSD is capable of doing line-speed of your network. If you are connecting the machines with 4G trunks, make sure the SSD is capable of doing around 400MB/sec of sustained writes. 
I'd recommend the Intel 520 SSDs and change their available capacity with hdparm to about 20% of their original capacity. This way the SSD always has a lot of free cells available for writing. Reprogramming cells is expensive on an SSD. You can run the OS on the same SSD since that won't do that much I/O. I'd recommend not logging locally though, since that will also write to the same SSD. Try using remote syslog. You can also use the USB sticks[0] from Stec; they have server-grade onboard USB sticks for this kind of application. A couple of questions still need to be answered though: * Which OS are you planning on using? Ubuntu 12.04 is recommended * Which filesystem do you want to use underneath the OSDs? Wido [0]: http://www.stec-inc.com/product/ufm.php Thanks, -Jon
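Two quick checks related to the SSD advice above, with device names and numbers that are only placeholders. First, a rough sustained-write test to see whether a candidate journal SSD can keep up with the network (destructive, so run it only against an empty device); second, the hdparm host-protected-area trick for leaving spare cells free:
  # sustained sequential write straight to the device, bypassing the page cache
  dd if=/dev/zero of=/dev/sdX bs=1M count=8192 oflag=direct
  # compare the reported MB/s with the network, e.g. a 4Gbit trunk needs roughly 400-500MB/s

  # show current and native max sector counts, then shrink the visible capacity
  hdparm -N /dev/sdX
  hdparm -N p<new-sector-count> /dev/sdX   # recent hdparm versions also require --yes-i-know-what-i-am-doing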
Re: wip-crush
On 08/22/2012 03:10 AM, Sage Weil wrote: I pushed a branch that changes some of the crush terminology. Instead of having a crush type called pool that requires you to say things like pool=default in the ceph osd crush set ... command, it uses root instead. That hopefully reinforces that it is a tree/hierarchy. There is also a patch that changes bucket to node throughout, since bucket is a term also used by radosgw. Thoughts? I think the main pain in making this transition is that old clusters have maps that have a type 'pool' and new ones won't, and the docs will need to walk people through both... pool in a crushmap being completely unrelated to a RADOS pool is something that I've heard customers/users report as confusing, as well. So changing that is probably a good thing. Naming it root is probably a good choice as well, as it happens to match http://ceph.com/wiki/Custom_data_placement_with_CRUSH. As for changing bucket to node... a node is normally simply a physical server (at least in HA terminology, which many potential Ceph users will be familiar with), and CRUSH uses host for that. So that's another recipe for confusion. How about using something super-generic, like element or item? Cheers, Florian
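For context, this is the sort of command the rename would touch; the OSD id, weight and location below are made up, and the exact syntax may differ slightly between versions:
  # current syntax, where the top-level crush type is called 'pool'
  ceph osd crush set 0 osd.0 1.0 pool=default rack=unknownrack host=ceph-osd-0
  # with the proposed change the same location would be written root=default instead of pool=default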
Re: Ceph performance improvement
On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras c...@ledeuns.net wrote: Are you sure your osd data and journal are on the disks you think? The /home paths look suspicious -- especially for journal, which often should be a block device. I am :) ... -rw-r--r-- 1 root root 1048576000 août 22 17:22 /home/osd.0.journal Your journal is a file on a btrfs partition. That is probably a bad idea for performance. I'd recommend partitioning the drive and using partitions as journals directly.
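A minimal sketch of that suggestion, assuming the journal should live on the SSD at /dev/sda and the disk already has a partition table; device names and the ~1GB size are only examples:
  # carve out a dedicated ~1GB journal partition on the SSD
  parted -s /dev/sda mkpart primary 1MiB 1025MiB
  # then point the OSD at the raw partition instead of a file in /home;
  # with a block-device journal the 'osd journal size' setting should not be needed
  [osd.0]
      osd journal = /dev/sda1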
[GIT PULL] Ceph fixes for 3.6-rc3
Hi Linus, Please pull the following Ceph fixes for -rc3 from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus Jim's fix closes a narrow race introduced with the msgr changes. One fix resolves problems with debugfs initialization that Yan found when multiple client instances are created (e.g., two clusters mounted, or rbd + cephfs), another one fixes problems with mounting a nonexistent server subdirectory, and the last one fixes a divide by zero error from unsanitized ioctl input that Dan Carpenter found. Thanks! sage
Jim Schutt (1):
      libceph: avoid truncation due to racing banners
Sage Weil (3):
      libceph: delay debugfs initialization until we learn global_id
      ceph: tolerate (and warn on) extraneous dentry from mds
      ceph: avoid divide by zero in __validate_layout()
 fs/ceph/debugfs.c      |  1 +
 fs/ceph/inode.c        | 15 +
 fs/ceph/ioctl.c        |  3 +-
 net/ceph/ceph_common.c |  1 -
 net/ceph/debugfs.c     |  4 +++
 net/ceph/messenger.c   | 11 -
 net/ceph/mon_client.c  | 51 +++
 7 files changed, 72 insertions(+), 14 deletions(-)
Re: wip-crush
On Wed, Aug 22, 2012 at 9:33 AM, Sage Weil s...@inktank.com wrote: On Wed, 22 Aug 2012, Atchley, Scott wrote: On Aug 22, 2012, at 10:46 AM, Florian Haas wrote: On 08/22/2012 03:10 AM, Sage Weil wrote: I pushed a branch that changes some of the crush terminology. Instead of having a crush type called pool that requires you to say things like pool=default in the ceph osd crush set ... command, it uses root instead. That hopefully reinforces that it is a tree/hierarchy. There is also a patch that changes bucket to node throughout, since bucket is a term also used by radosgw. Thoughts? I think the main pain in making this transition is that old clusters have maps that have a type 'pool' and new ones won't, and the docs will need to walk people through both... pool in a crushmap being completely unrelated to a RADOS pool is something that I've heard customers/users report as confusing, as well. So changing that is probably a good thing. Naming it root is probably a good choice as well, as it happens to match http://ceph.com/wiki/Custom_data_placement_with_CRUSH. As for changing bucket to node... a node is normally simply a physical server (at least in HA terminology, which many potential Ceph users will be familiar with), and CRUSH uses host for that. So that's another recipe for confusion. How about using something super-generic, like element or item? Cheers, Florian My guess is that he is trying to use data structure tree nomenclature (root, node, leaf). I agree that node is an overloaded term (as is pool). Yeah... As for an alternative to bucket which indicates the item is a collection, what about subtree or branch? I think fixing the overloading of 'pool' in the default crush map is the biggest pain point. I can live with crush 'buckets' staying the same (esp since that's what the papers and code use pervasively) if we can't come up with a better option. I'm definitely most interested in replacing pool, and root works for that in my mind. RGW buckets live at a sufficiently different level that I think people are unlikely to be confused — and bucket is actually a good name for what they are (I'm open to better ones, but I don't think that node qualifies). On the pool part, though, the challenge is how to transition. Existing clusters have maps that use 'pool', and new clusters will use 'root' (or whatever). Some options: - document both. this kills much of the benefit of switching, but is probably inevitable since people will be running different versions. - make the upgrade process transparently rename the type. this lets all the tools use the new names. - make the tools silently translate old names to new names. this is kludgey in that it makes the code make assumptions about the names of the data it is working with, but would cover everyone except those who created their own crush maps from scratch. - ? I would go with option two, and only document the new options — I wouldn't be surprised if the number of people who had changed those was zero. Anybody who has done so can certainly be counted on to pay enough attention that a line note changed CRUSH names (see here if you customized your map) would be sufficient, right? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
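For anyone who did hand-build a map, the transition boils down to a decompile/edit/recompile round trip, roughly like this (file names are arbitrary):
  # export and decompile the current crush map
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # edit crushmap.txt (e.g. rename the type 'pool' to 'root'), then recompile and inject it
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new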
Re: SimpleMessenger dispatching: cause of performance problems?
What rbd block size were you using? -Sam On Tue, Aug 21, 2012 at 10:29 PM, Andreas Bluemle andreas.blue...@itxperts.de wrote: Hi, Samuel Just wrote: Was the cluster complete healthy at the time that those traces were taken? If there were osds going in/out/up/down, it would trigger osdmap updates which would tend to hold the osd_lock for an extended period of time. The cluster was completely healthy. v0.50 included some changes that drastically reduce the purview of osd_lock. In particular, pg op handling no longer grabs the osd_lock and handle_osd_map can proceed independently of the pg worker threads. Trying that might be interesting. I'll grab v0.50 and take a look. -Sam On Tue, Aug 21, 2012 at 12:20 PM, Sage Weil s...@inktank.com wrote: On Tue, 21 Aug 2012, Sage Weil wrote: On Tue, 21 Aug 2012, Andreas Bluemle wrote: Hi Sage, as mentioned, the workload is a single sequential write on the client. The write is not O_DIRECT; and consequently the messages arrive at the OSD with 124 KByte per write request. The attached pdf shows a timing diagram of two concurrent write operations (one primary and one replication / secondary). The time spent on the primary write to get the OSD.:osd_lock releates nicely with the time when this lock is released by the secondary write. Looking again at this diagram, I'm a bit confused. Is the Y access the thread id or something? And the X axis is time in seconds? X-Axis is time, Y Axis is absolute offset of the write request on the rados block device. The big question for me is what on earth the secondary write (or primary, for that matter) is doing with osd_lock for a full 3 ms... If my reading of the units is correct, *that* is the real problem. It shouldn't be doing anything that takes that long. The exception is osdmap handling, which can do more work, but request processing should be very fast. Thanks- sage Ah, I see. There isn't a trivial way to pull osd_lock out of the picture; there are several data structures it's protecting (pg_map, osdmaps, peer epoch map, etc.). Before we try going down that road, I suspect it might be more fruitful to see where cpu time is being spent while osd_lock is held. How much of an issue does it look like this specific contention is for you? Does it go away with larger writes? sage Hope this helps Andreas Sage Weil wrote: On Mon, 20 Aug 2012, Andreas Bluemle wrote: Hi Sage, Sage Weil wrote: Hi Andreas, On Thu, 16 Aug 2012, Andreas Bluemle wrote: Hi, I have been trying to migrate a ceph cluster (ceph-0.48argonaut) to a high speed cluster network and encounter scalability problems: the overall performance of the ceph cluster does not scale well with an increase in the underlying networking speed. In short: I believe that the dispatching from SimpleMessenger to OSD worker queues causes that scalability issue. Question: is it possible that this dispatching is causing performance problems? There is a single 'dispatch' thread that's processing this queue, and conveniently perf lets you break down its profiling data on a per-thread basis. Once you've ruled out the throttler as the culprit, you might try running the daemon with 'perf record -g -- ceph-osd ...' and then look specifically at where that thread is spending its time. We shouldn't be burning that much CPU just doing the sanity checks and then handing requests off to PGs... sage The effect, which I am seeing, may be related to some locking issue. As I read the code, there are multiple dispatchers running: one per SimpleMessenger. 
On a typical OSD node, there is - the instance of the SimpleMessenger processing input from the client (primary writes) - other instances of SimpleMessenger, which process input from neighbor OSD nodes the latter generate replication writes to the OSD I am looking at. On the other hand, there is a single instance of the OSD object within the ceph-osd daemon. When dispatching messages to the OSD, then the OSD::osd_lock is held for the complete process of dispatching; see code below. When the write load increases, then multiple SimpleMessenger instances start to congest on the OSD::osd_lock. And this may cause delays in the individual dispatch process. This is definitely possible, yes, although it would surprise me if it's happening here (unless your workload is all small writes). Just to confirm, are you actually observing osd_lock contention, or speculating about what is causing the delays you're seeing? I'm not sure what the best tool is to measure lock contention. Mark was playing with a 'poor man's wall clock profiler' using stack trace sampling from gdb. That would tell us whether threads were really blocking while obtaining the osd_lock... Can you tell us a bit more about what your workload is? sage bool OSD::ms_dispatch(Message *m)
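A sketch of the per-thread profiling Sage mentions above, assuming a single ceph-osd per node (otherwise substitute the right PID) and a 60-second sample window:
  # record call graphs from the running OSD for a minute
  perf record -g -p $(pidof ceph-osd) -- sleep 60
  # break the samples down per thread so the dispatch thread can be inspected on its own
  perf report --sort pid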
Re: Ceph performance improvement / journal on block-dev
On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote: (...) Your journal is a file on a btrfs partition. That is probably a bad idea for performance. I'd recommend partitioning the drive and using partitions as journals directly. Hi Tommi, can you please teach me how to use the right parameter(s) to realize 'journal on block-dev' ? It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs' (see below) Regards, -Dieter e.g. ---snip--- modprobe -v brd rd_nr=6 rd_size=1000# 6x 10G RAM DISK /etc/ceph/ceph.conf -- [global] auth supported = none # set log file log file = /ceph/log/$name.log log_to_syslog = true# uncomment this line to log to syslog # set up pid files pid file = /var/run/ceph/$name.pid [mon] mon data = /ceph/$name debug optracker = 0 [mon.alpha] host = 127.0.0.1 mon addr = 127.0.0.1:6789 [mds] debug optracker = 0 [mds.0] host = 127.0.0.1 [osd] osd data = /data/$name [osd.0] host = 127.0.0.1 btrfs devs = /dev/ram0 osd journal = /dev/ram3 [osd.1] host = 127.0.0.1 btrfs devs = /dev/ram1 osd journal = /dev/ram4 [osd.2] host = 127.0.0.1 btrfs devs = /dev/ram2 osd journal = /dev/ram5 -- root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs temp dir is /tmp/mkcephfs.wzARGSpFB6 preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap /usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print /tmp/mkcephfs.wzARGSpFB6/monmap /usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap /usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de epoch 0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de last_changed 2012-08-22 21:04:00.553972 created 2012-08-22 21:04:00.553972 0: 127.0.0.1:6789/0 mon.alpha /usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 monitors) === osd.0 === pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005 umount: /data/osd.0: not mounted umount: /dev/ram0: not mounted Btrfs v0.19.1+ ATTENTION: mkfs.btrfs is not intended to be used directly. Please use the YaST partitioner to create and manage btrfs filesystems to be in a supported state on SUSE Linux Enterprise systems. fs created label (null) on /dev/ram0 nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB Scanning for Btrfs filesystems HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory 2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de creating private key for osd.0 keyring /data/osd.0/keyring creating /data/osd.0/keyring collecting osd.0 key === osd.1 === pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005 umount: /data/osd.1: not mounted (...) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD crash
On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during heavy test a pair of osds and one mon died, resulting to hard lockup of some kvm processes - they went unresponsible and was killed leaving zombie processes ([kvm] defunct). Entire cluster contain sixteen osd on eight nodes and three mons, on first and last node and on vm outside cluster. osd bt: #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 (gdb) bt #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4 #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246 #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at /usr/include/c++/4.7/bits/basic_string.h:536 #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out) at /usr/include/c++/4.7/sstream:60 #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439 #7 pretty_version_to_str () at common/version.cc:40 #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19 #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91 #10 signal handler called #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #16 0x7fc37d1c47c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #17 0x7fc37d1c49ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c 0 == \unexpected error\, file=optimized out, line=3007, func=0x90ef80 unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int)) at common/assert.cc:77 This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.) What happens if you restart that ceph-osd daemon? sage #19 0x0073148f in FileStore::_do_transaction (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545, trans_num=trans_num@entry=0) at os/FileStore.cc:3007 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000, tls=..., op_seq=429545) at os/FileStore.cc:2436 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000, osr=optimized out) at os/FileStore.cc:2259 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at common/WorkQueue.cc:54 #23 0x006823ed in ThreadPool::WorkThread::entry (this=optimized out) at ./common/WorkQueue.h:126 #24 0x7fc37e3eee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #26 0x in ?? 
() mon bt was exactly the same as in http://tracker.newdream.net/issues/2762
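The checks Sage suggests would look roughly like this; the OSD id and log path are examples and depend on the local setup:
  # look for the errno behind the 'unexpected error' assert
  tail -n 200 /var/log/ceph/osd.3.log | grep -i error
  # then try restarting just that daemon and watch whether the assert reappears
  /etc/init.d/ceph restart osd.3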
Re: OSD crash
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote: On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during heavy test a pair of osds and one mon died, resulting to hard lockup of some kvm processes - they went unresponsible and was killed leaving zombie processes ([kvm] defunct). Entire cluster contain sixteen osd on eight nodes and three mons, on first and last node and on vm outside cluster. osd bt: #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 (gdb) bt #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4 #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246 #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at /usr/include/c++/4.7/bits/basic_string.h:536 #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out) at /usr/include/c++/4.7/sstream:60 #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439 #7 pretty_version_to_str () at common/version.cc:40 #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19 #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91 #10 signal handler called #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #16 0x7fc37d1c47c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #17 0x7fc37d1c49ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c 0 == \unexpected error\, file=optimized out, line=3007, func=0x90ef80 unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int)) at common/assert.cc:77 This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.) What happens if you restart that ceph-osd daemon? sage Unfortunately I have completely disabled logs during test, so there are no suggestion of assert_fail. The main problem was revealed - created VMs was pointed to one monitor instead set of three, so there may be some unusual things(btw, crashed mon isn`t one from above, but a neighbor of crashed osds on first node). After IPMI reset node returns back well and cluster behavior seems to be okay - stuck kvm I/O somehow prevented even other module load|unload on this node, so I finally decided to do hard reset. Despite I`m using almost generic wheezy, glibc was updated to 2.15, may be because of this my trace appears first time ever. I`m almost sure that fs does not triggered this crash and mainly suspecting stuck kvm processes. 
I`ll rerun test with same conditions tomorrow(~500 vms pointed to one mon and very high I/O, but with osd logging). #19 0x0073148f in FileStore::_do_transaction (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545, trans_num=trans_num@entry=0) at os/FileStore.cc:3007 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000, tls=..., op_seq=429545) at os/FileStore.cc:2436 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000, osr=optimized out) at os/FileStore.cc:2259 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at common/WorkQueue.cc:54 #23 0x006823ed in ThreadPool::WorkThread::entry (this=optimized out) at ./common/WorkQueue.h:126 #24 0x7fc37e3eee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #26 0x in ?? () mon bt was exactly the same as in http://tracker.newdream.net/issues/2762 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info
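For the rerun with logging enabled, an [osd] section along these lines would capture the failing transaction; the levels shown are common debugging values, not requirements:
  [osd]
      debug osd = 20
      debug filestore = 20
      debug journal = 20
      debug ms = 1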
Re: Ceph performance improvement / journal on block-dev
On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD) d.kas...@kabelmail.de wrote: Your journal is a file on a btrfs partition. That is probably a bad idea for performance. I'd recommend partitioning the drive and using partitions as journals directly. can you please teach me how to use the right parameter(s) to realize 'journal on block-dev' ? Replacing the example paths, use sudo parted /dev/sdg or gksu gparted /dev/sdg, create partitions, set osd journal to point to a block device for a partition. [osd.42] osd journal = /dev/sdg4 It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs' (see below) Try running it with -x for any chance of extracting debuggable information from the monster. Scanning for Btrfs filesystems HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal Based on that, my best guess would be that you're seeing a journal from an old run -- perhaps you need to explicitly clear out the block device contents.. Frankly, you should not use btrfs devs. Any convenience you may gain is more than doubly offset by pains exactly like these. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
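Clearing the stale journal header before re-running mkcephfs can be as simple as zeroing the start of the device; for the RAM-disk setup above that would be something like:
  # overwrite the old journal signature so the fsid check doesn't pick up a previous run
  dd if=/dev/zero of=/dev/ram3 bs=1M count=16
  # repeat for /dev/ram4 and /dev/ram5, then run mkcephfs again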
Re: OSD crash
The tcmalloc backtrace on the OSD suggests this may be unrelated, but what's the fd limit on your monitor process? You may be approaching that limit if you've got 500 OSDs and a similar number of clients. On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote: On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during heavy test a pair of osds and one mon died, resulting to hard lockup of some kvm processes - they went unresponsible and was killed leaving zombie processes ([kvm] defunct). Entire cluster contain sixteen osd on eight nodes and three mons, on first and last node and on vm outside cluster. osd bt: #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 (gdb) bt #0 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #1 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #2 0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4 #3 0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246 #4 ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at /usr/include/c++/4.7/bits/basic_string.h:536 #5 ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out) at /usr/include/c++/4.7/sstream:60 #6 ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439 #7 pretty_version_to_str () at common/version.cc:40 #8 0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19 #9 0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91 #10 signal handler called #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /usr/lib/libtcmalloc.so.4 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #16 0x7fc37d1c47c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #17 0x7fc37d1c49ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c 0 == \unexpected error\, file=optimized out, line=3007, func=0x90ef80 unsigned int FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int)) at common/assert.cc:77 This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.) What happens if you restart that ceph-osd daemon? sage Unfortunately I have completely disabled logs during test, so there are no suggestion of assert_fail. The main problem was revealed - created VMs was pointed to one monitor instead set of three, so there may be some unusual things(btw, crashed mon isn`t one from above, but a neighbor of crashed osds on first node). 
After IPMI reset node returns back well and cluster behavior seems to be okay - stuck kvm I/O somehow prevented even other module load|unload on this node, so I finally decided to do hard reset. Despite I`m using almost generic wheezy, glibc was updated to 2.15, may be because of this my trace appears first time ever. I`m almost sure that fs does not triggered this crash and mainly suspecting stuck kvm processes. I`ll rerun test with same conditions tomorrow(~500 vms pointed to one mon and very high I/O, but with osd logging). #19 0x0073148f in FileStore::_do_transaction (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545, trans_num=trans_num@entry=0) at os/FileStore.cc:3007 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000, tls=..., op_seq=429545) at os/FileStore.cc:2436 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000, osr=optimized out) at os/FileStore.cc:2259 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at common/WorkQueue.cc:54 #23 0x006823ed in ThreadPool::WorkThread::entry (this=optimized out) at ./common/WorkQueue.h:126 #24 0x7fc37e3eee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6 #26 0x in ?? () mon bt was exactly the same as in http://tracker.newdream.net/issues/2762 -- To unsubscribe from this list: send the line
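Checking the monitor against its descriptor limit is quick, and the limit the daemons set for themselves can be raised from ceph.conf; the PID lookup and the number below are only examples:
  # current limit and current usage for the running monitor
  grep 'open files' /proc/$(pidof ceph-mon)/limits
  ls /proc/$(pidof ceph-mon)/fd | wc -l
  # raise it for all ceph daemons via the [global] section of ceph.conf
  max open files = 131072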
Re: Ceph performance improvement
On 22/08/12 22:24, David McBride wrote: On 22/08/12 09:54, Denis Fondras wrote: * Test with dd from the client using CephFS : # dd if=/dev/zero of=testdd bs=4k count=4M 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s Again, the synchronous nature of 'dd' is probably severely affecting apparent performance. I'd suggest looking at some other tools, like fio, bonnie++, or iozone, which might generate more representative load. (Or, if you have a specific use-case in mind, something that generates an IO pattern like what you'll be using in production would be ideal!) Appending conv=fsync to the dd will make the comparison fair enough. Looking at the ceph code, it does sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE); which is very fast - way faster than fdatasync() and friends (I have tested this ... see prev posting on random write performance with file writetest.c attached). I am not convinced that these sorts of tests are in any way 'unfair' - for instance I would like to use rbd for postgres or mysql data volumes... and many database actions involve a stream of block writes similar enough to doing dd (e.g. bulk row loads, appends to transaction log journals). Cheers Mark
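If going the fio route for a more database-like pattern, a small-block random-write run might look like the following; every parameter here is illustrative and should be adjusted to the workload being modelled:
  # 4k random writes with direct I/O against a directory on the RBD/CephFS mount
  fio --name=randwrite --directory=/mnt/rbd --rw=randwrite --bs=4k --size=4g \
      --ioengine=libaio --direct=1 --iodepth=16 --runtime=60 --time_based --group_reporting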