Re: Mysteriously poor write performance
It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (e.g., KVM). If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 (~8MB).

What options are you running dd with? If you run a rados bench from both machines, what do the results look like? Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)

-Greg

On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
> More strangely, write speed drops by fifteen percent when this option is set in the VM's config (unlike the result in http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). As I mentioned, I'm using 0.43, but because of crashing OSDs, ceph has been recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes under heavy load.
>
> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil s...@newdream.net wrote:
>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>> Hi,
>>>
>>> I've done some performance tests on the following configuration:
>>>
>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G RAM; mon2 - dom0 with three dedicated cores and 1.5G, mostly idle. The first three disks on each r410 are arranged into raid0 and hold the osd data, while the fourth holds the OS and the osd journal partition; all ceph-related stuff is mounted on ext4 without barriers.
>>>
>>> First, I noticed a difference between benchmark performance and write speed through rbd from a small kvm instance running on one of the first two machines: when bench gave me about 110 MB/s, writing zeros to the raw block device inside the VM with dd topped out at about 45 MB/s, and for the VM's fs (ext4 with default options) performance drops to ~23 MB/s. Things got worse when I started a second VM on the second host and tried to continue the same dd tests simultaneously - performance was fairly divided in half for each instance :). Enabling jumbo frames, playing with cpu affinity for ceph and vm instances and trying different TCP congestion protocols had no effect at all - with DCTCP I get a slightly smoother network load graph and that's all.
>>>
>>> Can the list please suggest anything to try to improve performance?
>>
>> Can you try setting
>>
>>   rbd writeback window = 8192000
>>
>> or similar, and see what kind of effect that has? I suspect it'll speed up dd; I'm less sure about ext3.
>>
>> Thanks!
>> sage
>>
>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
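As a quick sanity check on the window sizes discussed in this message, here is a minimal Python sketch that converts a byte count into decimal MB and binary MiB. The helper names are made up for illustration, and reading "64M" as 64 * 1024 * 1024 bytes is an assumption; the thread only gives the other two values in bytes.

```python
# Convert candidate "rbd writeback window" values (given in bytes) to readable sizes.
# The function names are illustrative and not part of any Ceph tool.

def to_mb(nbytes: int) -> float:
    """Decimal megabytes (10**6 bytes), the unit dd reports."""
    return nbytes / 1_000_000


def to_mib(nbytes: int) -> float:
    """Binary mebibytes (2**20 bytes)."""
    return nbytes / (1024 * 1024)


if __name__ == "__main__":
    # 64 * 1024 * 1024 is an assumed reading of "64M".
    for value in (8192000, 81920000, 64 * 1024 * 1024):
        print(f"{value:>9} bytes = {to_mb(value):6.2f} MB = {to_mib(value):6.2f} MiB")
```

So 8192000 is roughly 8 MB, and a window of roughly 80 MB needs one more digit.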
Re: Ceph mon crash
On Monday, March 19, 2012 at 7:33 AM, ruslan usifov wrote:
> Hello
>
> I have the following stack trace:
>
> #0 0xb77fa424 in __kernel_vsyscall ()
> (gdb) bt
> #0 0xb77fa424 in __kernel_vsyscall ()
> #1 0xb77e98a0 in raise () from /lib/i386-linux-gnu/libpthread.so.0
> #2 0x08230f8b in ?? ()
> #3 signal handler called
> #4 0xb77fa424 in __kernel_vsyscall ()
> #5 0xb70eae71 in raise () from /lib/i386-linux-gnu/libc.so.6
> #6 0xb70ee34e in abort () from /lib/i386-linux-gnu/libc.so.6
> #7 0xb73130b5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/i386-linux-gnu/libstdc++.so.6
> #8 0xb7310fa5 in ?? () from /usr/lib/i386-linux-gnu/libstdc++.so.6
> #9 0xb7310fe2 in std::terminate() () from /usr/lib/i386-linux-gnu/libstdc++.so.6
> #10 0xb731114e in __cxa_throw () from /usr/lib/i386-linux-gnu/libstdc++.so.6
> #11 0x0822f8c7 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) ()
> #12 0x081cf8a4 in MDSMap::get_health(std::basic_ostream<char, std::char_traits<char> >&) const ()
> #13 0x0811e8a7 in MDSMonitor::get_health(std::basic_ostream<char, std::char_traits<char> >&) const ()
> #14 0x080c4977 in Monitor::handle_command(MMonCommand*) ()
> #15 0x080cf244 in Monitor::_ms_dispatch(Message*) ()
> #16 0x080df1a4 in Monitor::ms_dispatch(Message*) ()
> #17 0x081f706d in SimpleMessenger::dispatch_entry() ()
> #18 0x080b27d2 in SimpleMessenger::DispatchThread::entry() ()
> #19 0x081b5d81 in Thread::_entry_func(void*) ()
> #20 0xb77e0e99 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
> #21 0xb71919ee in clone () from /lib/i386-linux-gnu/libc.so.6

Can you get the line number from frame 12? (Type "f 12", press Enter, then just paste the output.) Also the output of "ceph -s" if things are still running.

The only assert I see in get_health() is that each up MDS be in mds_info, which really ought to be true...

> And when one mon crashes, all the other monitors in the cluster crash too :-((. So at some point there are no live mons left in the cluster.

Yeah, this is because the crash is being triggered by a get_health command, and it's trying it out on each monitor in turn as they fail.

-Greg
Re: Mysteriously poor write performance
Nope, I'm using KVM for rbd guests. Of course, I noticed that Sage mentioned too small a value, and I had changed it to 64M before posting my previous message, with no success - both 8M and this value cause a performance drop. When I tried to write a small amount of data, comparable to the writeback cache size (both on the raw device and on ext3 with the sync option), I got the following results (almost the same without oflag, here and in the following samples):

dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s

dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s

dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

and so on. The reference test with bs=1M and count=2000 has slightly worse results _with_ the writeback cache than without, as I've mentioned before.

Here are the bench results; they're almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

Also, because I've not mentioned it before: network performance is good enough to hold fair gigabit connectivity with MTU 1500. It does not seem to be an interrupt problem or anything like it - even with ceph-osd, the ethernet card queues and the kvm instance pinned to different sets of cores, nothing changes.
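One way to read the three dd runs above is to look at the marginal rate, i.e. how fast each additional 100 MB went once the first 100 MB had been absorbed. The sketch below does that arithmetic in Python. It assumes the runs are comparable, so that subtracting one run from the next is meaningful. That is an assumption, since they were separate dd invocations. The function names are illustrative only.

```python
# (bytes copied, elapsed seconds) for the three dd runs quoted in the message.
RUNS = [
    (104857600, 0.864404),   # bs=10M count=10
    (209715200, 6.67271),    # bs=10M count=20
    (314572800, 12.4806),    # bs=10M count=30
]


def mb_per_s(nbytes: float, seconds: float) -> float:
    """Throughput in decimal MB/s, the unit dd prints."""
    return nbytes / seconds / 1_000_000


def marginal_rates(runs):
    """Rate of each extra chunk of data, assuming consecutive runs are comparable."""
    return [
        mb_per_s(b1 - b0, t1 - t0)
        for (b0, t0), (b1, t1) in zip(runs, runs[1:])
    ]


if __name__ == "__main__":
    for nbytes, seconds in RUNS:
        print(f"overall:  {mb_per_s(nbytes, seconds):6.1f} MB/s")
    for rate in marginal_rates(RUNS):
        print(f"marginal: {rate:6.1f} MB/s")
```

Under that assumption, the first 100 MB completes at about 121 MB/s and every further 100 MB at about 18 MB/s, which would be consistent with a buffer absorbing the initial burst and a much slower steady state behind it.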
Re: Mysteriously poor write performance
On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
> Nope, I'm using KVM for rbd guests. Of course, I noticed that Sage mentioned too small a value, and I had changed it to 64M before posting my previous message, with no success - both 8M and this value cause a performance drop. When I tried to write a small amount of data, comparable to the writeback cache size (both on the raw device and on ext3 with the sync option), I got the following results:

I just want to clarify that the writeback window isn't a full writeback cache - it doesn't affect reads, and does not help with request merging etc. It simply allows a bunch of writes to be in flight while acking the write to the guest immediately. We're working on a full-fledged writeback cache to replace the writeback window.

> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost the same without oflag, here and in the following samples)
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>
> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
> 20+0 records in
> 20+0 records out
> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>
> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
> 30+0 records in
> 30+0 records out
> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>
> and so on. The reference test with bs=1M and count=2000 has slightly worse results _with_ the writeback cache than without, as I've mentioned before.
>
> Here are the bench results; they're almost equal on both nodes:
>
> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

One thing to check is the size of the writes that are actually being sent by rbd. The guest is probably splitting them into relatively small (128 or 256k) writes. Ideally it would be sending 4M writes, and this should be a lot faster.

You can see the writes being sent by adding debug_ms=1 to the client or osd. The format is osd_op(.*[write OFFSET~LENGTH]).
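To follow up on the debug_ms=1 suggestion, here is a minimal Python sketch that tallies write sizes from log lines. Only the `osd_op(.*[write OFFSET~LENGTH])` fragment comes from the message above. The sample lines, client ids and object names are invented for illustration, and real log lines will carry more fields.

```python
import re
from collections import Counter

# Matches the fragment described in the message: osd_op(... [write OFFSET~LENGTH])
WRITE_RE = re.compile(r"osd_op\(.*?\[write (\d+)~(\d+)\]")


def write_sizes(lines):
    """Count how many writes of each LENGTH (in bytes) appear in the log lines."""
    sizes = Counter()
    for line in lines:
        match = WRITE_RE.search(line)
        if match:
            sizes[int(match.group(2))] += 1
    return sizes


# Hypothetical sample lines; only the "[write OFFSET~LENGTH]" shape is from the thread.
SAMPLE = [
    "... osd_op(client.4100.0:12 rb.0.0.000000000000 [write 0~131072]) ...",
    "... osd_op(client.4100.0:13 rb.0.0.000000000000 [write 131072~131072]) ...",
    "... osd_op(client.4100.0:14 rb.0.0.000000000001 [write 0~4194304]) ...",
    "... osd_op_reply(12 rb.0.0.000000000000 ...) ...",
]

if __name__ == "__main__":
    for length, count in sorted(write_sizes(SAMPLE).items()):
        print(f"{length // 1024:>5} KiB writes: {count}")
```

A histogram dominated by 128 KiB or 256 KiB entries would point at the guest splitting requests, as suggested above.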
Re: Ceph mon crash
2012/3/19 Greg Farnum gregory.far...@dreamhost.com:
> Can you get the line number from frame 12? (Type "f 12", press Enter, then just paste the output.) Also the output of "ceph -s" if things are still running.
>
> The only assert I see in get_health() is that each up MDS be in mds_info, which really ought to be true...

Sorry, but no; I use precompiled binaries from http://ceph.newdream.net/debian. Perhaps this helps: initially I configured all the ceph services (mon, mds, osd), but then I tested only rbd and removed all mds from the cluster (3 vmware machines) with the following command:

ceph mds rm 1

(I am writing this line from memory, so the syntax may be wrong.)
Logging braindump
So, we've been talking about an in-memory buffer that would contain debug-level messages, and a separate thread/process [1] that would write a subset of these to disk. Thus, on crash, we'd have a lot of detail available (the ring buffer entries not yet overwritten), without paying the IO/serialization/storage cost of dumping it all out to disk.

[1] e.g. varnish uses an mmaped file for a ringbuffer and consumes it from a different process; that way, the log consumer cannot corrupt the server process memory. Of course, this means log messages cannot contain pointers to indirect data. And that means the buffer is not made of constant-size entries, like Disruptor tends to assume.. though I think you could view Disruptor sequence numbers as byte offsets, if you wanted.

RING BUFFER

For the ring buffer, we've been looking at a Disruptor[2]-style "consumer tries to catch up with a sequence number from producer" design. As we have multiple producers (threads creating log entries), the cost of creating a log entry would be a single cmpxchg op, and then whatever work is needed to lay out the event in the ringbuffer.

[2] http://martinfowler.com/articles/lmax.html
    http://code.google.com/p/disruptor/

The in-memory data format could just use whatever data format is most convenient. The ringbuffer could be an array of tiny structs with the base fields like thread id embedded there, and pointers to separately allocated data for items that aren't always present. But this means we need to be very careful about memory management; we want the data pointed to to stay alive and unmodified until the producer loops around the ringbuffer.

Alternatively, interpret Disruptor sequence numbers as byte offsets, serialize the message first, and allocate that much space from the ring buffer (still just one cmpxchg). This pushes more of the work to the producer of log messages, but avoids having an intermediate data format that needs to be converted to another format, and simplifies memory management tremendously.
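The "sequence numbers as byte offsets" alternative can be sketched roughly as follows. This is a toy Python model under stated assumptions, not a proposed implementation: Python has no cmpxchg, so a lock stands in for the single atomic reservation step, and the class name, the 4-byte length prefix and the read helper are all invented for illustration. A real version would also need a way for the consumer to know that a reserved record has been fully written.

```python
import struct
import threading


class ByteRing:
    """Toy ring buffer where the sequence number is a byte offset."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0                    # next free byte; grows forever, wraps via modulo
        self._lock = threading.Lock()    # stand-in for one cmpxchg on `head`

    def _reserve(self, nbytes: int) -> int:
        # The only contended step: claim [head, head + nbytes) for this producer.
        with self._lock:
            start = self.head
            self.head += nbytes
        return start

    def _copy_in(self, seq: int, data: bytes) -> None:
        pos = seq % self.size
        first = min(len(data), self.size - pos)
        self.buf[pos:pos + first] = data[:first]
        self.buf[0:len(data) - first] = data[first:]   # wrapped tail, possibly empty

    def _copy_out(self, seq: int, nbytes: int) -> bytes:
        pos = seq % self.size
        first = min(nbytes, self.size - pos)
        return bytes(self.buf[pos:pos + first] + self.buf[0:nbytes - first])

    def append(self, payload: bytes) -> int:
        """Serialize first, then reserve exactly that much space; returns the sequence."""
        record = struct.pack("<I", len(payload)) + payload
        if len(record) > self.size:
            raise ValueError("record larger than the ring")
        seq = self._reserve(len(record))
        self._copy_in(seq, record)
        return seq

    def read(self, seq: int) -> bytes:
        """Read the record at `seq`; only valid until the producers lap it."""
        (length,) = struct.unpack("<I", self._copy_out(seq, 4))
        return self._copy_out(seq + 4, length)
```

Since each record is serialized before space is reserved, nothing in the ring points at separately allocated memory, which is the memory-management simplification described above.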
DISK FORMAT

The process writing the events to disk should be more focused on usefulness and longevity of the data. If the ring buffer is just arrays with pointers, here we should take the performance hit to convert to one of the known formats.

I feel strongly in favor of structured logging, as parsing a bazillion log entries is slow, and maintaining the parsing rules is actually harder than structured logging in the first place.

The status quo is hoping to improve syslog, but there's so much Enterprise in this stuff that I'm not holding my breath.. http://lwn.net/Articles/484731/ . Work that has come out includes the structured syslog format below, CEE querying further down.

Some candidates:

- Scribe and Flume are pre-existing log collectors that emphasize a DAG of log flow, lots of Java everywhere.. I'm not thrilled.
  https://github.com/facebook/scribe
  https://cwiki.apache.org/FLUME/

- journald: I'm just going to pretend it doesn't exist, at least for 2 years:
  http://blog.gerhards.net/2011/11/journald-and-rsyslog.html

- syslog's structured logging extension:
  http://tools.ietf.org/html/rfc5424#section-6.5
  essentially, [key="value" key2="val2"] MESSAGE

  <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] BOMAn application event log entry...

- JSON: http://json.org/

  {"key": "value", "key2": "val2", "message": "MESSAGE"}

  the convention is lines of json, separated by newlines, each line is a full event
  biggest downside is numbers are always floats (need to stringify large numbers), binary data must be encoded somehow into a utf-8 string (base64 is most common).
- GELF: compressed JSON with specific fields:
  https://github.com/Graylog2/graylog2-docs/wiki/GELF

- Google Protocol Buffers: considered clumsy these days (code generation from IDL etc); only Google has significant investment in the format

- Thrift: considered clumsy these days (code generation from IDL etc); only Facebook has significant investment in the format

- BSON: sort of close to a binary encoding of JSON + extra data types, not a huge improvement in speed/space..
  http://bsonspec.org/

- Avro: Apache-sponsored data format, nicely self-describing, apparently slow?
  http://avro.apache.org/

- MessagePack: binary encoding for JSON, claims to beat others in speed..
  http://msgpack.org/

And all of these can be compressed with e.g. Snappy as they flow to disk.
http://code.google.com/p/snappy/

Downside of all but JSON: we'd need to bundle the library -- distro support just isn't there yet.

Should the disk format be binary? That makes it less friendly to the admin. I'm not sure which way to go. JSON is simpler and friendlier; MessagePack, for example, has an identical data model but is faster and takes less space. Some options:

a. make it configurable so simple installations don't need to suffer binary logs
b. just pick one and stick with it

QUERYING / ANALYSIS

- use a format
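For the JSON candidate in the DISK FORMAT list, the "one event per line" convention and the two workarounds mentioned there (stringify large numbers, base64 for binary data) might look roughly like this in Python. The field names and the 2**53 cutoff are assumptions chosen for illustration; the cutoff is the point beyond which integers stop being exactly representable as IEEE 754 doubles.

```python
import base64
import json

MAX_SAFE_INT = 2 ** 53   # larger integers lose precision when parsed as doubles


def encode_value(value):
    """Make one field safe for consumers that treat every JSON number as a float."""
    if isinstance(value, bool):
        return value                      # bool is an int subclass; leave it alone
    if isinstance(value, int) and abs(value) > MAX_SAFE_INT:
        return str(value)                 # stringify large numbers
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    return value


def to_json_line(event: dict) -> str:
    """Serialize one event as a single line of JSON (no embedded newlines)."""
    return json.dumps({k: encode_value(v) for k, v in event.items()}, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical event; none of these field names come from Ceph.
    print(to_json_line({"message": "osd boot", "seq": 2 ** 63, "blob": b"\x00\x01"}))
```

The reader of such a log has to know which string fields were originally numbers or binary, which is one argument for a format with richer types.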
Re: Ceph mon crash
On Monday, March 19, 2012 at 11:44 AM, ruslan usifov wrote:
> Sorry, but no; I use precompiled binaries from http://ceph.newdream.net/debian. Perhaps this helps: initially I configured all the ceph services (mon, mds, osd), but then I tested only rbd and removed all mds from the cluster (3 vmware machines) with the following command:
>
> ceph mds rm 1
>
> (I am writing this line from memory, so the syntax may be wrong.)

Oh. That's a fun command! Where on earth did you find it documented?

Unfortunately, it's only supposed to be used when things get weird. (And really, I'm not sure when it would be appropriate.) If you run it on a healthy cluster, it will break things. I created a bug to make it not do that: http://tracker.newdream.net/issues/2188

If necessary I can figure out how to create a good MDSMap and inject it into your monitors, but I'd rather not if you don't have any data in there. (In which case, reformat the cluster.)

-Greg