Re: [ceph-users] How to improve single thread sequential reads?
> On 18 Aug 2015, at 13:58, Nick Fisk wrote: > > > > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Jan Schermer >> Sent: 18 August 2015 12:41 >> To: Nick Fisk >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] How to improve single thread sequential reads? >> >> Reply in text >> >>> On 18 Aug 2015, at 12:59, Nick Fisk wrote: >>> >>> >>> >>>> -Original Message- >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >>>> Of Jan Schermer >>>> Sent: 18 August 2015 11:50 >>>> To: Benedikt Fraunhofer >>> users.ceph.com.toasta@traced.net> >>>> Cc: ceph-users@lists.ceph.com; Nick Fisk >>>> Subject: Re: [ceph-users] How to improve single thread sequential > reads? >>>> >>>> I'm not sure if I missed that but are you testing in a VM backed by >>>> RBD device, or using the device directly? >>>> >>>> I don't see how blk-mq would help if it's not a VM, it just passes >>>> the >>> request >>>> to the underlying block device, and in case of RBD there is no real >>>> block device from the host perspective...? Enlighten me if I'm wrong >>>> please. I >>> have >>>> some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me >>>> cringe because I'm unable to tune the scheduler and it just makes no >>>> sense at all...? >>> >>> Since 4.0 (I think) the Kernel RBD client now uses the blk-mq >>> infrastructure, but there is a bug which limits max IO sizes to 128kb, >>> which is why for large block/sequential that testing kernel is >>> essential. I think this bug fix should make it to 4.2 hopefully. >> >> blk-mq is supposed to remove redundancy of having >> >> IO scheduler in VM -> VM block device -> host IO scheduler -> block device >> >> it's a paravirtualized driver that just moves requests from inside the VM > to >> the host queue (and this is why inside the VM you have no IO scheduler >> options - it effectively becomes noop). 
>> >> But this just doesn't make sense if you're using qemu with librbd - > there's no >> host queue. >> It would make sense if the qemu drive was krbd device with a queue. >> >> If there's no VM there should be no blk-mq? > > I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq > itself seems to be a lot more about enhancing the overall block layer > performance in Linux > > https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mec > hanism_(blk-mq) > > > >> >> So what was added to the kernel was probably the host-side infrastructure >> to handle blk-mq in guest passthrough to the krdb device, but that's > probably >> not your case, is it? >> >>> >>>> >>>> Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb >>>> (to make sure it gets into readahead), also try (if you're not using >>>> blk-mq) >>> to a >>>> cfq scheduler and set it to rotational=1. I see you've also tried >>>> this, >>> but I think >>>> blk-mq is the limiting factor here now. >>> >>> I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals >>> object size, from what I can tell) and the max_sectors_kb is already >>> set at the hw_max. But it would sure be nice if the max_hw_sectors_kb >>> could be set higher though, but I'm not sure if there is a reason for > this >> limit. >>> >>>> >>>> If you are running a single-threaded benchmark like rados bench then >>> what's >>>> limiting you is latency - it's not surprising it scales up with more >>> threads. >>> >>> Agreed, but with sequential workloads, if you can get readahead >>> working properly then you can easily remove this limitation as a >>> single threaded op effectively becomes multithreaded. >> >> Thinking on this more - I don't know if this will help after all, it will > still be a >> single thread, just trying to get ahead of the client IO - and that's not > likely to >> happen unless you can read the data in userspace slower than what Ceph >> can read... >> >> I thi
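A back-of-the-envelope illustration of why the 128kb blk-mq cap discussed above hurts large sequential reads (the numbers are from the thread; the script is just a sketch):

```shell
# With the blk-mq bug capping requests at 128 KB, one default 4 MB RBD
# object is fetched in many small requests, each paying a full round
# trip of latency; with the fix a single request can cover the object.
OBJECT_KB=$((4 * 1024))   # default RBD object size: 4 MB
CAPPED_KB=128             # max IO size with the blk-mq bug
echo "requests per object (buggy kernel): $((OBJECT_KB / CAPPED_KB))"   # 32
echo "requests per object (fixed kernel): 1"
```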
Re: [ceph-users] How to improve single thread sequential reads?
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Jan Schermer > Sent: 18 August 2015 12:41 > To: Nick Fisk > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] How to improve single thread sequential reads? > > Reply in text > > > On 18 Aug 2015, at 12:59, Nick Fisk wrote: > > > > > > > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > >> Of Jan Schermer > >> Sent: 18 August 2015 11:50 > >> To: Benedikt Fraunhofer >> users.ceph.com.toasta@traced.net> > >> Cc: ceph-users@lists.ceph.com; Nick Fisk > >> Subject: Re: [ceph-users] How to improve single thread sequential reads? > >> > >> I'm not sure if I missed that but are you testing in a VM backed by > >> RBD device, or using the device directly? > >> > >> I don't see how blk-mq would help if it's not a VM, it just passes > >> the > > request > >> to the underlying block device, and in case of RBD there is no real > >> block device from the host perspective...? Enlighten me if I'm wrong > >> please. I > > have > >> some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me > >> cringe because I'm unable to tune the scheduler and it just makes no > >> sense at all...? > > > > Since 4.0 (I think) the Kernel RBD client now uses the blk-mq > > infrastructure, but there is a bug which limits max IO sizes to 128kb, > > which is why for large block/sequential that testing kernel is > > essential. I think this bug fix should make it to 4.2 hopefully. > > blk-mq is supposed to remove redundancy of having > > IO scheduler in VM -> VM block device -> host IO scheduler -> block device > > it's a paravirtualized driver that just moves requests from inside the VM to > the host queue (and this is why inside the VM you have no IO scheduler > options - it effectively becomes noop). > > But this just doesn't make sense if you're using qemu with librbd - there's no > host queue. 
> It would make sense if the qemu drive was krbd device with a queue. > > If there's no VM there should be no blk-mq? I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq itself seems to be a lot more about enhancing the overall block layer performance in Linux https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mec hanism_(blk-mq) > > So what was added to the kernel was probably the host-side infrastructure > to handle blk-mq in guest passthrough to the krdb device, but that's probably > not your case, is it? > > > > >> > >> Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb > >> (to make sure it gets into readahead), also try (if you're not using > >> blk-mq) > > to a > >> cfq scheduler and set it to rotational=1. I see you've also tried > >> this, > > but I think > >> blk-mq is the limiting factor here now. > > > > I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals > > object size, from what I can tell) and the max_sectors_kb is already > > set at the hw_max. But it would sure be nice if the max_hw_sectors_kb > > could be set higher though, but I'm not sure if there is a reason for this > limit. > > > >> > >> If you are running a single-threaded benchmark like rados bench then > > what's > >> limiting you is latency - it's not surprising it scales up with more > > threads. > > > > Agreed, but with sequential workloads, if you can get readahead > > working properly then you can easily remove this limitation as a > > single threaded op effectively becomes multithreaded. > > Thinking on this more - I don't know if this will help after all, it will still be a > single thread, just trying to get ahead of the client IO - and that's not likely to > happen unless you can read the data in userspace slower than what Ceph > can read... > > I think striping multiple device could be the answer after all. But have you > tried creating the RBD volume as striped in Ceph? 
Yes, striping would probably give amazing performance, but the kernel client currently doesn't support it, which leaves us in the position of trying to find workarounds to boost performance. Although the client read is single threaded, the RBD/RADOS layer would split these larger readahead IOs into 4MB requests that would then be processed in parallel by
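For librbd/QEMU users (the kernel client cannot map such images, as noted above), striping is set at image-creation time. A sketch only: the pool name, image name and size are placeholders, and on rbd clients of this era --stripe-unit is given in bytes and --size in MB.

```shell
# Create a format-2 image whose objects are striped in 1 MB units across
# 4 objects, so one large sequential read fans out over several OSDs.
rbd create rbd/staging --size 1048576 --image-format 2 \
    --stripe-unit 1048576 --stripe-count 4
rbd info rbd/staging    # stripe unit/count should show in the output
```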
Re: [ceph-users] How to improve single thread sequential reads?
Reply in text

> On 18 Aug 2015, at 12:59, Nick Fisk wrote:
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
>> Sent: 18 August 2015 11:50
>> To: Benedikt Fraunhofer
>> Cc: ceph-users@lists.ceph.com; Nick Fisk
>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>>
>> I'm not sure if I missed that but are you testing in a VM backed by an RBD
>> device, or using the device directly?
>>
>> I don't see how blk-mq would help if it's not a VM, it just passes the request
>> to the underlying block device, and in case of RBD there is no real block
>> device from the host perspective...? Enlighten me if I'm wrong please. I have
>> some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me
>> cringe because I'm unable to tune the scheduler and it just makes no sense
>> at all...?
>
> Since 4.0 (I think) the kernel RBD client now uses the blk-mq
> infrastructure, but there is a bug which limits max IO sizes to 128kb, which
> is why for large block/sequential reads that testing kernel is essential. I
> think this bug fix should make it into 4.2, hopefully.

blk-mq is supposed to remove the redundancy of having:

IO scheduler in VM -> VM block device -> host IO scheduler -> block device

It's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop).

But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive was a krbd device with a queue. If there's no VM there should be no blk-mq?

So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krbd device, but that's probably not your case, is it?
> >> >> Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to >> make sure it gets into readahead), also try (if you're not using blk-mq) > to a >> cfq scheduler and set it to rotational=1. I see you've also tried this, > but I think >> blk-mq is the limiting factor here now. > > I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object > size, from what I can tell) and the max_sectors_kb is already set at the > hw_max. But it would sure be nice if the max_hw_sectors_kb could be set > higher though, but I'm not sure if there is a reason for this limit. > >> >> If you are running a single-threaded benchmark like rados bench then > what's >> limiting you is latency - it's not surprising it scales up with more > threads. > > Agreed, but with sequential workloads, if you can get readahead working > properly then you can easily remove this limitation as a single threaded op > effectively becomes multithreaded. Thinking on this more - I don't know if this will help after all, it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than what Ceph can read... I think striping multiple device could be the answer after all. But have you tried creating the RBD volume as striped in Ceph? > >> It should run nicely with a real workload once readahead kicks in and the >> queue fills up. But again - not sure how that works with blk-mq and I've >> never used the RBD device directly (the kernel client). Does it show in >> /sys/block ? Can you dump "find /sys/block/$rbd" in here? >> >> Jan >> >> >>> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer > users.ceph.com.toasta@traced.net> wrote: >>> >>> Hi Nick, >>> >>> did you do anything fancy to get to ~90MB/s in the first place? >>> I'm stuck at ~30MB/s reading cold data. single-threaded-writes are >>> quite speedy, around 600MB/s. 
>>> >>> radosgw for cold data is around the 90MB/s, which is imho limited by >>> the speed of a single disk. >>> >>> Data already present on the osd-os-buffers arrive with around >>> 400-700MB/s so I don't think the network is the culprit. >>> >>> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds >>> each, lacp 2x10g bonds) >>> >>> rados bench single-threaded performs equally bad, but with its default >>> multithreaded settings it generates wonderful numbers, usually only >>> limited by lin
Re: [ceph-users] How to improve single thread sequential reads?
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Jan Schermer > Sent: 18 August 2015 11:50 > To: Benedikt Fraunhofer users.ceph.com.toasta@traced.net> > Cc: ceph-users@lists.ceph.com; Nick Fisk > Subject: Re: [ceph-users] How to improve single thread sequential reads? > > I'm not sure if I missed that but are you testing in a VM backed by RBD > device, or using the device directly? > > I don't see how blk-mq would help if it's not a VM, it just passes the request > to the underlying block device, and in case of RBD there is no real block > device from the host perspective...? Enlighten me if I'm wrong please. I have > some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me > cringe because I'm unable to tune the scheduler and it just makes no sense > at all...? Since 4.0 (I think) the Kernel RBD client now uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why for large block/sequential that testing kernel is essential. I think this bug fix should make it to 4.2 hopefully. > > Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to > make sure it gets into readahead), also try (if you're not using blk-mq) to a > cfq scheduler and set it to rotational=1. I see you've also tried this, but I think > blk-mq is the limiting factor here now. I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object size, from what I can tell) and the max_sectors_kb is already set at the hw_max. But it would sure be nice if the max_hw_sectors_kb could be set higher though, but I'm not sure if there is a reason for this limit. > > If you are running a single-threaded benchmark like rados bench then what's > limiting you is latency - it's not surprising it scales up with more threads. 
Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation as a single threaded op effectively becomes multithreaded. > It should run nicely with a real workload once readahead kicks in and the > queue fills up. But again - not sure how that works with blk-mq and I've > never used the RBD device directly (the kernel client). Does it show in > /sys/block ? Can you dump "find /sys/block/$rbd" in here? > > Jan > > > > On 18 Aug 2015, at 12:25, Benedikt Fraunhofer users.ceph.com.toasta@traced.net> wrote: > > > > Hi Nick, > > > > did you do anything fancy to get to ~90MB/s in the first place? > > I'm stuck at ~30MB/s reading cold data. single-threaded-writes are > > quite speedy, around 600MB/s. > > > > radosgw for cold data is around the 90MB/s, which is imho limitted by > > the speed of a single disk. > > > > Data already present on the osd-os-buffers arrive with around > > 400-700MB/s so I don't think the network is the culprit. > > > > (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds > > each, lacp 2x10g bonds) > > > > rados bench single-threaded performs equally bad, but with its default > > multithreaded settings it generates wonderful numbers, usually only > > limiited by linerate and/or interrupts/s. > > > > I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to > > get to "your wonderful" numbers, but it's staying below 30 MB/s. > > > > I was thinking about using a software raid0 like you did but that's > > imho really ugly. > > When I know I needed something speedy, I usually just started dd-ing > > the file to /dev/null and wait for about three minutes before > > starting the actual job; some sort of hand-made read-ahead for > > dummies. > > > > Thx in advance > > Benedikt > > > > > > 2015-08-17 13:29 GMT+02:00 Nick Fisk : > >> Thanks for the replies guys. 
> >> > >> The client is set to 4MB, I haven't played with the OSD side yet as I > >> wasn't sure if it would make much difference, but I will give it a > >> go. If the client is already passing a 4MB request down through to > >> the OSD, will it be able to readahead any further? The next 4MB > >> object in theory will be on another OSD and so I'm not sure if > >> reading ahead any further on the OSD side would help. > >> > >> How I see the problem is that the RBD client will only read 1 OSD at > >> a time as the RBD readahead can't be set any higher than > >> max_hw_sectors_kb, which is the object size of the RBD. Please correct > me if I'm wrong on this. > &
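The latency-bound single-thread behaviour described above is easy to demonstrate with rados bench by varying the number of in-flight ops. A sketch, assuming a pool named "rbd" and a cluster you are allowed to benchmark (the seq test needs objects written first with --no-cleanup):

```shell
# Populate the pool, then compare one in-flight read op vs sixteen.
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq -t 1     # single op in flight: bound by per-op latency
rados bench -p rbd 60 seq -t 16    # many ops in flight: scales toward line rate
rados -p rbd cleanup               # remove the benchmark objects afterwards
```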
Re: [ceph-users] How to improve single thread sequential reads?
I'm not sure if I missed that but are you testing in a VM backed by an RBD device, or using the device directly?

I don't see how blk-mq would help if it's not a VM, it just passes the request to the underlying block device, and in case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...?

Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead), and also try (if you're not using blk-mq) switching to the cfq scheduler and setting rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now.

If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads. It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq and I've never used the RBD device directly (the kernel client). Does it show in /sys/block? Can you dump "find /sys/block/$rbd" in here?

Jan

> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer wrote:
>
> Hi Nick,
>
> did you do anything fancy to get to ~90MB/s in the first place?
> I'm stuck at ~30MB/s reading cold data. Single-threaded writes are
> quite speedy, around 600MB/s.
>
> radosgw for cold data is around the 90MB/s, which is imho limited by
> the speed of a single disk.
>
> Data already present on the osd-os-buffers arrive with around
> 400-700MB/s so I don't think the network is the culprit.
>
> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
> each, lacp 2x10g bonds)
>
> rados bench single-threaded performs equally bad, but with its default
> multithreaded settings it generates wonderful numbers, usually only
> limited by linerate and/or interrupts/s.
> > I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to > get to "your wonderful" numbers, but it's staying below 30 MB/s. > > I was thinking about using a software raid0 like you did but that's > imho really ugly. > When I know I needed something speedy, I usually just started dd-ing > the file to /dev/null and wait for about three minutes before > starting the actual job; some sort of hand-made read-ahead for > dummies. > > Thx in advance > Benedikt > > > 2015-08-17 13:29 GMT+02:00 Nick Fisk : >> Thanks for the replies guys. >> >> The client is set to 4MB, I haven't played with the OSD side yet as I wasn't >> sure if it would make much difference, but I will give it a go. If the >> client is already passing a 4MB request down through to the OSD, will it be >> able to readahead any further? The next 4MB object in theory will be on >> another OSD and so I'm not sure if reading ahead any further on the OSD side >> would help. >> >> How I see the problem is that the RBD client will only read 1 OSD at a time >> as the RBD readahead can't be set any higher than max_hw_sectors_kb, which >> is the object size of the RBD. Please correct me if I'm wrong on this. >> >> If you could set the RBD readahead to much higher than the object size, then >> this would probably give the desired effect where the buffer could be >> populated by reading from several OSD's in advance to give much higher >> performance. That or wait for striping to appear in the Kernel client. >> >> I've also found that BareOS (fork of Bacula) seems to has a direct RADOS >> feature that supports radosstriper. I might try this and see how it performs >> as well. >> >> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Somnath Roy >>> Sent: 17 August 2015 03:36 >>> To: Alex Gorbachev ; Nick Fisk >>> Cc: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] How to improve single thread sequential reads? 
>>> >>> Have you tried setting read_ahead_kb to bigger number for both client/OSD >>> side if you are using krbd ? >>> In case of librbd, try the different config options for rbd cache.. >>> >>> Thanks & Regards >>> Somnath >>> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Alex Gorbachev >>> Sent: Sunday, August 16, 2015 7:07 PM >>> To: Nick Fisk >>> Cc: ceph-users@lists.ceph.com
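Jan's request above ("Can you dump find /sys/block/$rbd") can be answered with something like the following. The device name rbd0 is an assumption about the first mapped image; the script falls back to a message when no RBD is mapped:

```shell
# Dump the queue directory and the tunables discussed in this thread
# for a hypothetical /dev/rbd0 mapping.
DEV=rbd0
if [ -d "/sys/block/$DEV" ]; then
    find "/sys/block/$DEV/queue" -maxdepth 1 | sort
    grep -H . "/sys/block/$DEV/queue/read_ahead_kb" \
              "/sys/block/$DEV/queue/max_sectors_kb" \
              "/sys/block/$DEV/queue/max_hw_sectors_kb" \
              "/sys/block/$DEV/queue/scheduler"
else
    echo "$DEV not mapped on this host"
fi
```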
Re: [ceph-users] How to improve single thread sequential reads?
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Benedikt Fraunhofer
> Sent: 18 August 2015 11:25
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>
> Hi Nick,
>
> did you do anything fancy to get to ~90MB/s in the first place?
> I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite
> speedy, around 600MB/s.

I only bumped the readahead up to 4096; apart from that I didn't change anything else. This was probably done on a reasonably quiet cluster - if the cluster is doing other things, sequential IO is normally the first to suffer. However, please look for a thread I started a few months ago where I was getting very poor performance reading data that had been sitting dormant for a while. It turned out to be something to do with taking a long time to retrieve xattrs, but unfortunately I never got to the bottom of it. I don't know if this is something you might also be experiencing?

> radosgw for cold data is around the 90MB/s, which is imho limited by the
> speed of a single disk.
>
> Data already present on the osd-os-buffers arrive with around 400-700MB/s
> so I don't think the network is the culprit.
>
> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds each, lacp
> 2x10g bonds)
>
> rados bench single-threaded performs equally bad, but with its default
> multithreaded settings it generates wonderful numbers, usually only limited
> by linerate and/or interrupts/s.
>
> I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to get to
> "your wonderful" numbers, but it's staying below 30 MB/s.

You will need this testing kernel for the blk-mq fixes; anything other than that at the moment will limit your max IO size.
http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing_blk-mq-plug/

> I was thinking about using a software raid0 like you did but that's imho really
> ugly.
> When I know I needed something speedy, I usually just started dd-ing the > file to /dev/null and wait for about three minutes before starting the actual > job; some sort of hand-made read-ahead for dummies. > > Thx in advance > Benedikt > > > 2015-08-17 13:29 GMT+02:00 Nick Fisk : > > Thanks for the replies guys. > > > > The client is set to 4MB, I haven't played with the OSD side yet as I > > wasn't sure if it would make much difference, but I will give it a go. > > If the client is already passing a 4MB request down through to the > > OSD, will it be able to readahead any further? The next 4MB object in > > theory will be on another OSD and so I'm not sure if reading ahead any > > further on the OSD side would help. > > > > How I see the problem is that the RBD client will only read 1 OSD at a > > time as the RBD readahead can't be set any higher than > > max_hw_sectors_kb, which is the object size of the RBD. Please correct me > if I'm wrong on this. > > > > If you could set the RBD readahead to much higher than the object > > size, then this would probably give the desired effect where the > > buffer could be populated by reading from several OSD's in advance to > > give much higher performance. That or wait for striping to appear in the > Kernel client. > > > > I've also found that BareOS (fork of Bacula) seems to has a direct > > RADOS feature that supports radosstriper. I might try this and see how > > it performs as well. > > > > > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > >> Of Somnath Roy > >> Sent: 17 August 2015 03:36 > >> To: Alex Gorbachev ; Nick Fisk > >> > >> Cc: ceph-users@lists.ceph.com > >> Subject: Re: [ceph-users] How to improve single thread sequential reads? > >> > >> Have you tried setting read_ahead_kb to bigger number for both > >> client/OSD side if you are using krbd ? > >> In case of librbd, try the different config options for rbd cache.. 
> >> > >> Thanks & Regards > >> Somnath > >> > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > >> Of Alex Gorbachev > >> Sent: Sunday, August 16, 2015 7:07 PM > >> To: Nick Fisk > >> Cc: ceph-users@lists.ceph.com > >> Subject: Re: [ceph-users] How to improve single thread sequential reads? > >> > >> Hi Nick, > >> > >> On Thu, Aug 13, 2015 at 4:37 PM, Nic
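One way to make the read_ahead_kb = 4096 setting mentioned in this message survive reboots and re-mappings is a udev rule. A sketch, assuming a standard udev setup and that every rbd* device should get the same value:

```shell
# Apply 4096 KB readahead (one full 4 MB object) to every mapped RBD.
cat > /etc/udev/rules.d/80-rbd-readahead.rules <<'EOF'
KERNEL=="rbd*", ACTION=="add|change", ATTR{queue/read_ahead_kb}="4096"
EOF
udevadm control --reload
udevadm trigger --subsystem-match=block   # re-apply to already-mapped devices
```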
Re: [ceph-users] How to improve single thread sequential reads?
On 18-08-15 12:25, Benedikt Fraunhofer wrote:
> Hi Nick,
>
> did you do anything fancy to get to ~90MB/s in the first place?
> I'm stuck at ~30MB/s reading cold data. Single-threaded writes are
> quite speedy, around 600MB/s.
>
> radosgw for cold data is around the 90MB/s, which is imho limited by
> the speed of a single disk.
>
> Data already present on the osd-os-buffers arrive with around
> 400-700MB/s so I don't think the network is the culprit.
>
> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
> each, lacp 2x10g bonds)
>
> rados bench single-threaded performs equally bad, but with its default
> multithreaded settings it generates wonderful numbers, usually only
> limited by linerate and/or interrupts/s.
>
> I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
> get to "your wonderful" numbers, but it's staying below 30 MB/s.
>
> I was thinking about using a software raid0 like you did but that's
> imho really ugly.
> When I knew I needed something speedy, I usually just started dd-ing
> the file to /dev/null and waited for about three minutes before
> starting the actual job; some sort of hand-made read-ahead for
> dummies.

It really depends on your situation, but you could also go for larger objects than 4MB for specific block devices. In a use case with a customer where they read large files single-threaded from RBD block devices we went for 64MB objects. That improved our read performance in that case: we didn't have to create a new TCP connection every 4MB and talk to a new OSD.

You could try that and see how it works out.

Wido

> Thx in advance
> Benedikt
>
> 2015-08-17 13:29 GMT+02:00 Nick Fisk:
>> Thanks for the replies guys.
>>
>> The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
>> sure if it would make much difference, but I will give it a go. If the
>> client is already passing a 4MB request down through to the OSD, will it be
>> able to readahead any further?
The next 4MB object in theory will be on >> another OSD and so I'm not sure if reading ahead any further on the OSD side >> would help. >> >> How I see the problem is that the RBD client will only read 1 OSD at a time >> as the RBD readahead can't be set any higher than max_hw_sectors_kb, which >> is the object size of the RBD. Please correct me if I'm wrong on this. >> >> If you could set the RBD readahead to much higher than the object size, then >> this would probably give the desired effect where the buffer could be >> populated by reading from several OSD's in advance to give much higher >> performance. That or wait for striping to appear in the Kernel client. >> >> I've also found that BareOS (fork of Bacula) seems to has a direct RADOS >> feature that supports radosstriper. I might try this and see how it performs >> as well. >> >> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Somnath Roy >>> Sent: 17 August 2015 03:36 >>> To: Alex Gorbachev ; Nick Fisk >>> Cc: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] How to improve single thread sequential reads? >>> >>> Have you tried setting read_ahead_kb to bigger number for both client/OSD >>> side if you are using krbd ? >>> In case of librbd, try the different config options for rbd cache.. >>> >>> Thanks & Regards >>> Somnath >>> >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Alex Gorbachev >>> Sent: Sunday, August 16, 2015 7:07 PM >>> To: Nick Fisk >>> Cc: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] How to improve single thread sequential reads? 
>>> >>> Hi Nick, >>> >>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk wrote: >>>>> -Original Message- >>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >>>>> Of Nick Fisk >>>>> Sent: 13 August 2015 18:04 >>>>> To: ceph-users@lists.ceph.com >>>>> Subject: [ceph-users] How to improve single thread sequential reads? >>>>> >>>>> Hi, >>>>> >>>>> I'm trying to use a RBD to act as a staging area for some data before >>>> pushing >>>>> it down to some LTO6 tapes. As I cannot use striping with the kernel >>>> client I >>>>> tend to
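Wido's 64MB-object suggestion translates to the image order at creation time. A sketch with placeholder pool/image names and size: on rbd clients of this era --order 26 means 2^26-byte (64 MB) objects, while newer clients spell it --object-size.

```shell
# Create an image with 64 MB objects instead of the default 4 MB (order 22).
rbd create rbd/tape-staging --size 1048576 --order 26
rbd info rbd/tape-staging    # the reported object order should now be 26
```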
Re: [ceph-users] How to improve single thread sequential reads?
Hi Nick,

did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s.

radosgw for cold data is around the 90MB/s, which is imho limited by the speed of a single disk.

Data already present on the osd-os-buffers arrive with around 400-700MB/s so I don't think the network is the culprit.

(20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds each, lacp 2x10g bonds)

rados bench single-threaded performs equally bad, but with its default multithreaded settings it generates wonderful numbers, usually only limited by linerate and/or interrupts/s.

I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to get to "your wonderful" numbers, but it's staying below 30 MB/s.

I was thinking about using a software raid0 like you did but that's imho really ugly. When I knew I needed something speedy, I usually just started dd-ing the file to /dev/null and waited for about three minutes before starting the actual job; some sort of hand-made read-ahead for dummies.

Thx in advance
Benedikt

2015-08-17 13:29 GMT+02:00 Nick Fisk:
> Thanks for the replies guys.
>
> The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
> sure if it would make much difference, but I will give it a go. If the
> client is already passing a 4MB request down through to the OSD, will it be
> able to readahead any further? The next 4MB object in theory will be on
> another OSD and so I'm not sure if reading ahead any further on the OSD side
> would help.
>
> How I see the problem is that the RBD client will only read 1 OSD at a time
> as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
> is the object size of the RBD. Please correct me if I'm wrong on this.
> > If you could set the RBD readahead to much higher than the object size, then > this would probably give the desired effect where the buffer could be > populated by reading from several OSD's in advance to give much higher > performance. That or wait for striping to appear in the Kernel client. > > I've also found that BareOS (fork of Bacula) seems to has a direct RADOS > feature that supports radosstriper. I might try this and see how it performs > as well. > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Somnath Roy >> Sent: 17 August 2015 03:36 >> To: Alex Gorbachev ; Nick Fisk >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] How to improve single thread sequential reads? >> >> Have you tried setting read_ahead_kb to bigger number for both client/OSD >> side if you are using krbd ? >> In case of librbd, try the different config options for rbd cache.. >> >> Thanks & Regards >> Somnath >> >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Alex Gorbachev >> Sent: Sunday, August 16, 2015 7:07 PM >> To: Nick Fisk >> Cc: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] How to improve single thread sequential reads? >> >> Hi Nick, >> >> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk wrote: >> >> -Original Message- >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >> >> Of Nick Fisk >> >> Sent: 13 August 2015 18:04 >> >> To: ceph-users@lists.ceph.com >> >> Subject: [ceph-users] How to improve single thread sequential reads? >> >> >> >> Hi, >> >> >> >> I'm trying to use a RBD to act as a staging area for some data before >> > pushing >> >> it down to some LTO6 tapes. As I cannot use striping with the kernel >> > client I >> >> tend to be maxing out at around 80MB/s reads testing with DD. 
Has >> >> anyone got any clever suggestions of giving this a bit of a boost, I >> >> think I need >> > to get it >> >> up to around 200MB/s to make sure there is always a steady flow of >> >> data to the tape drive. >> > >> > I've just tried the testing kernel with the blk-mq fixes in it for >> > full size IO's, this combined with bumping readahead up to 4MB, is now >> > getting me on average 150MB/s to 200MB/s so this might suffice. >> > >> > On a personal interest, I would still like to know if anyone has ideas >> > on how to really push much higher bandwidth through a RBD. >> >> Some settings in our ceph.conf that may help: >> >> osd_op_threads = 20 >> o
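Benedikt's "hand-made read-ahead" trick can be sketched as a short script. The staging path below is a made-up placeholder, not anything from the thread:

```shell
#!/bin/sh
# Hand-made read-ahead: stream the file to /dev/null once so the real
# job is later served from the page cache. FILE is a hypothetical path;
# pass your own as the first argument.
FILE="${1:-/tmp/staging-example.dat}"

# Create a small demo file if the target does not exist (illustration only).
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1M count=8 2>/dev/null

# The warming pass itself: a large block size keeps the reads sequential.
dd if="$FILE" of=/dev/null bs=4M 2>/dev/null && echo "warmed: $FILE"
```

This only helps if the file fits in the client's memory and is read again before the cache is evicted.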
Re: [ceph-users] How to improve single thread sequential reads?
Thanks for the replies guys.

The client is set to 4MB; I haven't played with the OSD side yet as I wasn't sure if it would make much difference, but I will give it a go. If the client is already passing a 4MB request down to the OSD, will it be able to read ahead any further? The next 4MB object will in theory be on another OSD, so I'm not sure that reading ahead any further on the OSD side would help.

How I see the problem is that the RBD client will only read from one OSD at a time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which is the object size of the RBD. Please correct me if I'm wrong on this.

If you could set the RBD readahead much higher than the object size, this would probably give the desired effect, where the buffer is populated by reading from several OSDs in advance for much higher performance. That, or wait for striping to appear in the kernel client.

I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS feature that supports radosstriper. I might try this and see how it performs as well.

> -Original Message-
> From: Somnath Roy
> Sent: 17 August 2015 03:36
> [...]
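A sketch of inspecting and raising the krbd readahead discussed above. The device name `rbd0` is an assumption, and the `SYSFS` variable exists only so the helpers can be exercised without a real device; on a live host leave it at `/sys` and run as root:

```shell
#!/bin/sh
# SYSFS is overridable purely for demonstration; real hosts use /sys.
SYSFS="${SYSFS:-/sys}"

# Print the queue limits that cap readahead for a block device.
show_queue_limits() {
    q="$SYSFS/block/$1/queue"
    echo "max_hw_sectors_kb: $(cat "$q/max_hw_sectors_kb" 2>/dev/null || echo '?')"
    echo "read_ahead_kb:     $(cat "$q/read_ahead_kb" 2>/dev/null || echo '?')"
}

# Set read_ahead_kb (in KB) for a device, e.g. 4096 for a 4 MB object size.
set_readahead() {
    echo "$2" > "$SYSFS/block/$1/queue/read_ahead_kb"
}

# Example on a live host (as root):
#   show_queue_limits rbd0
#   set_readahead rbd0 4096
```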
Re: [ceph-users] How to improve single thread sequential reads?
Have you tried setting read_ahead_kb to a bigger number on both the client and OSD side, if you are using krbd? In case of librbd, try the different config options for the rbd cache.

Thanks & Regards
Somnath

-Original Message-
From: Alex Gorbachev
Sent: Sunday, August 16, 2015 7:07 PM
[...]

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
Re: [ceph-users] How to improve single thread sequential reads?
Hi Nick,

On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk wrote:
> [...]

Some settings in our ceph.conf that may help:

osd_op_threads = 20
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
filestore_queue_max_ops = 9
filestore_flusher = false
filestore_max_sync_interval = 10
filestore_sync_flush = false

Regards,
Alex
Re: [ceph-users] How to improve single thread sequential reads?
> -Original Message-
> From: Nick Fisk
> Sent: 13 August 2015 18:04
> [...]

I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; this, combined with bumping readahead up to 4MB, is now getting me on average 150MB/s to 200MB/s, so this might suffice.

Out of personal interest, I would still like to know if anyone has ideas on how to really push much higher bandwidth through an RBD.
[ceph-users] How to improve single thread sequential reads?
Hi,

I'm trying to use an RBD to act as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s reads testing with dd. Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.

Rbd-fuse seems to top out at 12MB/s, so there goes that option.

I'm thinking mapping multiple RBDs and then combining them into a mdadm RAID0 stripe might work, but it seems a bit messy.

Any suggestions?

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
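The mdadm RAID0 idea from the thread can be sketched as below. The pool and image names and the /dev/rbdN paths are assumptions, and DRY_RUN defaults to 1 so the commands are only printed; set DRY_RUN=0 and run as root on a real host:

```shell
#!/bin/sh
# Stripe several mapped RBD images with mdadm RAID0 so sequential reads
# fan out across multiple OSDs in parallel. All names are illustrative.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run rbd map backup/stripe-a            # assumed to appear as /dev/rbd0
run rbd map backup/stripe-b            # assumed to appear as /dev/rbd1

# --chunk is in KB, so 4096 = 4 MB, matching the default RBD object size.
run mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=4096 \
    /dev/rbd0 /dev/rbd1

# blockdev --setra counts 512-byte sectors: 16384 sectors = 8 MB readahead.
run blockdev --setra 16384 /dev/md0
```

Reads at the md layer then alternate 4 MB chunks between the two images, which tends to land consecutive chunks on different OSDs; the cost is that losing any one image loses the whole stripe.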