Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Jan Schermer

> On 18 Aug 2015, at 13:58, Nick Fisk  wrote:
> 
> 
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Jan Schermer
>> Sent: 18 August 2015 12:41
>> To: Nick Fisk 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>> 
>> Reply in text
>> 
>>> On 18 Aug 2015, at 12:59, Nick Fisk  wrote:
>>> 
>>> 
>>> 
>>>> -Original Message-
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>> Of Jan Schermer
>>>> Sent: 18 August 2015 11:50
>>>> To: Benedikt Fraunhofer >>> users.ceph.com.toasta@traced.net>
>>>> Cc: ceph-users@lists.ceph.com; Nick Fisk 
>>>> Subject: Re: [ceph-users] How to improve single thread sequential
> reads?
>>>> 
>>>> I'm not sure if I missed that but are you testing in a VM backed by
>>>> RBD device, or using the device directly?
>>>> 
>>>> I don't see how blk-mq would help if it's not a VM, it just passes
>>>> the
>>> request
>>>> to the underlying block device, and in case of RBD there is no real
>>>> block device from the host perspective...? Enlighten me if I'm wrong
>>>> please. I
>>> have
>>>> some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
>>>> cringe because I'm unable to tune the scheduler and it just makes no
>>>> sense at all...?
>>> 
>>> Since 4.0 (I think) the Kernel RBD client now uses the blk-mq
>>> infrastructure, but there is a bug which limits max IO sizes to 128kb,
>>> which is why for large block/sequential that testing kernel is
>>> essential. I think this bug fix should make it to 4.2 hopefully.
>> 
>> blk-mq is supposed to remove redundancy of having
>> 
>> IO scheduler in VM -> VM block device -> host IO scheduler -> block device
>> 
>> it's a paravirtualized driver that just moves requests from inside the VM
> to
>> the host queue (and this is why inside the VM you have no IO scheduler
>> options - it effectively becomes noop).
>> 
>> But this just doesn't make sense if you're using qemu with librbd -
> there's no
>> host queue.
>> It would make sense if the qemu drive was krbd device with a queue.
>> 
>> If there's no VM there should be no blk-mq?
> 
> I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq
> itself seems to be a lot more about enhancing the overall block layer
> performance in Linux
> 
> https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
> 
> 
> 
>> 
>> So what was added to the kernel was probably the host-side infrastructure
>> to handle blk-mq in guest passthrough to the krbd device, but that's
> probably
>> not your case, is it?
>> 
>>> 
>>>> 
>>>> Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb
>>>> (to make sure it gets into readahead), also try (if you're not using
>>>> blk-mq)
>>> to a
>>>> cfq scheduler and set it to rotational=1. I see you've also tried
>>>> this,
>>> but I think
>>>> blk-mq is the limiting factor here now.
>>> 
>>> I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals
>>> object size, from what I can tell) and the max_sectors_kb is already
>>> set at the hw_max. But it would sure be nice if the max_hw_sectors_kb
>>> could be set higher though, but I'm not sure if there is a reason for
> this
>> limit.
>>> 
>>>> 
>>>> If you are running a single-threaded benchmark like rados bench then
>>> what's
>>>> limiting you is latency - it's not surprising it scales up with more
>>> threads.
>>> 
>>> Agreed, but with sequential workloads, if you can get readahead
>>> working properly then you can easily remove this limitation as a
>>> single threaded op effectively becomes multithreaded.
>> 
>> Thinking on this more - I don't know if this will help after all, it will
> still be a
>> single thread, just trying to get ahead of the client IO - and that's not
> likely to
>> happen unless you can read the data in userspace slower than what Ceph
>> can read...
>> 
>> I think striping across multiple devices could be the answer after all. But
>> have you tried creating the RBD volume as striped in Ceph?

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Nick Fisk




> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 18 August 2015 12:41
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
> 
> Reply in text
> 
> > On 18 Aug 2015, at 12:59, Nick Fisk  wrote:
> >
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Jan Schermer
> >> Sent: 18 August 2015 11:50
> >> To: Benedikt Fraunhofer  >> users.ceph.com.toasta@traced.net>
> >> Cc: ceph-users@lists.ceph.com; Nick Fisk 
> >> Subject: Re: [ceph-users] How to improve single thread sequential
reads?
> >>
> >> I'm not sure if I missed that but are you testing in a VM backed by
> >> RBD device, or using the device directly?
> >>
> >> I don't see how blk-mq would help if it's not a VM, it just passes
> >> the
> > request
> >> to the underlying block device, and in case of RBD there is no real
> >> block device from the host perspective...? Enlighten me if I'm wrong
> >> please. I
> > have
> >> some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
> >> cringe because I'm unable to tune the scheduler and it just makes no
> >> sense at all...?
> >
> > Since 4.0 (I think) the Kernel RBD client now uses the blk-mq
> > infrastructure, but there is a bug which limits max IO sizes to 128kb,
> > which is why for large block/sequential that testing kernel is
> > essential. I think this bug fix should make it to 4.2 hopefully.
> 
> blk-mq is supposed to remove redundancy of having
> 
> IO scheduler in VM -> VM block device -> host IO scheduler -> block device
> 
> it's a paravirtualized driver that just moves requests from inside the VM
to
> the host queue (and this is why inside the VM you have no IO scheduler
> options - it effectively becomes noop).
> 
> But this just doesn't make sense if you're using qemu with librbd -
there's no
> host queue.
> It would make sense if the qemu drive was krbd device with a queue.
> 
> If there's no VM there should be no blk-mq?

I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq
itself seems to be a lot more about enhancing the overall block layer
performance in Linux:

https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)



> 
> So what was added to the kernel was probably the host-side infrastructure
> to handle blk-mq in guest passthrough to the krbd device, but that's
probably
> not your case, is it?
> 
> >
> >>
> >> Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb
> >> (to make sure it gets into readahead), also try (if you're not using
> >> blk-mq)
> > to a
> >> cfq scheduler and set it to rotational=1. I see you've also tried
> >> this,
> > but I think
> >> blk-mq is the limiting factor here now.
> >
> > I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals
> > object size, from what I can tell) and the max_sectors_kb is already
> > set at the hw_max. But it would sure be nice if the max_hw_sectors_kb
> > could be set higher though, but I'm not sure if there is a reason for
this
> limit.
> >
> >>
> >> If you are running a single-threaded benchmark like rados bench then
> > what's
> >> limiting you is latency - it's not surprising it scales up with more
> > threads.
> >
> > Agreed, but with sequential workloads, if you can get readahead
> > working properly then you can easily remove this limitation as a
> > single threaded op effectively becomes multithreaded.
> 
> Thinking on this more - I don't know if this will help after all, it will
still be a
> single thread, just trying to get ahead of the client IO - and that's not
likely to
> happen unless you can read the data in userspace slower than what Ceph
> can read...
> 
> I think striping multiple device could be the answer after all. But have
you
> tried creating the RBD volume as striped in Ceph?

Yes, striping would probably give amazing performance, but the kernel client
currently doesn't support it, which leaves us in the position of trying to
find workarounds to boost performance.

Although the client read is single-threaded, the RBD/RADOS layer would split
these larger readahead IOs into 4MB requests that would then be processed in
parallel by the OSDs.

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Jan Schermer
Reply in text

> On 18 Aug 2015, at 12:59, Nick Fisk  wrote:
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Jan Schermer
>> Sent: 18 August 2015 11:50
>> To: Benedikt Fraunhofer > users.ceph.com.toasta@traced.net>
>> Cc: ceph-users@lists.ceph.com; Nick Fisk 
>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>> 
>> I'm not sure if I missed that but are you testing in a VM backed by RBD
>> device, or using the device directly?
>> 
>> I don't see how blk-mq would help if it's not a VM, it just passes the
> request
>> to the underlying block device, and in case of RBD there is no real block
>> device from the host perspective...? Enlighten me if I'm wrong please. I
> have
>> some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
>> cringe because I'm unable to tune the scheduler and it just makes no sense
>> at all...?
> 
> Since 4.0 (I think) the Kernel RBD client now uses the blk-mq
> infrastructure, but there is a bug which limits max IO sizes to 128kb, which
> is why for large block/sequential that testing kernel is essential. I think
> this bug fix should make it to 4.2 hopefully.

blk-mq is supposed to remove the redundancy of having

IO scheduler in VM -> VM block device -> host IO scheduler -> block device

it's a paravirtualized driver that just moves requests from inside the VM to
the host queue (and this is why inside the VM you have no IO scheduler options
- it effectively becomes noop).

But this just doesn't make sense if you're using qemu with librbd - there's no
host queue.
It would make sense if the qemu drive was a krbd device with a queue.

If there's no VM there should be no blk-mq?

So what was added to the kernel was probably the host-side infrastructure to
handle blk-mq in guest passthrough to the krbd device, but that's probably not
your case, is it?
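
For what it's worth, a quick sysfs check will tell you whether a given block
device is on the blk-mq path at all. A rough sketch (the device name rbd0 is
just an example - adjust to whatever you have mapped):

  # blk-mq devices expose an mq/ directory with one entry per hardware queue
  ls -d /sys/block/rbd0/mq 2>/dev/null && echo "rbd0 is on blk-mq"
  # on a blk-mq device the legacy scheduler knob typically just reports "none"
  cat /sys/block/rbd0/queue/scheduler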

> 
>> 
>> Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to
>> make sure it gets into readahead), also try (if you're not using blk-mq)
> to a
>> cfq scheduler and set it to rotational=1. I see you've also tried this,
> but I think
>> blk-mq is the limiting factor here now.
> 
> I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object
> size, from what I can tell) and the max_sectors_kb is already set at the
> hw_max. But it would sure be nice if the max_hw_sectors_kb could be set
> higher though, but I'm not sure if there is a reason for this limit.
> 
>> 
>> If you are running a single-threaded benchmark like rados bench then
> what's
>> limiting you is latency - it's not surprising it scales up with more
> threads.
> 
> Agreed, but with sequential workloads, if you can get readahead working
> properly then you can easily remove this limitation as a single threaded op
> effectively becomes multithreaded.

Thinking on this more - I don't know if this will help after all. It will still
be a single thread, just trying to get ahead of the client IO - and that's not
likely to happen unless the application reads the data more slowly than Ceph
can supply it...

I think striping across multiple devices could be the answer after all. But have
you tried creating the RBD volume as striped in Ceph?
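
Something along these lines is what I mean - just a sketch, the pool/image
names and sizes are made up, and note that the kernel client can only map
images with default striping, so this path is librbd-only:

  # format-2 image, 4MB objects (order 22), striped 1MB at a time across 8 objects
  rbd create rbd/staging --image-format 2 --size 102400 \
      --order 22 --stripe-unit 1048576 --stripe-count 8
  rbd info rbd/staging   # should list the stripe unit/count alongside the object size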

> 
>> It should run nicely with a real workload once readahead kicks in and the
>> queue fills up. But again - not sure how that works with blk-mq and I've
>> never used the RBD device directly (the kernel client). Does it show in
>> /sys/block ? Can you dump "find /sys/block/$rbd" in here?
>> 
>> Jan
>> 
>> 
>>> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer > users.ceph.com.toasta@traced.net> wrote:
>>> 
>>> Hi Nick,
>>> 
>>> did you do anything fancy to get to ~90MB/s in the first place?
>>> I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
>>> quite speedy, around 600MB/s.
>>> 
>>> radosgw for cold data is around the 90MB/s, which is imho limitted by
>>> the speed of a single disk.
>>> 
>>> Data already present on the osd-os-buffers arrive with around
>>> 400-700MB/s so I don't think the network is the culprit.
>>> 
>>> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
>>> each, lacp 2x10g bonds)
>>> 
>>> rados bench single-threaded performs equally bad, but with its default
>>> multithreaded settings it generates wonderful numbers, usually only
>>> limited by linerate and/or interrupts/s.

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 18 August 2015 11:50
> To: Benedikt Fraunhofer  users.ceph.com.toasta@traced.net>
> Cc: ceph-users@lists.ceph.com; Nick Fisk 
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
> 
> I'm not sure if I missed that but are you testing in a VM backed by RBD
> device, or using the device directly?
> 
> I don't see how blk-mq would help if it's not a VM, it just passes the
request
> to the underlying block device, and in case of RBD there is no real block
> device from the host perspective...? Enlighten me if I'm wrong please. I
have
> some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me
> cringe because I'm unable to tune the scheduler and it just makes no sense
> at all...?

Since 4.0 (I think) the kernel RBD client now uses the blk-mq infrastructure,
but there is a bug which limits max IO sizes to 128KB, which is why that
testing kernel is essential for large-block/sequential workloads. I think the
bug fix should hopefully make it into 4.2.
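
If you want to sanity-check what actually reaches the device on your kernel,
something like this works (device name rbd0 is assumed; on the affected
kernels I'd expect the request sizes to stay small):

  cat /sys/block/rbd0/queue/max_sectors_kb      # upper bound on request size the block layer will issue
  cat /sys/block/rbd0/queue/max_hw_sectors_kb
  # watch avgrq-sz (in 512-byte sectors) while a sequential read is running
  iostat -x 1 /dev/rbd0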

> 
> Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to
> make sure it gets into readahead), also try (if you're not using blk-mq)
to a
> cfq scheduler and set it to rotational=1. I see you've also tried this,
but I think
> blk-mq is the limiting factor here now.

I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the object
size, from what I can tell), and max_sectors_kb is already set to the hardware
max. It would be nice if max_hw_sectors_kb could be set higher, but I'm not
sure whether there is a reason for this limit.

> 
> If you are running a single-threaded benchmark like rados bench then
what's
> limiting you is latency - it's not surprising it scales up with more
threads.

Agreed, but with sequential workloads, if you can get readahead working
properly then you can easily remove this limitation as a single threaded op
effectively becomes multithreaded.

> It should run nicely with a real workload once readahead kicks in and the
> queue fills up. But again - not sure how that works with blk-mq and I've
> never used the RBD device directly (the kernel client). Does it show in
> /sys/block ? Can you dump "find /sys/block/$rbd" in here?
> 
> Jan
> 
> 
> > On 18 Aug 2015, at 12:25, Benedikt Fraunhofer  users.ceph.com.toasta@traced.net> wrote:
> >
> > Hi Nick,
> >
> > did you do anything fancy to get to ~90MB/s in the first place?
> > I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
> > quite speedy, around 600MB/s.
> >
> > radosgw for cold data is around the 90MB/s, which is imho limitted by
> > the speed of a single disk.
> >
> > Data already present on the osd-os-buffers arrive with around
> > 400-700MB/s so I don't think the network is the culprit.
> >
> > (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
> > each, lacp 2x10g bonds)
> >
> > rados bench single-threaded performs equally bad, but with its default
> > multithreaded settings it generates wonderful numbers, usually only
> > limiited by linerate and/or interrupts/s.
> >
> > I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
> > get to "your wonderful" numbers, but it's staying below 30 MB/s.
> >
> > I was thinking about using a software raid0 like you did but that's
> > imho really ugly.
> > When I know I needed something speedy, I usually just started dd-ing
> > the file to /dev/null and wait for about  three minutes before
> > starting the actual job; some sort of hand-made read-ahead for
> > dummies.
> >
> > Thx in advance
> >  Benedikt
> >
> >
> > 2015-08-17 13:29 GMT+02:00 Nick Fisk :
> >> Thanks for the replies guys.
> >>
> >> The client is set to 4MB, I haven't played with the OSD side yet as I
> >> wasn't sure if it would make much difference, but I will give it a
> >> go. If the client is already passing a 4MB request down through to
> >> the OSD, will it be able to readahead any further? The next 4MB
> >> object in theory will be on another OSD and so I'm not sure if
> >> reading ahead any further on the OSD side would help.
> >>
> >> How I see the problem is that the RBD client will only read 1 OSD at
> >> a time as the RBD readahead can't be set any higher than
> >> max_hw_sectors_kb, which is the object size of the RBD. Please correct
> me if I'm wrong on this.

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Jan Schermer
I'm not sure if I missed that, but are you testing in a VM backed by an RBD
device, or using the device directly?

I don't see how blk-mq would help if it's not a VM - it just passes the request
to the underlying block device, and in the case of RBD there is no real block
device from the host perspective...? Enlighten me if I'm wrong, please. I have
some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe
because I'm unable to tune the scheduler and it just makes no sense at all...?

Anyway, I'd try bumping up read_ahead_kb first, and max_hw_sectors_kb (to make
sure it all gets into readahead); also try switching (if you're not using
blk-mq) to the cfq scheduler and set rotational=1. I see you've also tried
this, but I think blk-mq is the limiting factor here now.
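
Roughly what I mean, as a sketch - assuming the mapped device shows up as
/dev/rbd0 (adjust the name), and the scheduler switch only applies on the
non-blk-mq path:

  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # 4MB readahead
  cat /sys/block/rbd0/queue/max_sectors_kb          # readahead requests won't be issued larger than this
  echo cfq > /sys/block/rbd0/queue/scheduler
  echo 1 > /sys/block/rbd0/queue/rotational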

If you are running a single-threaded benchmark like rados bench then what's 
limiting you is latency - it's not surprising it scales up with more threads.
It should run nicely with a real workload once readahead kicks in and the queue
fills up. But again - not sure how that works with blk-mq, and I've never used
the RBD device directly (the kernel client). Does it show up in /sys/block? Can
you dump "find /sys/block/$rbd" in here?

Jan


> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer 
>  wrote:
> 
> Hi Nick,
> 
> did you do anything fancy to get to ~90MB/s in the first place?
> I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
> quite speedy, around 600MB/s.
> 
> radosgw for cold data is around the 90MB/s, which is imho limitted by
> the speed of a single disk.
> 
> Data already present on the osd-os-buffers arrive with around
> 400-700MB/s so I don't think the network is the culprit.
> 
> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
> each, lacp 2x10g bonds)
> 
> rados bench single-threaded performs equally bad, but with its default
> multithreaded settings it generates wonderful numbers, usually only
> limiited by linerate and/or interrupts/s.
> 
> I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
> get to "your wonderful" numbers, but it's staying below 30 MB/s.
> 
> I was thinking about using a software raid0 like you did but that's
> imho really ugly.
> When I know I needed something speedy, I usually just started dd-ing
> the file to /dev/null and wait for about  three minutes before
> starting the actual job; some sort of hand-made read-ahead for
> dummies.
> 
> Thx in advance
>  Benedikt
> 
> 
> 2015-08-17 13:29 GMT+02:00 Nick Fisk :
>> Thanks for the replies guys.
>> 
>> The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
>> sure if it would make much difference, but I will give it a go. If the
>> client is already passing a 4MB request down through to the OSD, will it be
>> able to readahead any further? The next 4MB object in theory will be on
>> another OSD and so I'm not sure if reading ahead any further on the OSD side
>> would help.
>> 
>> How I see the problem is that the RBD client will only read 1 OSD at a time
>> as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
>> is the object size of the RBD. Please correct me if I'm wrong on this.
>> 
>> If you could set the RBD readahead to much higher than the object size, then
>> this would probably give the desired effect where the buffer could be
>> populated by reading from several OSD's in advance to give much higher
>> performance. That or wait for striping to appear in the Kernel client.
>> 
>> I've also found that BareOS (fork of Bacula) seems to has a direct RADOS
>> feature that supports radosstriper. I might try this and see how it performs
>> as well.
>> 
>> 
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Somnath Roy
>>> Sent: 17 August 2015 03:36
>>> To: Alex Gorbachev ; Nick Fisk 
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>>> 
>>> Have you tried setting read_ahead_kb to bigger number for both client/OSD
>>> side if you are using krbd ?
>>> In case of librbd, try the different config options for rbd cache..
>>> 
>>> Thanks & Regards
>>> Somnath
>>> 
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Alex Gorbachev
>>> Sent: Sunday, August 16, 2015 7:07 PM
>>> To: Nick Fisk
>>> Cc: ceph-users@lists.ceph.com

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Benedikt Fraunhofer
> Sent: 18 August 2015 11:25
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
> 
> Hi Nick,
> 
> did you do anything fancy to get to ~90MB/s in the first place?
> I'm stuck at ~30MB/s reading cold data. single-threaded-writes are quite
> speedy, around 600MB/s.

I only bumped the readahead up to 4096; apart from that I didn't change
anything else. This was probably done on a reasonably quiet cluster - if the
cluster is doing other things, sequential IO is normally the first to suffer.

However, please look for a thread I started a few months ago where I was
getting very poor performance reading data that had been sitting dormant for a
while. It turned out to be something to do with taking a long time to retrieve
xattrs, but unfortunately I never got to the bottom of it. I don't know if this
is something you might also be experiencing?

> 
> radosgw for cold data is around the 90MB/s, which is imho limitted by the
> speed of a single disk.
> 
> Data already present on the osd-os-buffers arrive with around 400-700MB/s
> so I don't think the network is the culprit.
> 
> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds each,
lacp
> 2x10g bonds)
> 
> rados bench single-threaded performs equally bad, but with its default
> multithreaded settings it generates wonderful numbers, usually only
limiited
> by linerate and/or interrupts/s.
> 
> I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to get
to
> "your wonderful" numbers, but it's staying below 30 MB/s.


You will need this testing kernel for the blk-mq fixes; anything other than
that at the moment will limit your max IO size:
http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing_blk-mq-plug/


> 
> I was thinking about using a software raid0 like you did but that's imho
really
> ugly.
> When I know I needed something speedy, I usually just started dd-ing the
> file to /dev/null and wait for about  three minutes before starting the
actual
> job; some sort of hand-made read-ahead for dummies.
> 
> Thx in advance
>   Benedikt
> 
> 
> 2015-08-17 13:29 GMT+02:00 Nick Fisk :
> > Thanks for the replies guys.
> >
> > The client is set to 4MB, I haven't played with the OSD side yet as I
> > wasn't sure if it would make much difference, but I will give it a go.
> > If the client is already passing a 4MB request down through to the
> > OSD, will it be able to readahead any further? The next 4MB object in
> > theory will be on another OSD and so I'm not sure if reading ahead any
> > further on the OSD side would help.
> >
> > How I see the problem is that the RBD client will only read 1 OSD at a
> > time as the RBD readahead can't be set any higher than
> > max_hw_sectors_kb, which is the object size of the RBD. Please correct
me
> if I'm wrong on this.
> >
> > If you could set the RBD readahead to much higher than the object
> > size, then this would probably give the desired effect where the
> > buffer could be populated by reading from several OSD's in advance to
> > give much higher performance. That or wait for striping to appear in the
> Kernel client.
> >
> > I've also found that BareOS (fork of Bacula) seems to has a direct
> > RADOS feature that supports radosstriper. I might try this and see how
> > it performs as well.
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Somnath Roy
> >> Sent: 17 August 2015 03:36
> >> To: Alex Gorbachev ; Nick Fisk
> >> 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] How to improve single thread sequential
reads?
> >>
> >> Have you tried setting read_ahead_kb to bigger number for both
> >> client/OSD side if you are using krbd ?
> >> In case of librbd, try the different config options for rbd cache..
> >>
> >> Thanks & Regards
> >> Somnath
> >>
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Alex Gorbachev
> >> Sent: Sunday, August 16, 2015 7:07 PM
> >> To: Nick Fisk
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] How to improve single thread sequential
reads?
> >>
> >> Hi Nick,
> >>
> >> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk  wrote:

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Wido den Hollander


On 18-08-15 12:25, Benedikt Fraunhofer wrote:
> Hi Nick,
> 
> did you do anything fancy to get to ~90MB/s in the first place?
> I'm stuck at ~30MB/s reading cold data. single-threaded-writes are
> quite speedy, around 600MB/s.
> 
> radosgw for cold data is around the 90MB/s, which is imho limitted by
> the speed of a single disk.
> 
> Data already present on the osd-os-buffers arrive with around
> 400-700MB/s so I don't think the network is the culprit.
> 
> (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds
> each, lacp 2x10g bonds)
> 
> rados bench single-threaded performs equally bad, but with its default
> multithreaded settings it generates wonderful numbers, usually only
> limiited by linerate and/or interrupts/s.
> 
> I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
> get to "your wonderful" numbers, but it's staying below 30 MB/s.
> 
> I was thinking about using a software raid0 like you did but that's
> imho really ugly.
> When I know I needed something speedy, I usually just started dd-ing
> the file to /dev/null and wait for about  three minutes before
> starting the actual job; some sort of hand-made read-ahead for
> dummies.
> 

It really depends on your situation, but you could also go for larger objects
than 4MB for specific block devices.

In a use case with a customer who reads large files single-threaded from RBD
block devices, we went for 64MB objects.

That improved our read performance in that case. We didn't have to
create a new TCP connection every 4MB and talk to a new OSD.

You could try that and see how it works out.
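
For reference, a rough sketch of how that looks at image-creation time - the
names and sizes are just examples, and the object size is set via --order
(object size = 2^order bytes):

  # order 26 -> 64MB objects instead of the default order 22 (4MB)
  rbd create rbd/backup-staging --size 1048576 --order 26
  rbd info rbd/backup-staging   # should report order 26, i.e. 64MB objects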

Wido

> Thx in advance
>   Benedikt
> 
> 
> 2015-08-17 13:29 GMT+02:00 Nick Fisk :
>> Thanks for the replies guys.
>>
>> The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
>> sure if it would make much difference, but I will give it a go. If the
>> client is already passing a 4MB request down through to the OSD, will it be
>> able to readahead any further? The next 4MB object in theory will be on
>> another OSD and so I'm not sure if reading ahead any further on the OSD side
>> would help.
>>
>> How I see the problem is that the RBD client will only read 1 OSD at a time
>> as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
>> is the object size of the RBD. Please correct me if I'm wrong on this.
>>
>> If you could set the RBD readahead to much higher than the object size, then
>> this would probably give the desired effect where the buffer could be
>> populated by reading from several OSD's in advance to give much higher
>> performance. That or wait for striping to appear in the Kernel client.
>>
>> I've also found that BareOS (fork of Bacula) seems to has a direct RADOS
>> feature that supports radosstriper. I might try this and see how it performs
>> as well.
>>
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Somnath Roy
>>> Sent: 17 August 2015 03:36
>>> To: Alex Gorbachev ; Nick Fisk 
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>>>
>>> Have you tried setting read_ahead_kb to bigger number for both client/OSD
>>> side if you are using krbd ?
>>> In case of librbd, try the different config options for rbd cache..
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Alex Gorbachev
>>> Sent: Sunday, August 16, 2015 7:07 PM
>>> To: Nick Fisk
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>>>
>>> Hi Nick,
>>>
>>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk  wrote:
>>>>> -Original Message-
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>>> Of Nick Fisk
>>>>> Sent: 13 August 2015 18:04
>>>>> To: ceph-users@lists.ceph.com
>>>>> Subject: [ceph-users] How to improve single thread sequential reads?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to use a RBD to act as a staging area for some data before
>>>> pushing
>>>>> it down to some LTO6 tapes. As I cannot use striping with the kernel
>>>> client I
>>>>> tend to be maxing out at around 80MB/s reads testing with DD.

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Benedikt Fraunhofer
Hi Nick,

did you do anything fancy to get to ~90MB/s in the first place?
I'm stuck at ~30MB/s reading cold data. Single-threaded writes are
quite speedy, around 600MB/s.

radosgw for cold data is around 90MB/s, which is IMHO limited by
the speed of a single disk.

Data already present in the OSDs' OS buffers arrives at around
400-700MB/s, so I don't think the network is the culprit.

(20-node cluster, 12x4TB 7.2k disks, 2 SSDs holding journals for 6 OSDs
each, LACP 2x10G bonds)

rados bench single-threaded performs equally badly, but with its default
multithreaded settings it generates wonderful numbers, usually only
limited by linerate and/or interrupts/s.

I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to
get to "your wonderful" numbers, but it's staying below 30 MB/s.

I was thinking about using a software RAID0 like you did, but that's
IMHO really ugly.
When I know I need something speedy, I usually just start dd-ing
the file to /dev/null and wait for about three minutes before
starting the actual job; some sort of hand-made readahead for
dummies.
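
In script form that hand-made readahead is basically the following (the paths
and the follow-up job are placeholders):

  # warm the file into the page cache ahead of the real consumer
  dd if=/data/staging/bigfile of=/dev/null bs=4M &
  sleep 180                                  # the ~3 minute head start
  run-the-actual-job /data/staging/bigfile   # placeholder for the real job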

Thx in advance
  Benedikt


2015-08-17 13:29 GMT+02:00 Nick Fisk :
> Thanks for the replies guys.
>
> The client is set to 4MB, I haven't played with the OSD side yet as I wasn't
> sure if it would make much difference, but I will give it a go. If the
> client is already passing a 4MB request down through to the OSD, will it be
> able to readahead any further? The next 4MB object in theory will be on
> another OSD and so I'm not sure if reading ahead any further on the OSD side
> would help.
>
> How I see the problem is that the RBD client will only read 1 OSD at a time
> as the RBD readahead can't be set any higher than max_hw_sectors_kb, which
> is the object size of the RBD. Please correct me if I'm wrong on this.
>
> If you could set the RBD readahead to much higher than the object size, then
> this would probably give the desired effect where the buffer could be
> populated by reading from several OSD's in advance to give much higher
> performance. That or wait for striping to appear in the Kernel client.
>
> I've also found that BareOS (fork of Bacula) seems to has a direct RADOS
> feature that supports radosstriper. I might try this and see how it performs
> as well.
>
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Somnath Roy
>> Sent: 17 August 2015 03:36
>> To: Alex Gorbachev ; Nick Fisk 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>>
>> Have you tried setting read_ahead_kb to bigger number for both client/OSD
>> side if you are using krbd ?
>> In case of librbd, try the different config options for rbd cache..
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Alex Gorbachev
>> Sent: Sunday, August 16, 2015 7:07 PM
>> To: Nick Fisk
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>>
>> Hi Nick,
>>
>> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk  wrote:
>> >> -Original Message-
>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> >> Of Nick Fisk
>> >> Sent: 13 August 2015 18:04
>> >> To: ceph-users@lists.ceph.com
>> >> Subject: [ceph-users] How to improve single thread sequential reads?
>> >>
>> >> Hi,
>> >>
>> >> I'm trying to use a RBD to act as a staging area for some data before
>> > pushing
>> >> it down to some LTO6 tapes. As I cannot use striping with the kernel
>> > client I
>> >> tend to be maxing out at around 80MB/s reads testing with DD. Has
>> >> anyone got any clever suggestions of giving this a bit of a boost, I
>> >> think I need
>> > to get it
>> >> up to around 200MB/s to make sure there is always a steady flow of
>> >> data to the tape drive.
>> >
>> > I've just tried the testing kernel with the blk-mq fixes in it for
>> > full size IO's, this combined with bumping readahead up to 4MB, is now
>> > getting me on average 150MB/s to 200MB/s so this might suffice.
>> >
>> > On a personal interest, I would still like to know if anyone has ideas
>> > on how to really push much higher bandwidth through a RBD.
>>
>> Some settings in our ceph.conf that may help:
>>
>> osd_op_threads = 20
>> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-17 Thread Nick Fisk
Thanks for the replies guys.

The client is set to 4MB; I haven't played with the OSD side yet as I wasn't
sure if it would make much difference, but I will give it a go. If the
client is already passing a 4MB request down through to the OSD, will it be
able to read ahead any further? The next 4MB object in theory will be on
another OSD, so I'm not sure if reading ahead any further on the OSD side
would help.

How I see the problem is that the RBD client will only read from one OSD at a
time, as the RBD readahead can't be set any higher than max_hw_sectors_kb,
which is the object size of the RBD. Please correct me if I'm wrong on this.

If you could set the RBD readahead much higher than the object size, then
this would probably give the desired effect, where the buffer could be
populated by reading from several OSDs in advance to give much higher
performance. That, or wait for striping to appear in the kernel client.

I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS
feature that supports radosstriper. I might try this and see how it performs
as well.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Somnath Roy
> Sent: 17 August 2015 03:36
> To: Alex Gorbachev ; Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
> 
> Have you tried setting read_ahead_kb to bigger number for both client/OSD
> side if you are using krbd ?
> In case of librbd, try the different config options for rbd cache..
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Alex Gorbachev
> Sent: Sunday, August 16, 2015 7:07 PM
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
> 
> Hi Nick,
> 
> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Nick Fisk
> >> Sent: 13 August 2015 18:04
> >> To: ceph-users@lists.ceph.com
> >> Subject: [ceph-users] How to improve single thread sequential reads?
> >>
> >> Hi,
> >>
> >> I'm trying to use a RBD to act as a staging area for some data before
> > pushing
> >> it down to some LTO6 tapes. As I cannot use striping with the kernel
> > client I
> >> tend to be maxing out at around 80MB/s reads testing with DD. Has
> >> anyone got any clever suggestions of giving this a bit of a boost, I
> >> think I need
> > to get it
> >> up to around 200MB/s to make sure there is always a steady flow of
> >> data to the tape drive.
> >
> > I've just tried the testing kernel with the blk-mq fixes in it for
> > full size IO's, this combined with bumping readahead up to 4MB, is now
> > getting me on average 150MB/s to 200MB/s so this might suffice.
> >
> > On a personal interest, I would still like to know if anyone has ideas
> > on how to really push much higher bandwidth through a RBD.
> 
> Some settings in our ceph.conf that may help:
> 
> osd_op_threads = 20
> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> filestore_queue_max_ops = 9
> filestore_flusher = false
> filestore_max_sync_interval = 10
> filestore_sync_flush = false
> 
> Regards,
> Alex
> 
> >
> >>
> >> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
> >>
> >> I'm thinking mapping multiple RBD's and then combining them into a
> >> mdadm
> >> RAID0 stripe might work, but seems a bit messy.
> >>
> >> Any suggestions?
> >>
> >> Thanks,
> >> Nick
> >>
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve single thread sequential reads?

2015-08-16 Thread Somnath Roy
Have you tried setting read_ahead_kb to a bigger number on both the client and
OSD side if you are using krbd?
In the case of librbd, try the different config options for rbd cache.
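
For librbd those knobs live in the [client] section of ceph.conf - a sketch of
the sort of thing to experiment with (the values here are only examples, not
recommendations):

[client]
rbd cache = true
# cache size / dirty limits, in bytes
rbd cache size = 67108864
rbd cache max dirty = 50331648
# librbd read-ahead settings
rbd readahead trigger requests = 10
rbd readahead max bytes = 4194304
rbd readahead disable after bytes = 52428800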

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex 
Gorbachev
Sent: Sunday, August 16, 2015 7:07 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to improve single thread sequential reads?

Hi Nick,

On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk  wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> Of Nick Fisk
>> Sent: 13 August 2015 18:04
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] How to improve single thread sequential reads?
>>
>> Hi,
>>
>> I'm trying to use a RBD to act as a staging area for some data before
> pushing
>> it down to some LTO6 tapes. As I cannot use striping with the kernel
> client I
>> tend to be maxing out at around 80MB/s reads testing with DD. Has
>> anyone got any clever suggestions of giving this a bit of a boost, I
>> think I need
> to get it
>> up to around 200MB/s to make sure there is always a steady flow of
>> data to the tape drive.
>
> I've just tried the testing kernel with the blk-mq fixes in it for
> full size IO's, this combined with bumping readahead up to 4MB, is now
> getting me on average 150MB/s to 200MB/s so this might suffice.
>
> On a personal interest, I would still like to know if anyone has ideas
> on how to really push much higher bandwidth through a RBD.

Some settings in our ceph.conf that may help:

osd_op_threads = 20
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
filestore_queue_max_ops = 9
filestore_flusher = false
filestore_max_sync_interval = 10
filestore_sync_flush = false

Regards,
Alex

>
>>
>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
>>
>> I'm thinking mapping multiple RBD's and then combining them into a
>> mdadm
>> RAID0 stripe might work, but seems a bit messy.
>>
>> Any suggestions?
>>
>> Thanks,
>> Nick
>>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve single thread sequential reads?

2015-08-16 Thread Alex Gorbachev
Hi Nick,

On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk  wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Nick Fisk
>> Sent: 13 August 2015 18:04
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] How to improve single thread sequential reads?
>>
>> Hi,
>>
>> I'm trying to use a RBD to act as a staging area for some data before
> pushing
>> it down to some LTO6 tapes. As I cannot use striping with the kernel
> client I
>> tend to be maxing out at around 80MB/s reads testing with DD. Has anyone
>> got any clever suggestions of giving this a bit of a boost, I think I need
> to get it
>> up to around 200MB/s to make sure there is always a steady flow of data to
>> the tape drive.
>
> I've just tried the testing kernel with the blk-mq fixes in it for full size
> IO's, this combined with bumping readahead up to 4MB, is now getting me on
> average 150MB/s to 200MB/s so this might suffice.
>
> On a personal interest, I would still like to know if anyone has ideas on
> how to really push much higher bandwidth through a RBD.

Some settings in our ceph.conf that may help:

osd_op_threads = 20
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
filestore_queue_max_ops = 9
filestore_flusher = false
filestore_max_sync_interval = 10
filestore_sync_flush = false

Regards,
Alex

>
>>
>> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
>>
>> I'm thinking mapping multiple RBD's and then combining them into a mdadm
>> RAID0 stripe might work, but seems a bit messy.
>>
>> Any suggestions?
>>
>> Thanks,
>> Nick
>>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to improve single thread sequential reads?

2015-08-13 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Nick Fisk
> Sent: 13 August 2015 18:04
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] How to improve single thread sequential reads?
> 
> Hi,
> 
> I'm trying to use a RBD to act as a staging area for some data before
pushing
> it down to some LTO6 tapes. As I cannot use striping with the kernel
client I
> tend to be maxing out at around 80MB/s reads testing with DD. Has anyone
> got any clever suggestions of giving this a bit of a boost, I think I need
to get it
> up to around 200MB/s to make sure there is always a steady flow of data to
> the tape drive.

I've just tried the testing kernel with the blk-mq fixes in it for full-size
IOs; this, combined with bumping readahead up to 4MB, is now getting me
150MB/s to 200MB/s on average, so this might suffice.

Out of personal interest, I would still like to know if anyone has ideas on
how to push much higher bandwidth through an RBD.
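
For anyone wanting to reproduce this, the gist of it is below (device and
mount names are assumptions):

  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
  # drop the page cache first so the read is actually cold
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/rbd/testfile of=/dev/null bs=4M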

> 
> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
> 
> I'm thinking mapping multiple RBD's and then combining them into a mdadm
> RAID0 stripe might work, but seems a bit messy.
> 
> Any suggestions?
> 
> Thanks,
> Nick
> 


 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to improve single thread sequential reads?

2015-08-13 Thread Nick Fisk
Hi,

 

I'm trying to use an RBD as a staging area for some data before pushing it
down to some LTO6 tapes. As I cannot use striping with the kernel client, I
tend to max out at around 80MB/s for reads when testing with dd. Has anyone
got any clever suggestions for giving this a bit of a boost? I think I need to
get it up to around 200MB/s to make sure there is always a steady flow of data
to the tape drive.

 

Rbd-fuse seems to top out at 12MB/s, so there goes that option.

 

I'm thinking mapping multiple RBDs and then combining them into an mdadm
RAID0 stripe might work, but it seems a bit messy.
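
For the record, the messy version would look roughly like this (image names
and the resulting /dev/rbdX numbering are assumptions):

  for i in 0 1 2 3; do rbd map rbd/staging-$i; done
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
      /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
  mkfs.xfs /dev/md0 && mount /dev/md0 /mnt/staging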

 

Any suggestions?

 

Thanks,

Nick




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com