Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Josef Johansson
Hi,

response in line

On 20 Apr 2016 7:45 a.m., "Christian Balzer"  wrote:
>
>
> Hello,
>
> On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote:
>
> >
> > OK, you asked ;-)
> >
>
> I certainly did. ^o^
>
> > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > devices in an effort to get data striping across more OSDs, I had been
> > using that setup before adding the cache tier.
> >
> Nods.
> Depending on your use case (sequential writes) actual RADOS striping might
> be more advantageous than this (with 4MB writes still going to the same
> PG/OSD all the time).
>
>
> > 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> > setup with replication size 3. No SSDs involved in those OSDs, since
> > ceph-disk does not let you break a bluestore configuration into more
> > than one device at the moment.
> >
> That's a pity, but supposedly just  a limitation of ceph-disk.
> I'd venture you can work around that with symlinks to a raw SSD
> partition, same as with current filestore journals.
>
> As Sage recently wrote:
> ---
> BlueStore can use as many as three devices: one for the WAL (journal,
> though it can be much smaller than FileStores, e.g., 128MB), one for
> metadata (e.g., an SSD partition), and one for data.
> ---

I believe he also mentioned the use of bcache and friends for the OSDs;
maybe that is a way forward in this case?
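
For anyone curious, a minimal sketch of what that could look like (device
names and the writeback choice are purely illustrative, not from this
thread, and I have not tested this combination with bluestore):

  # create a bcache device with /dev/sdb as the backing HDD and a spare
  # NVMe partition as the cache, then switch it to writeback caching
  make-bcache -B /dev/sdb -C /dev/nvme0n1p1
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  # hand the resulting /dev/bcache0 to ceph-disk as if it were a plain disk
  ceph-disk prepare /dev/bcache0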

Regards
Josef

Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Christian Balzer

Hello,

On Wed, 20 Apr 2016 03:42:00 + Stephen Lord wrote:

> 
> OK, you asked ;-)
>

I certainly did. ^o^
 
> This is all via RBD, I am running a single filesystem on top of 8 RBD
> devices in an effort to get data striping across more OSDs, I had been
> using that setup before adding the cache tier.
>
Nods.
Depending on your use case (sequential writes) actual RADOS striping might
be more advantageous than this (with 4MB writes still going to the same
PG/OSD all the time).
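
If it is of any use, a hedged example of RBD "fancy" striping (image name
and numbers are illustrative; stripe-unit is given in bytes, and AFAIK the
kernel client of that era does not support stripe counts > 1, so this would
only apply via librbd):

  # 1 TB format-2 image, writing data round-robin in 1MB units across a
  # set of 8 objects, so sequential writes touch more PGs/OSDs at once
  rbd create rbd/video01 --size 1048576 --image-format 2 \
      --stripe-unit 1048576 --stripe-count 8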

 
> 3 nodes with 11 6 Tbyte SATA drives each for a base RBD pool, this is
> setup with replication size 3. No SSDs involved in those OSDs, since
> ceph-disk does not let you break a bluestore configuration into more
> than one device at the moment.
> 
That's a pity, but supposedly just  a limitation of ceph-disk. 
I'd venture you can work around that with symlinks to a raw SSD
partition, same as with current filestore journals.

As Sage recently wrote:
---
BlueStore can use as many as three devices: one for the WAL (journal, 
though it can be much smaller than FileStores, e.g., 128MB), one for 
metadata (e.g., an SSD partition), and one for data.
---
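
A rough sketch of how that split can be expressed, as far as I understand
the current BlueStore code (partition labels are illustrative and this is
untested):

  # either pre-create the symlinks in the OSD data dir before mkfs...
  ln -s /dev/disk/by-partlabel/osd0-wal /var/lib/ceph/osd/ceph-0/block.wal
  ln -s /dev/disk/by-partlabel/osd0-db  /var/lib/ceph/osd/ceph-0/block.db
  # ...or point BlueStore at the devices via ceph.conf before creating the OSD:
  #   bluestore block wal path = /dev/disk/by-partlabel/osd0-wal
  #   bluestore block db path  = /dev/disk/by-partlabel/osd0-db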

> The 600 Mbytes/sec is an approx sustained number for the data rate I can
> get going into this pool via RBD, that turns into 3 times that for raw
> data rate, so at 33 drives that is mid 50s Mbytes/sec per drive. I have
> pushed it harder than that from time to time, but the OSD really wants
> to use fdatasync a lot and that tends to suck up a lot of the potential
> of a device, these disks will do 160 Mbytes/sec if you stream data to
> them.
> 
> I just checked with rados bench to this set of 33 OSDs with a 3 replica
> pool, and 600 Mbytes/sec is what it will do from the same client host.
> 
This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
journals on 4 nodes with a replica of 3.

So BlueStore is indeed faster than filestore.
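
For reference, a hedged example of the kind of rados bench run being
compared here (thread count and object size are illustrative):

  # 60s of 4MB writes with 32 in flight, keeping the objects for a read pass
  rados bench -p rbd 60 write -t 32 -b 4194304 --no-cleanup
  rados bench -p rbd 60 seq -t 32
  # remove the benchmark objects afterwards
  rados -p rbd cleanup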

> All the networking is 40 GB ethernet, single port per host, generally I
> can push 2.2 Gbytes/sec in one direction between two hosts over a single
> tcp link, the max I have seen is about 2.7 Gbytes/sec coming into a
> node. Short of going to RDMA that appears to be about the limit for
> these processors.
> 
Yeah, I didn't expect your network to be the bottleneck here, but it's a
good data point to have nevertheless.

> There are a grand total of 2 400 GB P3700s which are running a pool with
> a replication factor of 1, these are in 2 other nodes. Once I add in
> replication perf goes downhill. If I had more hardware I would be
> running more of these and using replication, but I am out of network
> cards right now.
> 
Alright, so at 900MB/s you're pretty close to what one would expect from 2
of these: 1080MB/s*2/2(journal).

How much downhill is that?

I have a production cache tier with 2 nodes (replica 2 of course) and 4
800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance
is pretty much what I would expect.

> So 5 nodes running OSDs, and a 6th node running the RBD client using the
> kernel implementation.
> 
I assume there's a reason for using the kernel RBD client (which kernel?),
given that it tends to be behind the curve in terms of features and speed?
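
If it is worth ruling the kernel client in or out, a hedged fio sketch
using the librbd engine for an apples-to-apples run (pool/image names are
illustrative; assumes fio was built with rbd support):

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=testimg
  bs=4M
  iodepth=32
  direct=1

  [seq-write]
  rw=write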

> Complete set of commands for creating the cache tier, I pulled this from
> history, so the line in the middle was a failed command actually so
> sorry for the red herring.
> 
>   982  ceph osd pool create nvme 512 512 replicated_nvme 
>   983  ceph osd pool set nvme size 1
>   984  ceph osd tier add rbd nvme
>   985  ceph osd tier cache-mode  nvme writeback
>   986  ceph osd tier set-overlay rbd nvme 
>   987  ceph osd pool set nvme  hit_set_type bloom 
>   988  ceph osd pool set target_max_bytes 5000  <<—— typo here, so never mind
>   989  ceph osd pool set nvme target_max_bytes 5000
>   990  ceph osd pool set nvme target_max_objects 50
>   991  ceph osd pool set nvme cache_target_dirty_ratio 0.5
>   992  ceph osd pool set nvme cache_target_full_ratio 0.8
> 
> I wish the cache tier would raise a health warning if it does not have
> a max size set; as it is, it lets you do that, flushes nothing, and fills
> the OSDs.
> 
Oh yes, people have been bitten by this over and over again.
At least it's documented now.
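
For the archive, a hedged example of the limit-related settings one would
typically want in place on a writeback tier before traffic hits it (the
values are illustrative, not a recommendation for this cluster):

  ceph osd pool set nvme target_max_bytes 200000000000
  ceph osd pool set nvme target_max_objects 500000
  ceph osd pool set nvme cache_target_dirty_ratio 0.4
  # above this ratio the agent flushes in high-speed mode
  ceph osd pool set nvme cache_target_dirty_high_ratio 0.6
  ceph osd pool set nvme cache_target_full_ratio 0.8
  ceph osd pool set nvme hit_set_count 1
  ceph osd pool set nvme hit_set_period 3600
  ceph osd pool set nvme min_read_recency_for_promote 1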

> As for what the actual test is, this is 4K uncompressed DPX video frames,
> so 50 Mbyte files written at least 24 per second on a good day, ideally
> more. This needs to sustain around 1.3 Gbytes/sec in either direction
> from a single application and needs to do it consistently. There is a
> certain amount of buffering to deal with fluctuations in perf. I am
> pushing 4096 of these files sequentially with a queue depth of 32 so
> there is rather a lot of data in flight at any one time. I know I do not
> have enough hardware to achieve this rate on writes.
>
So this is your test AND actual intended use case I presume, right? 
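
(As a quick sanity check on the numbers: 50 Mbytes/frame * 24 frames/sec =
1200 Mbytes/sec, so the ~1.3 Gbytes/sec target is consistent with "at least
24 per second, ideally more".)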

> They are being written with direct I/O into a pool of 8 RBD LUNs. The 8
> LUN setup will not really help 

[ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Stephen Lord


I have a setup using some Intel P3700 devices as a cache tier, and 33 SATA 
drives hosting the pool behind them. I set up the cache tier with writeback, 
gave it a size and max object count, etc.:

 ceph osd pool set target_max_bytes 5000
 ceph osd pool set nvme target_max_bytes 5000
 ceph osd pool set nvme target_max_objects 50
 ceph osd pool set nvme cache_target_dirty_ratio 0.5
 ceph osd pool set nvme cache_target_full_ratio 0.8

This is all running Jewel using bluestore OSDs (I know, experimental). The 
cache tier will write at about 900 Mbytes/sec and read at 2.2 Gbytes/sec; the 
SATA pool can take writes at about 600 Mbytes/sec in aggregate. However, it 
looks like the mechanism for cleaning the cache down to the disk layer is 
being massively rate-limited, and I see only about 47 Mbytes/sec of read 
activity from each SSD while this is going on.

This means that while I could be pushing data into the cache at high speed, 
it cannot evict old content very fast at all; it is very easy to hit the high 
water mark, at which point application I/O drops dramatically as it becomes 
throttled by how fast the cache can flush.

I suspect it is operating on one placement group at a time, so it ends up 
targeting a very limited number of objects, and hence disks, at any one time. 
I can see individual disk drives going busy for very short periods, but most 
of them are idle at any given moment. The only way to drive the disk-based 
OSDs fast is to hit a lot of them at once, which would mean issuing many 
cache flush operations in parallel.

Are there any controls which can influence this behavior?
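
A hedged sketch of the agent-side settings that appear related (names from
the OSD/pool configuration; the values are illustrative and their effect on
the flush rate here is unverified):

  # allow more concurrent flush/evict ops per tiering agent (defaults are small)
  ceph tell osd.\* injectargs '--osd_agent_max_ops 8 --osd_agent_max_low_ops 4'
  # above this ratio the agent flushes in high-speed mode
  ceph osd pool set nvme cache_target_dirty_high_ratio 0.6
  # flushing can also be forced by hand for testing
  rados -p nvme cache-flush-evict-all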

Thanks

  Steve

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com