Hi Lionel,
Thanks for the answer.
The missing info:
1) Ceph 0.80.9 "Firefly"
2) map-reduce does sequential reads of 64 MB (or 128 MB) blocks
3) HDFS, which runs on top of Ceph, replicates the data 3 times between VMs, 
which may be located on the same physical host or on different hosts
4) Network is 10 Gb/s NICs (but the MTU is just 1500), Open vSwitch 2.3.1
5) Ceph health is OK
6) Servers are Dell R720s with 128 GB RAM and 2x 2TB SATA disks, plus one SSD for 
Ceph journaling

I'm testing Hadoop terasort with 100GB/500GB/1TB/10TB of data.
The more data and the bigger the cluster, the worse the performance.

Any ideas how to improve it?

Thank you very much,
Dmitry


-----Original Message-----
From: Lionel Bouton [mailto:lionel+c...@bouton.name] 
Sent: Tuesday, July 07, 2015 6:07 PM
To: Dmitry Meytin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] FW: Ceph data locality

Hi Dmitry,

On 07/07/15 14:42, Dmitry Meytin wrote:
> Hi Christian,
> Thanks for the thorough explanation.
> My case is Elastic Map Reduce on top of OpenStack with Ceph backend for 
> everything (block, object, images).
> With default configuration, performance is 300% worse than bare metal.
> I did a few changes:
> 1) replication settings 2
> 2) read-ahead size 2048 KB
> 3) Max sync intervals 10s
> 4) Large queue and large bytes
> 5) OSD OP threads 20
> 6) FileStore Flusher off
> 7) Sync Flush On
> 8) Object size 64 MB
>
> And still the performance is poor when comparing to bare-metal.

Describing how you test performance on bare metal would help identify whether this 
is expected behavior or a configuration problem. If you compare sequential access 
to individual local disks with Ceph, it's an apples-to-oranges comparison (for 
example, Ceph RBD isn't optimized for this by default, and I'm not sure how far 
striping/order/readahead tuning can get you). If you compare random access to 
3-way RAID1 devices with random access to RBD devices on pools with size=3, the 
comparison becomes more relevant.
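
To keep that comparison apples-to-apples, something like the minimal 
sequential-read probe below can be pointed at both a local disk and a mapped RBD 
device. This is only a sketch: the device paths are placeholders, and for 
meaningful numbers you should run it against idle devices after dropping the page 
cache (echo 3 > /proc/sys/vm/drop_caches).

import time

def seq_read_mb_s(path, total_mb=1024, chunk_mb=4):
    """Read total_mb megabytes sequentially from path and return MB/s."""
    chunk = chunk_mb * 1024 * 1024
    target = total_mb * 1024 * 1024
    done = 0
    start = time.time()
    with open(path, "rb", buffering=0) as dev:
        while done < target:
            data = dev.read(chunk)
            if not data:
                break
            done += len(data)
    return (done / (1024.0 * 1024.0)) / (time.time() - start)

# Placeholders: pick an idle local SATA disk and a mapped RBD image.
print("local disk:", seq_read_mb_s("/dev/sdb"))
print("rbd device:", seq_read_mb_s("/dev/rbd0"))

Measuring both sides with the same tool and the same chunk size removes most of 
the readahead/striping variables from the comparison.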

I didn't see any description of the hardware and network used for Ceph which 
might help identify a bottleneck. The Ceph version is missing too.

When you test Ceph performance, is ceph -s reporting HEALTH_OK (if not, this would 
have a performance impact)? Is any deep-scrubbing going on (this will limit your 
IO bandwidth, especially if several scrubs happen at the same time)?
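
For example, a quick pre-benchmark check could look like the sketch below; it only 
shells out to the standard ceph CLI and greps its plain-text output.

import subprocess

health = subprocess.check_output(["ceph", "health"]).decode().strip()
status = subprocess.check_output(["ceph", "-s"]).decode()

print("cluster health:", health)   # expect HEALTH_OK before benchmarking
if "scrub" in status:              # scrubbing PGs show up in the status summary,
                                   # e.g. "active+clean+scrubbing+deep"
    print("scrubbing in progress -> expect reduced client IO bandwidth")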

> Profiling shows huge network demand during the map phase (I'm running 
> terasort).

It's expected with Ceph. Your network should have the capacity for your IO 
targets. Note that if your data is easy to restore you can get better write 
performance with size=1 or size=2 depending on the trade-off you want between 
durability and performance.
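
To put rough numbers on that, here is a back-of-the-envelope sketch of the write 
amplification you get when HDFS replication (3x in your setup) sits on top of a 
replicated Ceph pool. It assumes terasort writes about as much output as it reads 
and ignores shuffle/spill traffic; every block HDFS writes lands in an RBD volume 
and is then replicated again by Ceph.

def write_amplification(hdfs_replicas, ceph_size):
    # Each HDFS replica is a write into RBD, which Ceph replicates again
    # according to the pool's size.
    return hdfs_replicas * ceph_size

for ceph_size in (1, 2, 3):
    amp = write_amplification(3, ceph_size)
    print("HDFS x3 on a Ceph pool with size=%d: ~%dx the input gets written"
          % (ceph_size, amp))

With HDFS replication 3 on a size=2 pool, every terabyte sorted turns into roughly 
6 TB of writes on the Ceph side, on top of the reads and shuffle traffic, so 
lowering either layer's replication directly reduces that load.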

> I want to avoid the shared-disk behavior of Ceph and I would like the VMs to 
> read data from the local volume as much as possible.
> Am I wrong in my assumptions?

Yes: Ceph is a distributed storage network; there's no provision for local 
storage. Note that 10Gbit networks (especially dual 10Gbit) and some tuning should 
in theory give you plenty of read performance with Ceph (far more than any local 
disk could provide, except NVMe storage or similar tech). You may be limited by 
latencies and the read or write patterns of your clients, though. Ceph's total 
bandwidth is usually reached only when you have heavy concurrent accesses.
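
As a rough illustration of the latency point (assumed numbers, not measurements 
from your cluster): a single client issuing one read at a time only sees a 
fraction of the 10 Gbit/s line rate, and it takes several requests in flight to 
fill the link.

LINK_MB_S = 10 * 1000 / 8.0      # ~1250 MB/s theoretical for a 10 Gbit/s link

request_mb = 4.0                 # assumed size of one read request
latency_ms = 5.0                 # assumed per-request round-trip/service time

transfer_ms = request_mb / LINK_MB_S * 1000
per_stream_mb_s = request_mb / ((latency_ms + transfer_ms) / 1000.0)

print("line rate            : ~%.0f MB/s" % LINK_MB_S)
print("one request at a time: ~%.0f MB/s" % per_stream_mb_s)
print("requests in flight to fill the link: ~%.1f" % (LINK_MB_S / per_stream_mb_s))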

Note that if you use map-reduce with a Ceph cluster, you should probably write any 
intermediate results to local storage instead of Ceph, as Ceph doesn't bring any 
real advantage for them (the only data you should store on Ceph is what you want 
to keep after the map-reduce, so probably the initial input and the final output, 
if it is meant to be stored).

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
