Hi Christian,
Thanks for the thorough explanation.
My case is Elastic MapReduce on top of OpenStack with a Ceph backend for 
everything (block, object, images).
With the default configuration, performance is 300% worse than on bare metal.
I made a few changes (sketched in ceph.conf form below):
1) replication size 2
2) read-ahead size 2048 KB
3) max sync interval 10 s
4) large queue ops and large queue bytes
5) OSD op threads 20
6) FileStore flusher off
7) sync flush on
8) object size 64 MB
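
For reference, here is roughly what those changes look like in ceph.conf.
The option names are the Hammer-era ones and the queue values are only what
I mean by "large", so please treat this as a sketch, not a verified config:

  [global]
  osd pool default size = 2               # two replicas instead of three

  [osd]
  filestore max sync interval = 10        # seconds
  filestore flusher = false
  filestore sync flush = true
  filestore queue max ops = 5000          # "large queue" - value assumed
  filestore queue max bytes = 1048576000  # "large bytes" - value assumed
  osd op threads = 20

  [client]
  rbd default order = 26                  # 2^26 bytes = 64 MB objects

The read-ahead is set inside the guests (assuming vda is the data disk):

  echo 2048 > /sys/block/vda/queue/read_ahead_kb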

And still the performance is poor compared to bare metal.
Profiling shows huge network demand during the map phase (I'm running 
TeraSort).
I want to avoid the shared-disk behavior of Ceph and would like the VMs to 
read data from their local volume as much as possible.
Am I wrong in my assumptions?

Thank you very much,
Dmitry

-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: 07 July 2015 15:25
To: ceph-us...@ceph.com
Cc: Dmitry Meytin
Subject: Re: [ceph-users] FW: Ceph data locality


Hello,

On Tue, 7 Jul 2015 11:45:11 +0000 Dmitry Meytin wrote:

> I think it's essential for huge data clusters to deal with data locality.
> Even a very expensive network stack (100 Gb/s) will not mitigate the 
> problem if you need to move petabytes of data many times a day. Maybe 
> there is some workaround to the problem?
>
Apples, Oranges. 
Not every problem is a nail even if your preferred tool is a hammer.
 
Ceph is a _distributed_ storage system. 
Data locality is not one of its design goals; a SAN or NAS isn't really "local" 
to any client either.

The design is to have completely independent storage nodes talking to clients.

And if you have petabytes of data each day, there is no way to have all that 
data local.
Never mind that by default, you will have 3 replicas, so 2 other nodes will 
have to receive that data over the network anyway. 
And the write isn't complete until all replicas have received it.
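
(The replica count is a per-pool setting by the way; checking or changing it
is just something like

  ceph osd pool get <pool> size
  ceph osd pool set <pool> size 2

with the pool name depending on your setup.)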

If your scale allows it, or if you can enforce a split on a layer above the 
storage into smaller segments, DRBD will give you local reads. 
Of course that IS limited to the amount of disk space per node.

Ceph, on the other hand, can scale massively, and performance increases with 
every additional OSD and storage node. 

That all being said, if you had googled a bit or read the documentation, you 
would have found references to primary affinity, and to other methods of 
getting some form of data locality for use with Hadoop.
None of which are particularly good or scalable. 
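
As a rough sketch (from memory, so double-check the docs for your release),
primary affinity looks like this:

  # ceph.conf, so the monitors accept the setting
  mon osd allow primary affinity = true

  # make osd.3 half as likely to be chosen as primary (and thus serve reads)
  ceph osd primary-affinity osd.3 0.5

Note that this only influences which OSD acts as primary for a placement
group; it does not pin data to a particular host.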

Would locality be nice in some use cases? 
Hell yeah, but not at the cost of other, much more pressing issues.
Like the ability for Ceph to actually repair itself w/o human intervention and 
a magic 8 ball. 

Christian

> 
> 
> From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com]
> Sent: Tuesday, July 07, 2015 12:59 PM
> To: Dmitry Meytin
> Subject: Re: [ceph-users] Ceph data locality
> 
> > I need help configuring clients to write data to the primary OSD 
> > on the local server. I see a lot of network traffic when a VM is trying 
> > to read data that was written by that same VM. What I'm expecting is for 
> > the VM to read data from the local machine as the first replica of the 
> > data. How do I configure the CRUSH rules to make that happen?
> 
> This functionality is not in Ceph.
> Ceph has no notion of locality: faster "local nodes" vs slower 
> "remote nodes". The only thing you can configure is a failure domain, 
> which just makes sure the data is properly spread across the DC.
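> 
> As a sketch, the stock replicated CRUSH rule only spreads replicas across 
> the failure domain (hosts here); nothing in it knows about "local" vs 
> "remote":
> 
>   rule replicated_ruleset {
>           ruleset 0
>           type replicated
>           min_size 1
>           max_size 10
>           step take default
>           step chooseleaf firstn 0 type host
>           step emit
>   }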
> 
> Cheers,
> Robert van Leeuwen
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
