Re: Spark on other parallel filesystems

2014-04-05 Thread Venkat Krishnamurthy
Christopher

Just to clarify - by ‘load ops’ do you mean RDD actions that result in IO?

Venkat
From: Christopher Nguyen <c...@adatao.com>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Saturday, April 5, 2014 at 8:49 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark on other parallel filesystems


Avati, depending on your specific deployment config, there can be up to a 10X 
difference in data loading time. For example, we routinely parallel load 10+GB 
data files across small 8-node clusters in 10-20 seconds, which would take 
about 100s if bottlenecked over a 1GigE network. That's about the max 
difference for that config. If you use multiple local SSDs the difference can 
be correspondingly greater, and likewise 10x smaller for 10GigE networks.
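
Those figures check out with simple arithmetic. Below is a sketch of the calculation; the 125 MB/s 1GigE rate follows from 1 Gbit/s = 125 MB/s, while the ~100 MB/s per-local-disk rate is an assumption introduced here purely for illustration:

```python
# Back-of-envelope check of the load-time figures above.

data_bytes = 10 * 1000**3          # a 10 GB data file
gige_bytes_per_s = 125 * 1000**2   # 1 Gbit/s link = 125 MB/s of payload at best

# Bottleneck case: the whole file funnels through one 1GigE link.
bottleneck_s = data_bytes / gige_bytes_per_s
print(f"single 1GigE link: {bottleneck_s:.0f} s")        # 80 s, roughly the ~100 s quoted

# Parallel case: 8 nodes each read their slice from a local disk.
nodes = 8
disk_bytes_per_s = 100 * 1000**2   # assumed ~100 MB/s per local disk
parallel_s = data_bytes / (nodes * disk_bytes_per_s)
print(f"8 local disks in parallel: {parallel_s:.1f} s")  # 12.5 s, inside the 10-20 s range
```

With local SSDs (several hundred MB/s each) or a 10GigE network, the same arithmetic widens or shrinks the gap accordingly, which matches the "up to 10X" framing above.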

Lastly, an interesting dimension to consider is that the difference diminishes 
as your data size gets much larger relative to your cluster size, since the 
load ops have to be serialized in time anyway.

There is no difference after loading.

Sent while mobile. Pls excuse typos etc.

On Apr 4, 2014 10:45 PM, "Anand Avati" <av...@gluster.org> wrote:



On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
As long as the filesystem is mounted at the same path on every node, you should 
be able to just run Spark and use a file:// URL for your files.

The only downside with running it this way is that Lustre won’t expose data 
locality info to Spark, the way HDFS does. That may not matter if it’s a 
network-mounted file system though.

Is the locality querying mechanism specific to HDFS mode, or is it possible to 
implement plugins in Spark to query location in other ways on other 
filesystems? I ask because glusterfs can expose the data location of a file 
through virtual extended attributes, and I would be interested in making Spark 
exploit that locality when the file location is specified as glusterfs:// (or 
querying the xattr blindly for file://). How much of a difference does data 
locality make for Spark use cases anyway (since most of the computation 
happens in memory)? Any sort of numbers?
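
As a sketch of what the glusterfs side of that could look like: GlusterFS can report a file's brick locations through the virtual extended attribute trusted.glusterfs.pathinfo. The sample string, its exact layout, and the brick_hosts helper below are illustrative assumptions, not a tested integration:

```python
import re

# Hypothetical example of the string returned by GlusterFS's virtual
# extended attribute "trusted.glusterfs.pathinfo"; the real format can
# vary by volume layout and version, so treat this sample as illustrative.
SAMPLE_PATHINFO = (
    "(<DISTRIBUTE:vol-dht> "
    "(<REPLICATE:vol-replicate-0> "
    "<POSIX(/bricks/b1):server1:/bricks/b1/data.csv> "
    "<POSIX(/bricks/b2):server2:/bricks/b2/data.csv>))"
)

def brick_hosts(pathinfo: str) -> list:
    """Extract the hostnames that hold bricks for a file, in order."""
    return re.findall(r"<POSIX\([^)]*\):([^:]+):", pathinfo)

# On a live GlusterFS mount (Linux), the string could be fetched with
# something like os.getxattr("/mnt/gluster/data.csv",
# "trusted.glusterfs.pathinfo") and decoded before parsing.
print(brick_hosts(SAMPLE_PATHINFO))  # ['server1', 'server2']
```

A host list like this is exactly the shape of hint Spark's preferred-locations mechanism consumes, so the remaining work would be plumbing it into the input-split or RDD layer.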

Thanks!
Avati


Matei

On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy <ven...@yarcdata.com> wrote:

All

Are there any drawbacks or technical challenges (or any information, really) 
related to using Spark directly on a global parallel filesystem like 
Lustre/GPFS?

Any idea of what would be involved in doing a minimal proof of concept? Is it 
possible to just run Spark unmodified (without the HDFS substrate) for a start, 
or will that not work at all? I do know that it’s possible to implement Tachyon 
on Lustre and get the HDFS interface – just looking at other options.

Venkat




Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
All

Are there any drawbacks or technical challenges (or any information, really) 
related to using Spark directly on a global parallel filesystem like 
Lustre/GPFS?

Any idea of what would be involved in doing a minimal proof of concept? Is it 
possible to just run Spark unmodified (without the HDFS substrate) for a start, 
or will that not work at all? I do know that it’s possible to implement Tachyon 
on Lustre and get the HDFS interface – just looking at other options.

Venkat