Christopher
Just to clarify: by 'load ops', do you mean RDD actions that result in I/O?
Venkat
From: Christopher Nguyen <c...@adatao.com>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Saturday, April 5, 2014 at 8:49 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark on other parallel filesystems
Avati, depending on your specific deployment config, there can be up to a 10x
difference in data loading time. For example, we routinely parallel-load 10+GB
data files across small 8-node clusters in 10-20 seconds, which would take
about 100 seconds if bottlenecked over a 1GigE network. That's about the
maximum difference for that config. If you use multiple local SSDs, the
difference can be correspondingly greater; likewise, it is about 10x smaller
for 10GigE networks.
Lastly, an interesting dimension to consider is that the difference diminishes
as your data size gets much larger relative to your cluster size, since the
load ops have to be serialized in time anyway.
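As a rough sanity check on those figures (the data size, node count, and link speed are from the paragraph above; the per-node local disk throughput is my own assumption):

```python
# Back-of-envelope check of the load-time estimates above.
# Assumption: a 1GigE link delivers roughly 125 MB/s of payload
# (1 Gbit/s / 8, ignoring protocol overhead).
data_gb = 10            # data set size from the example: 10+ GB
gige_mb_per_s = 125     # effective 1GigE throughput

# Loading 10 GB through a single 1GigE bottleneck:
network_bound_s = data_gb * 1000 / gige_mb_per_s  # ~80 s, i.e. "about 100 seconds"

# Parallel load from local disks on an 8-node cluster, assuming a
# modest ~100 MB/s sequential read per node:
nodes = 8
local_mb_per_s = 100
parallel_s = data_gb * 1000 / (nodes * local_mb_per_s)  # ~12.5 s, in the 10-20 s range

print(network_bound_s, parallel_s)
```

The exact throughput numbers are assumptions; the point is only that the ratio between the two cases is roughly the node count, which matches the ~10x claim for an 8-node cluster.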
There is no difference after loading.
Sent while mobile. Pls excuse typos etc.
On Apr 4, 2014 10:45 PM, "Anand Avati" <av...@gluster.org> wrote:
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
As long as the filesystem is mounted at the same path on every node, you should
be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won’t expose data
locality info to Spark, the way HDFS does. That may not matter if it’s a
network-mounted file system though.
Is the locality-querying mechanism specific to HDFS mode, or is it possible to
implement plugins in Spark to query location in other ways on other
filesystems? I ask because GlusterFS can expose the data location of a file
through virtual extended attributes, and I would be interested in making Spark
exploit that locality when the file location is specified as glusterfs:// (or
querying the xattr blindly for file://). How much of a difference does data
locality make for Spark use cases anyway (since most of the computation
happens in memory)? Any sort of numbers?
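The xattr-based lookup described above could be sketched roughly as below. This is a minimal sketch, not a verified GlusterFS API: the attribute name (`trusted.glusterfs.pathinfo`) and the value format shown are assumptions from memory, and `os.getxattr` is Linux-only.

```python
import os
import re

# Assumed name of the virtual xattr through which GlusterFS reports
# file placement; verify against your GlusterFS version.
PATHINFO_XATTR = "trusted.glusterfs.pathinfo"

def brick_hosts(pathinfo):
    """Pull hostnames out of a pathinfo string of the assumed form
    '... <POSIX(/brick/dir):hostname:/brick/dir/file> ...'."""
    return re.findall(r"<POSIX\([^)]*\):([^:]+):", pathinfo)

def preferred_locations(path):
    """What a hypothetical locality plugin might hand to the Spark
    scheduler for `path`. Requires a mount that serves the xattr."""
    try:
        raw = os.getxattr(path, PATHINFO_XATTR)
    except OSError:
        return []  # not on GlusterFS, or xattrs unsupported
    return brick_hosts(raw.decode())

# Parsing a sample (made-up) pathinfo value:
sample = ("(<DISTRIBUTE:vol-dht> (<REPLICATE:vol-replicate-0> "
          "<POSIX(/export/brick1):node1:/export/brick1/data.csv> "
          "<POSIX(/export/brick2):node2:/export/brick2/data.csv>))")
print(brick_hosts(sample))  # ['node1', 'node2']
```

On the Spark side, these hostnames would play the same role as the block locations HDFS reports, which is what the scheduler uses for preferred task placement.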
Thanks!
Avati
Matei
On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy <ven...@yarcdata.com> wrote:
All
Are there any drawbacks or technical challenges (or any information, really)
related to using Spark directly on a global parallel filesystem like
Lustre/GPFS?
Any idea of what would be involved in doing a minimal proof of concept? Is it
possible to just run Spark unmodified (without the HDFS substrate) for a start,
or will that not work at all? I do know that it's possible to implement Tachyon
on Lustre and get the HDFS interface; I'm just looking at other options.
Venkat