Avati, depending on your specific deployment config, there can be up to a
10X difference in data loading time. For example, we routinely parallel-load
10+ GB data files across small 8-node clusters in 10-20 seconds, which
would take about 100s if bottlenecked over a 1 GigE network. That's about
the 10X difference.
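A quick back-of-envelope check of those numbers (assuming 1 GigE carries roughly 1 gigabit/s of payload, and "10 GB" means 10 × 10^9 bytes; overhead is ignored):

```python
# Sanity-check the load-time claim above.
data_bytes = 10 * 10**9            # a 10 GB data file
link_bits_per_sec = 1 * 10**9      # one 1 GigE link, if all data funnels through it
transfer_s = data_bytes * 8 / link_bits_per_sec
print(transfer_s)                  # 80.0 -> "about 100s" once protocol overhead is added

# Against the observed 10-20 s parallel load, that is a 5x-10x difference.
speedup_low, speedup_high = 100 / 20, 100 / 10
print(speedup_low, speedup_high)   # 5.0 10.0
```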
, 2014 at 8:49 AM
To: user@spark.apache.org
Subject: Re: Spark on other parallel filesystems
All,
Are there any drawbacks or technical challenges (or any information, really)
related to using Spark directly on a global parallel filesystem like
Lustre/GPFS?
Any idea of what would be involved in doing a minimal proof of concept? Is it
possible to just run Spark unmodified (without the
As long as the filesystem is mounted at the same path on every node, you should
be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won’t expose data
locality info to Spark, the way HDFS does. That may not matter if it’s a
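A minimal sketch of what "just use a file:// URL" looks like in practice, assuming the parallel filesystem is mounted at the same absolute path on every node as Matei describes (the path /mnt/lustre here is hypothetical):

```python
# Build the file:// URL Spark accepts for a locally mounted filesystem.
from pathlib import PurePosixPath

def shared_file_url(path: str) -> str:
    """Turn a shared-mount path into a file:// URL.

    The path must be absolute and identical on every node, since each
    worker resolves it against its own local mount.
    """
    p = PurePosixPath(path)
    if not p.is_absolute():
        raise ValueError("path must be absolute and identical on all nodes")
    return p.as_uri()

url = shared_file_url("/mnt/lustre/datasets/events.log")
print(url)  # file:///mnt/lustre/datasets/events.log
# With a running SparkContext sc, reading it is then simply:
#   sc.textFile(url).count()
```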
We run Spark (in Standalone mode) on top of a network-mounted file system
(NFS), rather than HDFS, and find it to work great. It required no modification
or special configuration to set this up; as Matei says, we just point Spark to
data using the file location.
-- Jeremy
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> As long as the filesystem is mounted at the same path on every node, you
> should be able to just run Spark and use a file:// URL for your files.
> The only downside with running it this way is that Lustre won't expose