Re: Spark on other parallel filesystems

2014-04-05 Thread Christopher Nguyen
Avati, depending on your specific deployment config, there can be up to a 10X difference in data loading time. For example, we routinely parallel load 10+GB data files across small 8-node clusters in 10-20 seconds, which would take about 100s if bottlenecked over a 1GigE network. That's about the …
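A quick sanity check on that 100s figure, assuming the full 10 GB has to cross a single 1GigE link:

    1 Gbit/s / 8 bits per byte = 125 MB/s peak throughput
    10,000 MB / 125 MB/s = 80 s, i.e. roughly 100 s once protocol overhead is included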

Re: Spark on other parallel filesystems

2014-04-05 Thread Venkat Krishnamurthy
…, 2014 at 8:49 AM To: user@spark.apache.org Subject: Re: Spark on other parallel filesystems Avati, depending on your specific deployment config, there can be up to a 10X difference in data loading time. For example, we …

Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
All: Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it possible to just run Spark unmodified (without the …

Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a …
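A minimal sketch of what Matei describes, assuming a parallel filesystem mounted at /mnt/lustre (a hypothetical path) on the driver and on every worker:

    import org.apache.spark.{SparkConf, SparkContext}

    // The file:// URL tells Spark to read the path as a local file on
    // each node; this works because the mount point is identical
    // cluster-wide, but Spark gets no locality information from it.
    val conf = new SparkConf().setAppName("ParallelFSExample")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///mnt/lustre/data/events.log")
    println(lines.count())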

Re: Spark on other parallel filesystems

2014-04-04 Thread Jeremy Freeman
We run Spark (in Standalone mode) on top of a network-mounted file system (NFS) rather than HDFS, and find it works great. It required no modification or special configuration; as Matei says, we just point Spark to data using the file location. -- Jeremy On Apr 4, 2014, at …
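The same pattern works for output, since every node sees the same paths; a sketch against a hypothetical NFS mount at /mnt/nfs:

    // Word count over an NFS-mounted input, written back to NFS.
    val counts = sc.textFile("file:///mnt/nfs/corpus/docs.txt")
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    // Each task writes its partition as a part-NNNNN file under this directory.
    counts.saveAsTextFile("file:///mnt/nfs/output/wordcounts")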

Re: Spark on other parallel filesystems

2014-04-04 Thread Anand Avati
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won't expose …