Re: Remove dependence on HDFS

2017-02-13 Thread Calvin Jia
Hi Ben, You can replace HDFS with a number of storage systems, since Spark is compatible with other storage such as S3. This would allow you to scale your compute nodes solely to add compute power, not disk space. You can deploy Alluxio on your compute nodes to offset the …
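A minimal sketch of pointing Spark at S3 instead of HDFS, assuming the hadoop-aws / s3a connector is on the classpath; the bucket and credential placeholders are hypothetical, and static keys are shown only for illustration (IAM roles are preferable):

```properties
# spark-defaults.conf (sketch -- s3a connector property names)
spark.hadoop.fs.s3a.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key    <ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key    <SECRET_KEY>
```

Jobs then read and write with paths like `s3a://my-bucket/dataset/` wherever an `hdfs://...` path was used before.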

Re: Question about Spark and filesystems

2016-12-19 Thread Calvin Jia
Hi, If you are concerned with the performance of the alternative filesystems (i.e., needing a caching client), you can use Alluxio on top of any of NFS, Ceph …
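For example, layering Alluxio over an NFS mount amounts to pointing its under-filesystem at the mounted path; this is a sketch with a placeholder path, and the property name (from the Alluxio 1.x docs) should be verified against your release:

```properties
# alluxio-site.properties (sketch; check the property name for your version)
alluxio.underfs.address=/mnt/nfs/shared
```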

Re: About Spark Multiple Shared Context with Spark 2.0

2016-12-13 Thread Calvin Jia
Hi, Alluxio will allow you to share or cache data in memory between different Spark contexts by storing RDDs or DataFrames as files in the Alluxio system. The files can then be accessed by any Spark job like a file in any other distributed storage system. These two blogs do a good job of …
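The handoff pattern is simply write-then-read through a shared path. A minimal sketch, using a local temp directory to stand in for an `alluxio://` URI (a real job would do e.g. `df.write.parquet("alluxio://master:19998/shared")` from one context and read the same path from another):

```python
import json
import os
import tempfile

# A local temp dir stands in for an alluxio:// path shared by two contexts.
shared_dir = tempfile.mkdtemp()
path = os.path.join(shared_dir, "shared.json")

def job_a_write(records):
    # "Job A": persist its result as a file at the shared location.
    with open(path, "w") as f:
        json.dump(records, f)

def job_b_read():
    # "Job B": a separate job reads the same file back.
    with open(path) as f:
        return json.load(f)

job_a_write([{"id": 1}, {"id": 2}])
print(job_b_read())
```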

Re: sandboxing spark executors

2016-11-04 Thread Calvin Jia
Hi, If you are using the latest Alluxio release (1.3.0), authorization is enabled, preventing users from accessing data they do not have permission to access. For older versions, you will need to enable the security flag. The documentation on security …
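On the 1.x line the switch looks roughly like the following in `alluxio-site.properties`; the exact property name is an assumption here and should be checked against the security docs for your release:

```properties
# alluxio-site.properties (sketch; verify the property name for your version)
alluxio.security.authorization.permission.enabled=true
```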

Re: feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-19 Thread Calvin Jia
Hi, Alluxio allows for data sharing between applications through a File System API (native Java Alluxio client, Hadoop FileSystem, or POSIX through FUSE). If your MPI applications can use any of these interfaces, you should be able to use Alluxio for data sharing out of the box. In terms of …
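The POSIX route is the least invasive for MPI: once an alluxio-fuse mount exists, a non-JVM process reads data a Spark job wrote using ordinary file I/O. A sketch, with a temp directory standing in for the fuse mount point (e.g. `/mnt/alluxio`):

```python
import os
import struct
import tempfile

mount = tempfile.mkdtemp()  # stand-in for an alluxio-fuse mount point
path = os.path.join(mount, "vector.bin")

# "Spark side": write a vector of doubles as raw little-endian bytes.
values = [1.0, 2.5, 4.0]
with open(path, "wb") as f:
    f.write(struct.pack("<%dd" % len(values), *values))

# "MPI side": plain POSIX open/read, no Alluxio client library needed.
with open(path, "rb") as f:
    data = f.read()
print(list(struct.unpack("<%dd" % (len(data) // 8), data)))  # [1.0, 2.5, 4.0]
```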

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-28 Thread Calvin Jia
Hi, Thanks for the detailed information. How large is the dataset you are running against? Also, did you change any Tachyon configurations? Thanks, Calvin

Re: spark 1.6.0 on ec2 doesn't work

2016-01-19 Thread Calvin Jia
Hi Oleg, The Tachyon-related issue should be fixed. Hope this helps, Calvin On Mon, Jan 18, 2016 at 2:51 AM, Oleg Ruchovets wrote: > Hi, > I tried to follow the Spark 1.6.0 instructions to install Spark on EC2. > It doesn't work properly - got exceptions and at the end …

Re: Saving RDDs in Tachyon

2015-12-09 Thread Calvin Jia
Hi Mark, Were you able to successfully store the RDD with Akhil's method? When you read it back as an objectFile, you will also need to specify the correct type. You can find more information about integrating Spark and Tachyon on this page: …

Re: Re: Spark RDD cache persistence

2015-12-09 Thread Calvin Jia
Hi Deepak, For persistence across Spark jobs, you can store and access the RDDs in Tachyon. Tachyon works with ramdisk, which gives you in-memory performance similar to what you would have within a Spark job. For more information, you can take a look at the docs on Tachyon-Spark integration: …

Re: How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Calvin Jia
Hi Shane, Tachyon provides an API to get the block locations of a file, which Spark uses when scheduling tasks. Hope this helps, Calvin On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane wrote: > Hi all, > I am looking into how Spark handles data locality wrt …
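The scheduling idea can be sketched in a few lines: given each task's block holders (what the block-location API returns), prefer a node-local host and fall back otherwise. All names here are illustrative, not Spark's actual scheduler code:

```python
def assign_tasks(block_locations, hosts):
    """block_locations: task_id -> set of hosts holding that task's input block.
    hosts: available executor hosts. Returns task_id -> chosen host."""
    assignment = {}
    for task, holders in block_locations.items():
        local = [h for h in hosts if h in holders]
        # Prefer a node-local host; otherwise fall back to any available host.
        assignment[task] = local[0] if local else hosts[0]
    return assignment

locations = {"t1": {"hostA"}, "t2": {"hostB"}, "t3": {"hostZ"}}
print(assign_tasks(locations, ["hostA", "hostB"]))
# t1 and t2 run where their blocks live; t3's holder is unavailable, so it falls back
```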

Re: TTL for saveAsObjectFile()

2015-10-14 Thread Calvin Jia
Hi Antonio, I don't think Spark provides a way to pass parameters down with saveAsObjectFile. One way could be to set a default TTL in the configuration, but that approach doesn't make much sense since TTL is not necessarily uniform across files. Baidu will be talking about their use of TTL in Tachyon with Spark …
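The non-uniformity point is the crux: TTL is per-file metadata, so a single global default is a poor fit. A small illustrative sketch (hypothetical class, not Tachyon's API) of per-file TTLs:

```python
import time

class TtlStore:
    """Toy store tracking a per-file TTL, as opposed to one global default."""

    def __init__(self):
        self._files = {}  # path -> (created_at, ttl_seconds or None)

    def save(self, path, ttl=None, now=None):
        self._files[path] = (now if now is not None else time.time(), ttl)

    def is_expired(self, path, now=None):
        created, ttl = self._files[path]
        if ttl is None:
            return False  # no TTL set: keep the file indefinitely
        current = now if now is not None else time.time()
        return current - created >= ttl

store = TtlStore()
store.save("/tmp/hot", ttl=60, now=0)   # short-lived intermediate data
store.save("/tmp/archive", now=0)       # no TTL: kept forever
print(store.is_expired("/tmp/hot", now=120), store.is_expired("/tmp/archive", now=120))
```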

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Calvin Jia
Hi, Tachyon (http://tachyon-project.org) manages memory off heap, which can help prevent long GC pauses. Also, using Tachyon will allow data to be shared between Spark jobs if they use the same dataset. Here is a production use case (http://www.meetup.com/Tachyon/events/222485713/) where Baidu …
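On the Spark 1.5/1.6 line this was wired up through the external block store settings together with `StorageLevel.OFF_HEAP`; a sketch, with the master address as a placeholder:

```properties
# spark-defaults.conf (sketch; Spark 1.5/1.6-era property names)
spark.externalBlockStore.url        tachyon://tachyon-master:19998
spark.externalBlockStore.baseDir    /spark_offheap
```

RDDs persisted with `StorageLevel.OFF_HEAP` are then kept in Tachyon rather than on the executor heaps, so they neither contribute to GC pressure nor disappear when an executor dies.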

Re: Spark SQL 1.3.1 saveAsParquetFile will output tachyon file with different block size

2015-04-28 Thread Calvin Jia
Hi, You can apply this patch (https://github.com/apache/spark/pull/5354) and recompile. Hope this helps, Calvin On Tue, Apr 28, 2015 at 1:19 PM, sara mustafa (eng.sara.must...@gmail.com) wrote: Hi Zhang, How did you compile Spark 1.3.1 with Tachyon? When I changed the Tachyon version to 0.6.3 in …
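One way to apply the PR and rebuild, sketched under the assumptions that the patch merges cleanly onto the 1.3.1 tag and that the usual Spark 1.3-era build flags fit your Hadoop version (the branch name is arbitrary):

```sh
# Fetch the patch from the PR referenced above and rebuild (sketch).
git clone https://github.com/apache/spark.git && cd spark
git checkout v1.3.1
git fetch origin pull/5354/head:parquet-blocksize
git merge parquet-blocksize
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.4.0
```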