Re: Difference between Checkpointing and Persist

2019-04-19 Thread Gene Pang
Hi Subash, I'm not sure how the checkpointing works, but with StorageLevel.MEMORY_AND_DISK, Spark will store the RDD in on-heap memory, and spill to disk if necessary. However, the data is only usable by that Spark job. Saving the RDD will write the data out to an external storage system, like
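A minimal sketch of the three options discussed above, in a spark-shell session (paths hypothetical):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs://namenode:8020/input")

    // persist: on-heap memory, spilling to disk; visible only to this Spark job
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // checkpoint: materializes the RDD and truncates its lineage;
    // requires a checkpoint directory to be set first
    sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")
    rdd.checkpoint()

    // save: writes to external storage, readable by any other job
    rdd.saveAsTextFile("hdfs://namenode:8020/output")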

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-14 Thread Gene Pang
On Apr 13, 2018, at 2:26 PM, Gene Pang <gene.p...@gmail.com> wrote: Hi Jason, Alluxio does work with Spark in master=local mode. This is because both spark-submit and spark-shell have command-line options to set the classpath for the JVM that is being sta

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Gene Pang
Hi Jason, Alluxio does work with Spark in master=local mode. This is because both spark-submit and spark-shell have command-line options to set the classpath for the JVM that is being started. If you are not using spark-submit or spark-shell, you will have to figure out how to configure that JVM
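A hedged example of how those options look on the command line (jar path hypothetical); in master=local mode the driver JVM is the only JVM, so the driver classpath is the one that matters:

    spark-shell --master local[4] \
      --driver-class-path /path/to/alluxio-client.jar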

Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Gene Pang
Hi, Alluxio enables sharing dataframes across different applications. This blog post talks about dataframes and Alluxio, and this Spark Summit presentation
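A minimal sketch of the pattern (host/port and paths hypothetical; both applications need the Alluxio client jar on their classpath):

    // application A writes the dataframe to Alluxio as Parquet
    df.write.parquet("alluxio://alluxio-master:19998/shared/df")

    // application B, with its own SparkSession, reads it back
    val shared = spark.read.parquet("alluxio://alluxio-master:19998/shared/df")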

Re: "Sharing" dataframes...

2017-06-21 Thread Gene Pang
Hi Jean, As others have mentioned, you can use Alluxio with Spark dataframes to keep the data in memory, and for other jobs to read them from memory again. Hope this helps, Gene On Wed, Jun 21, 2017 at 8:08 AM, Jean Georges

Re: An Architecture question on the use of virtualised clusters

2017-06-02 Thread Gene Pang
As Vincent mentioned earlier, I think Alluxio can work for this. You can mount your (potentially remote) storage systems to Alluxio, and deploy Alluxio co-located with the compute cluster. The computation framework will
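A hedged example of the mount step via the Alluxio CLI (mount point and bucket hypothetical):

    bin/alluxio fs mount /mnt/s3data s3a://my-bucket/data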

Re: Are tachyon and akka removed from 2.1.1 please

2017-05-22 Thread Gene Pang
Hi, Tachyon has been renamed to Alluxio. Here is the documentation for running Alluxio with Spark. Hope this helps, Gene On Sun, May 21, 2017 at 6:15 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote: Hi all, I read some paper about

Re: Spark <--> S3 flakiness

2017-05-12 Thread Gene Pang
Hi, Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post on Spark + Alluxio + S3, and here is some documentation for configuring Alluxio + S3
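A hedged sketch of the alluxio-site.properties entries for an S3 under store, per the Alluxio 1.x docs (bucket and credentials hypothetical):

    alluxio.underfs.address=s3a://my-bucket/alluxio
    aws.accessKeyId=MY_ACCESS_KEY
    aws.secretKey=MY_SECRET_KEY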

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-04 Thread Gene Pang
As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help you. Here is some documentation on how to run Alluxio and Spark together, and here is a blog post on a Spark streaming + Alluxio use case
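A minimal sketch of the pattern being suggested (names and path hypothetical): instead of mutating a broadcast variable, reload the reference data from Alluxio and re-broadcast it when it changes.

    def loadLookup(): Map[String, String] =
      sc.textFile("alluxio://alluxio-master:19998/lookup")
        .map { line => val Array(k, v) = line.split(","); (k, v) }
        .collect().toMap

    var lookupBc = sc.broadcast(loadLookup())
    // later, when the data on Alluxio has been updated:
    lookupBc.unpersist()
    lookupBc = sc.broadcast(loadLookup())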

Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-24 Thread Gene Pang
As Vincent mentioned, Alluxio helps with sharing data across different Spark contexts. This blog post about Spark dataframes and Alluxio discusses that use case. Thanks, Gene On Sat, Apr 22, 2017 at 2:14 AM, vincent gromakowski <

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-21 Thread Gene Pang
Hi Georg, Yes, that should be possible with Alluxio. Tachyon was renamed to Alluxio. This article on how Alluxio is used for a Spark streaming use case may be helpful. Thanks, Gene On Fri, Apr

Re: Spark 2.x OFF_HEAP persistence

2017-01-09 Thread Gene Pang
.saveAsTextFile(alluxioPath) / rdd.saveAsObjectFile(alluxioPath) for guarantees like persisted rdd surviving a Spark JVM crash etc, as also the other benefits you mention. Vin. On Thu, Jan 5, 2017 at 2:50 AM, Gene Pang <gene.p...@gmail.com> wrote: Hi

Re: Spark 2.x OFF_HEAP persistence

2017-01-04 Thread Gene Pang
Hi Vin, From Spark 2.x, OFF_HEAP was changed to no longer directly interface with an external block store. The previous tight dependency was restrictive and reduced flexibility. It looks like the new version uses the executor's off-heap memory to allocate direct byte buffers, and does not
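For reference, a hedged example of the Spark 2.x settings that control that executor off-heap allocation (size illustrative):

    spark-submit \
      --conf spark.memory.offHeap.enabled=true \
      --conf spark.memory.offHeap.size=2g \
      ...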

Re: Sharing RDDS across applications and users

2016-10-27 Thread Gene Pang
Hi Mich, Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames among different applications and contexts. The data typically stays in memory, but with Alluxio's tiered storage, the "colder" data can be evicted to other media, like SSDs and HDDs. Here is a blog post
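A hedged sketch of what a two-tier layout looks like in alluxio-site.properties, per the Alluxio 1.x docs (paths and quotas illustrative):

    alluxio.worker.tieredstore.levels=2
    alluxio.worker.tieredstore.level0.alias=MEM
    alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
    alluxio.worker.tieredstore.level0.dirs.quota=16GB
    alluxio.worker.tieredstore.level1.alias=SSD
    alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd
    alluxio.worker.tieredstore.level1.dirs.quota=100GB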

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-19 Thread Gene Pang
Hi Mich, While Alluxio is not a database (it exposes a file system interface), you can use Alluxio to keep certain data in memory. With Alluxio, you can selectively pin data in memory (http://www.alluxio.org/docs/master/en/Command-Line-Interface.html#pin). There are also ways to control how to
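A hedged example of the pin command from that page (path hypothetical):

    bin/alluxio fs pin /hot/dataset
    bin/alluxio fs unpin /hot/dataset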

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Gene Pang
Hi, If you want to use Alluxio with Spark 2.x, it is recommended to write to and read from Alluxio with files. You can save an RDD with saveAsObjectFile with an Alluxio path (alluxio://host:port/path/to/file), and you can read that file from any other Spark job. Here is additional information on
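A minimal sketch of that write/read handoff (host/port, path, and element type illustrative):

    // job A saves the RDD to Alluxio
    rdd.saveAsObjectFile("alluxio://alluxio-master:19998/shared/rdd")

    // a separate job B reads it back
    val restored = sc.objectFile[String]("alluxio://alluxio-master:19998/shared/rdd")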

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Gene Pang
Hi Chanh, You should be able to set the Alluxio block size with: sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb") I think you have many parquet files because you have many Spark executors writing out their partition of the files. Hope that helps, Gene On Sun, Jul
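A minimal sketch of reducing the part-file count by coalescing partitions before the write (target count and path illustrative):

    df.coalesce(8).write.parquet("alluxio://alluxio-master:19998/out")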

Re: Best practice for handing tables between pipeline components

2016-06-27 Thread Gene Pang
Yes, Alluxio (http://www.alluxio.org/) can be used to store data in-memory between stages in a pipeline. Here is more information about running Spark with Alluxio: http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html Hope that helps, Gene On Mon, Jun 27, 2016 at 10:38

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-21 Thread Gene Pang
On Jun 15, 2016, at 8:44 PM, Chanh Le <giaosu...@gmail.com> wrote: Hi Gene, I am using Alluxio 1.1.0 and the Spark 2.0 Preview version. I load from alluxio, then cache and query a 2nd time, and Spark gets stuck. On Jun 15, 2016, at 8:42 PM, Ge

Re: Limit pyspark.daemon threads

2016-06-15 Thread Gene Pang
As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory, and you can then share that RDD across different jobs. If you would like to run Spark on Alluxio, this documentation can help: http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html Thanks, Gene On

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Gene Pang
Hi, Which version of Alluxio are you using? Thanks, Gene On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le wrote: I am testing Spark 2.0. I load data from alluxio and cache it, then I query; the first query is ok because it kicks off the cache action. But after that I run the

Re: Silly Question on my part...

2016-05-17 Thread Gene Pang
Hi Michael, Yes, you can use Alluxio to share Spark RDDs. Here is a blog post about getting started with Spark and Alluxio ( http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/), and some documentation ( http://alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html).

Re: Spark partition size tuning

2016-01-26 Thread Gene Pang
Hi Jia, If you want to change the Tachyon block size, you can set the tachyon.user.block.size.bytes.default parameter ( http://tachyon-project.org/documentation/Configuration-Settings.html). You can set it via extraJavaOptions per job, or adding it to tachyon-site.properties. I hope that helps,
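A hedged example of the per-job variant via extraJavaOptions (value illustrative, 256 MB in bytes):

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dtachyon.user.block.size.bytes.default=268435456" \
      --conf "spark.executor.extraJavaOptions=-Dtachyon.user.block.size.bytes.default=268435456" \
      ...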

Re: How to query data in tachyon with spark-sql

2016-01-24 Thread Gene Pang
Hi, You should be able to point Hive to Tachyon instead of HDFS, and that should allow Hive to access data in Tachyon. If Spark SQL was pointing to an HDFS file, you could instead point it to a Tachyon file, and that should work too. Hope that helps, Gene On Wed, Jan 20, 2016 at 2:06 AM, Sea
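A minimal sketch of the Spark SQL side (host/port and path hypothetical, Spark 1.x API):

    val df = sqlContext.read.parquet("tachyon://tachyon-master:19998/warehouse/table")
    df.registerTempTable("my_table")
    sqlContext.sql("SELECT count(*) FROM my_table").show()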

Re: Reuse Executor JVM across different JobContext

2016-01-19 Thread Gene Pang
Yes, you can share RDDs with Tachyon, while keeping the data in memory. Spark jobs can write to a Tachyon path (tachyon://host:port/path/) and other jobs can read from the same path. Here is a presentation that includes that use case:

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Gene Pang
" support? >> >> It seems that the practice I hear the most about is the idea of loading >> resources as RDD's and then doing join's against them to achieve the lookup >> effect. >> >> The other approach would be to load the resources into broadcast >> var

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Gene Pang
Hi Dmitry, Yes, Tachyon can help with your use case. You can read and write to Tachyon via the filesystem api ( http://tachyon-project.org/documentation/File-System-API.html). There is a native Java API as well as a Hadoop-compatible API. Spark is also able to interact with Tachyon via the

Re: org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-09 Thread Gene Pang
Yes, the tiered storage feature in Tachyon can address this issue. Here is a link to more information: http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html Thanks, Gene On Wed, Jan 6, 2016 at 8:44 PM, Ted Yu wrote: Have you seen this thread?