Accelerating Spark SQL / Dataframe using GPUs & Alluxio

2021-04-23 Thread Bin Fan
authors, join our free online meetup <https://go.alluxio.io/community-alluxio-day-2021> next Tuesday morning (April 27) Pacific time. Best, - Bin Fan

Evaluating Apache Spark with Data Orchestration using TPC-DS

2021-04-08 Thread Bin Fan
reach out to me. Best regards, - Bin Fan

Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

2021-02-18 Thread Bin Fan
Hi everyone! I am sharing this article about running Spark / Presto workloads on AWS: Bursting On-Premise Datalake Analytics and AI Workloads on AWS <https://bit.ly/3qA1Tom>, published on the AWS blog. Hope you enjoy it. Feel free to discuss it with me here <https://alluxio.io/slack>. - Bin

Spark in hybrid cloud in AWS & GCP

2020-12-07 Thread Bin Fan
n-summit-2020/>. The summit's speaker lineup spans creators and committers of Alluxio, Spark, Presto, Tensorflow, and K8s, as well as data engineers and software engineers building cloud-native data and AI platforms at Amazon, Alibaba, Comcast, Facebook, Google, ING Bank, Microsoft, Tencent, and more! - Bin Fan

Building High-performance Lake for Spark using OSS, Hudi, Alluxio

2020-11-23 Thread Bin Fan
Hi Spark Users, Check out this blog on Building High-performance Data Lake using Apache Hudi, Spark and Alluxio at T3Go <https://bit.ly/373RYPi> Cheers - Bin Fan

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Bin Fan
Have you tried deploying Alluxio as a caching layer on top of S3, giving Spark an HDFS-like interface? As in this article: https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/ On Wed, May 27, 2020 at 6:52 PM Dark Crusader wrote: > Hi Randy, > >
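A minimal sketch of that setup on the Spark side, assuming an Alluxio cluster whose under store (or a mount point) is backed by the S3 bucket; the master host, port, path, and column name below are illustrative assumptions:

    // Read through Alluxio instead of s3a:// directly; Alluxio caches hot data
    // in memory/SSD close to the Spark executors, so repeated scans avoid S3.
    val df = spark.read.parquet("alluxio://master:19998/data.parquet")
    df.groupBy("key").count().show()  // subsequent scans hit the Alluxio cache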

Re: What is directory "/path/_spark_metadata" for?

2019-11-11 Thread Bin Fan
Hey Mark, I believe this is the subdirectory used to store metadata about which output files are valid; see the comment in the code: https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L33 Do you see the exception
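For illustration, a minimal sketch of the kind of query that produces that directory (the paths here are hypothetical): any streaming query writing to a file sink maintains <output path>/_spark_metadata as a log of committed files, which readers use to skip partial output.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("metadata-dir-example").getOrCreate()

    // Writing a stream to a file sink creates /tmp/out/_spark_metadata.
    val query = spark.readStream
      .format("rate")                              // built-in test source
      .load()
      .writeStream
      .format("parquet")
      .option("path", "/tmp/out")                  // this dir gains _spark_metadata
      .option("checkpointLocation", "/tmp/ckpt")
      .start()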

Re: Low cache hit ratio when running Spark on Alluxio

2019-09-19 Thread Bin Fan
If you need more detailed instructions, feel free to join the Alluxio community channel https://www.alluxio.io/slack - Bin Fan

Re: Can I set the Alluxio WriteType in Spark applications?

2019-09-19 Thread Bin Fan
Hi Mark, You can follow the instructions here: https://docs.alluxio.io/os/user/stable/en/compute/Spark.html#customize-alluxio-user-properties-for-individual-spark-jobs Something like this:

    $ spark-submit \
      --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH'
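For context, a minimal sketch of the job side once that property is set (the dataframe and output path are illustrative assumptions):

    // The Alluxio client reads alluxio.user.file.writetype.default from the JVM
    // system property set above, so this write uses CACHE_THROUGH: it lands in
    // Alluxio memory and is synchronously persisted to the under store.
    df.write.parquet("alluxio://master:19998/output.parquet")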

Re: How to fix ClosedChannelException

2019-05-16 Thread Bin Fan
Hi, this *java.nio.channels.ClosedChannelException* is often caused by a connection timeout between your Spark executors and Alluxio workers. One simple and quick fix is to increase the value of alluxio.user.network.netty.timeout
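One way to pass a larger timeout to the Alluxio client on the executors, as a minimal sketch (the 10min value is an arbitrary example, and the driver-side option would still need to be set at submit time):

    import org.apache.spark.sql.SparkSession

    // Assumption: the Alluxio client picks this up as a JVM system property.
    val spark = SparkSession.builder()
      .appName("alluxio-timeout-example")
      .config("spark.executor.extraJavaOptions",
        "-Dalluxio.user.network.netty.timeout=10min")
      .getOrCreate()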

Re: How to configure alluxio cluster with spark in yarn

2019-05-16 Thread Bin Fan
Hi Andy, assuming you are running Spark on YARN, I would recommend deploying Alluxio in the same YARN cluster if you are looking for the best performance. Alluxio can also be deployed separately as a standalone service, but in that case you may need to transfer data from the Alluxio cluster to your

Re: cache table vs. parquet table performance

2019-04-17 Thread Bin Fan
Hi Tomas, One option is to cache your table as Parquet files in Alluxio (which can serve as an in-memory distributed caching layer for Spark in your case). The code on the Spark side will look like:

    df.write.parquet("alluxio://master:19998/data.parquet")
    df = spark.read.parquet("alluxio://master:19998/data.parquet")

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
g/docs/1.8/en/basic/Web-Interface.html#master-metrics>. If you see a low hit ratio, increase the Alluxio storage size, and vice versa. Hope this helps, - Bin On Thu, Apr 4, 2019 at 9:29 PM Bin Fan wrote: > Hi Andy, > > It really depends on your workloads. I would suggest allocating 20% of

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
Hi Andy, It really depends on your workloads. I would suggest allocating 20% of the size of your input data set as a starting point and seeing how it works (e.g., for a 1 TB input data set, start with roughly 200 GB of Alluxio worker storage across the cluster). Also, depending on your data source as the under store of Alluxio, if it is remote (e.g., cloud storage like S3 or GCS), you can perhaps use

Re: Questions about caching

2018-12-24 Thread Bin Fan
Hi Andrew, Since you mentioned the alternative solution with Alluxio, here is a more comprehensive tutorial on caching Spark dataframes in Alluxio: https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio Namely, caching your dataframe is simply running
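A minimal sketch of the pattern that tutorial describes (the master host, port, and path are illustrative assumptions):

    // Persist the dataframe as Parquet in Alluxio, off the Spark JVM heap...
    df.write.parquet("alluxio://master:19998/cached_df.parquet")
    // ...then any job, even a separate Spark application, can read it back.
    val cached = spark.read.parquet("alluxio://master:19998/cached_df.parquet")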

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Bin Fan
Hi, If you are looking for how to run Spark on Alluxio (formerly Tachyon), here is the documentation from the Alluxio doc site: http://www.alluxio.org/docs/master/en/Running-Spark-on-Alluxio.html It still works for Spark 2.x. The Alluxio team has also published articles on when and why to run Spark (2.x)
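Once the Alluxio client jar is on the Spark classpath, accessing Alluxio from Spark 2.x is just a matter of using an alluxio:// URI; a minimal sketch (host, port, and path are illustrative assumptions):

    // Works for RDDs and DataFrames alike; no code changes beyond the URI scheme.
    val lines = sc.textFile("alluxio://master:19998/input.txt")
    println(lines.count())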

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Bin Fan
Here is one blog illustrating how to use Spark on Alluxio for this purpose. Hope it helps: http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/ On Mon, Jul 18, 2016 at 6:36 AM, Gene Pang wrote: > Hi, > > If you want to use Alluxio with Spark 2.x, it

Re: Possible to broadcast a function?

2016-06-29 Thread Bin Fan
Following this suggestion, Aaron, you may take a look at Alluxio as an off-heap in-memory data store for the input/output of Spark jobs, if that works for you. See the intro on how to run Spark with Alluxio as data input/output.