Accelerating Spark SQL / Dataframe using GPUs & Alluxio

2021-04-23 Thread Bin Fan
authors, join our free online meetup <https://go.alluxio.io/community-alluxio-day-2021> next Tuesday morning (April 27) Pacific time. Best, - Bin Fan

Evaluating Apache Spark with Data Orchestration using TPC-DS

2021-04-08 Thread Bin Fan
reach out to me Best regards - Bin Fan

Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

2021-02-18 Thread Bin Fan
Hi everyone! I am sharing this article about running Spark / Presto workloads on AWS: Bursting On-Premise Datalake Analytics and AI Workloads on AWS <https://bit.ly/3qA1Tom> published on AWS blog. Hope you enjoy it. Feel free to discuss with me here <https://alluxio.io/slack>. - Bin

Spark in hybrid cloud in AWS & GCP

2020-12-07 Thread Bin Fan
n-summit-2020/>. The summit's speaker lineup spans creators and committers of Alluxio, Spark, Presto, Tensorflow, and K8s, as well as data engineers and software engineers building cloud-native data and AI platforms at Amazon, Alibaba, Comcast, Facebook, Google, ING Bank, Microsoft, Tencent, and more! - Bin Fan

Building High-performance Lake for Spark using OSS, Hudi, Alluxio

2020-11-23 Thread Bin Fan
Hi Spark Users, Check out this blog on Building High-performance Data Lake using Apache Hudi, Spark and Alluxio at T3Go <https://bit.ly/373RYPi> Cheers - Bin Fan

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Bin Fan
Have you tried deploying Alluxio as a caching layer on top of S3, giving Spark an HDFS-like interface? Like in this article: https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/ On Wed, May 27, 2020 at 6:52 PM Dark Crusader wrote: > Hi Randy, > >
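
A minimal sketch of that setup from the Spark side, assuming a hypothetical Alluxio master at alluxio-master:19998 with the S3 bucket mounted under /s3-mount:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("alluxio-s3-cache-example")
      .getOrCreate()

    // Instead of reading s3://my-bucket/events/ directly, read through the
    // Alluxio namespace that mounts the bucket: hot data is served from the
    // cache and misses are fetched from S3 transparently.
    val df = spark.read.parquet("alluxio://alluxio-master:19998/s3-mount/events/")
    df.groupBy("date").count().show()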

Re: What is directory "/path/_spark_metadata" for?

2019-11-11 Thread Bin Fan
as warnings or as errors in the Alluxio master log? It would be helpful to post the stack trace if it is available. My hypothesis is that Spark in your case was testing whether it could create such a directory. -Bin On Wed, Aug 28, 2019 at 1:59 AM Mark Zhao wrote: > Hey, > > When running Spark on Alluxio

How to avoid spark executor fetching jars in Local mode

2019-10-11 Thread Bin Chen
] o.a.s.e.Executor - Adding file:/tmp/spark-0365e48c-1747-4370-978f-7cd142ef0375/userFiles-3309dc5e-b6d0-4b76-a9aa-8e0a226ddab9/xxx.jar to class loader Thanks Chen Bin

Re: Low cache hit ratio when running Spark on Alluxio

2019-09-19 Thread Bin Fan
need more detailed instructions, feel free to join the Alluxio community channel https://slackin.alluxio.io <https://www.alluxio.io/slack> - Bin Fan

Re: Can I set the Alluxio WriteType in Spark applications?

2019-09-19 Thread Bin Fan
' \
      --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \
    ...

Hope it helps - Bin On Tue, Sep 17, 2019 at 7:53 AM Mark Zhao wrote: > Hi, > > If Spark applications write data into Alluxio, can WriteType be configured? > > Thanks, > Mark
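
For reference, a sketch of the same setting applied programmatically; the property name comes from the thread, while the app name and output path are hypothetical. Driver-side JVM options generally still need to be passed at submit time, since the driver JVM is already running by this point.

    import org.apache.spark.sql.SparkSession

    // Executors launched after this point inherit the Alluxio client option.
    val spark = SparkSession.builder()
      .appName("alluxio-writetype-example")
      .config("spark.executor.extraJavaOptions",
        "-Dalluxio.user.file.writetype.default=CACHE_THROUGH")
      .getOrCreate()

    // CACHE_THROUGH writes synchronously to both Alluxio storage and the
    // under file system, so the data survives Alluxio worker restarts.
    spark.range(100).write.parquet("alluxio://alluxio-master:19998/out/")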

Re: How to fix ClosedChannelException

2019-05-16 Thread Bin Fan
/en/reference/Properties-List.html#alluxio.user.network.netty.timeout> in your Spark jobs. Check out how to run Spark with customized Alluxio properties <https://docs.alluxio.io/os/user/stable/en/compute/Spark.html>. - Bin On Thu, May 9, 2019 at 4:39 A

Re: How to configure alluxio cluster with spark in yarn

2019-05-16 Thread Bin Fan
Spark/YARN cluster. Here is the documentation <https://docs.alluxio.io/os/user/1.8/en/deploy/Running-Alluxio-On-Yarn.html> about deploying Alluxio with YARN. - Bin On Thu, May 9, 2019 at 4:19 AM u9g wrote: > Hey, > > I want to speed up the Spark task runn

Re: cache table vs. parquet table performance

2019-04-17 Thread Bin Fan
separate service (ideally colocated with the Spark servers), of course. But it also enables data sharing across Spark jobs. - Bin On Tue, Jan 15, 2019 at 10:29 AM Tomas Bartalos wrote: > Hello, > > I'm using spark-thrift server and I'm searching for the best performing > solution to query hot

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
g/docs/1.8/en/basic/Web-Interface.html#master-metrics> . If you see a lower hit ratio, increase the Alluxio storage size, and vice versa. Hope this helps, - Bin On Thu, Apr 4, 2019 at 9:29 PM Bin Fan wrote: > Hi Andy, > > It really depends on your workloads. I would suggest allocating 20% of

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
n/advanced/Alluxio-Storage-Management.html#configuring-alluxio-storage ) - Bin On Thu, Mar 21, 2019 at 8:26 AM u9g wrote: > Hey, > > We have a cluster of 10 nodes, each of which has 128GB of memory. We are > about to run Spark and Alluxio on the cluster. We wonder how shall

Re: Questions about caching

2018-12-24 Thread Bin Fan
unning df.write.parquet(alluxioFilePath), and your DataFrames are stored in Alluxio as Parquet files that you can share with more users. One advantage of Alluxio here is that you can manually free the cached data from the memory tier, or set a TTL on the cached data, if you'd like more control over the data.
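
A short sketch of that sharing pattern, continuing from an existing SparkSession `spark` and DataFrame `df`, with a hypothetical Alluxio master address and path:

    // Job A materializes the DataFrame once as Parquet in Alluxio...
    val alluxioFilePath = "alluxio://alluxio-master:19998/shared/events.parquet"
    df.write.parquet(alluxioFilePath)

    // ...and any later job (or another user's job) reads it back from the
    // Alluxio cache without recomputing it or re-reading the original source.
    val shared = spark.read.parquet(alluxioFilePath)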

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Bin Fan
) with Alluxio may benefit performance: http://www.alluxio.com/2016/08/effective-spark-rdds-with-alluxio/ - Bin On Mon, Sep 19, 2016 at 7:56 AM, aka.fe2s <aka.f...@gmail.com> wrote: > Hi folks, > > What has happened with Tachyon / Alluxio in Spark 2? Doc doesn't mention
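
The pattern that article describes replaces the old OFF_HEAP (Tachyon) storage level, whose semantics changed in Spark 2, with saving RDDs as files in Alluxio. A sketch, with a hypothetical path and an existing SparkContext `sc`:

    // Persist the RDD once as files in Alluxio instead of relying on the
    // old OFF_HEAP storage level.
    val rddPath = "alluxio://alluxio-master:19998/rdds/numbers"
    sc.parallelize(1 to 1000000).saveAsTextFile(rddPath)

    // Later jobs (or a restarted application) rebuild the RDD from Alluxio
    // memory instead of recomputing the lineage.
    val restored = sc.textFile(rddPath).map(_.toInt)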

PySpark read from HBase

2016-08-12 Thread Bin Wang
directory? Or can it read from HBase in parallel? I don't see many examples out there, so any help or guidance will be appreciated. Also, we are using Cloudera Hadoop, so there might be a slight delay with the latest Spark release. Best regards, Bin

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Bin Fan
Here is one blog illustrating how to use Spark on Alluxio for this purpose. Hope it will help: http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/ On Mon, Jul 18, 2016 at 6:36 AM, Gene Pang wrote: > Hi, > > If you want to use Alluxio with Spark 2.x, it

Re: Possible to broadcast a function?

2016-06-29 Thread Bin Fan
-on-Alluxio.html - Bin On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote: > Have you looked at Alluxio? (earlier tachyon) > > Best Regards, > Sonal > Founder, Nube Technologies <http://www.nubetech.co> > Reifier at Strata Hadoop World > <https:/

How to close connection in mapPartitions?

2015-10-23 Thread Bin Wang
I use mapPartitions to open connections to Redis. I write it like this:

    val seqs = lines.mapPartitions { lines =>
      val cache = new RedisCache(redisUrl, redisPort)
      val result = lines.map(line => Parser.parseBody(line, cache))
      cache.redisPool.close
      result
    }

But it
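
The usual pitfall with this pattern is that lines.map is lazy, so the pool is closed before any record is parsed. A common fix, sketched here under the assumption that RedisCache and Parser behave as their names suggest, is to force the work before closing:

    val seqs = lines.mapPartitions { iter =>
      val cache = new RedisCache(redisUrl, redisPort)
      // Materialize the partition eagerly so every record is parsed while
      // the connection pool is still open...
      val result = iter.map(line => Parser.parseBody(line, cache)).toList
      // ...and only then release the pool.
      cache.redisPool.close()
      result.iterator
    }

This buffers one partition in memory; if partitions are too large for that, a wrapper iterator that closes the pool when hasNext first returns false preserves the laziness.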

Re: How to close connection in mapPartitions?

2015-10-23 Thread Bin Wang
BTW, "lines" is a DStream. Bin Wang <wbi...@gmail.com>于2015年10月23日周五 下午2:16写道: > I use mapPartitions to open connections to Redis, I write it like this: > > val seqs = lines.mapPartitions { lines => > val cache = new RedisCache(redisUrl, redisPort) >

Dose spark auto invoke StreamingContext.stop while receive kill signal?

2015-09-23 Thread Bin Wang
I'd like the Spark application to be stopped gracefully when it receives a kill signal, so I added this code:

    sys.ShutdownHookThread {
      println("Gracefully stopping Spark Streaming Application")
      ssc.stop(stopSparkContext = true, stopGracefully = true)
      println("Application
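
A completed sketch of that hook (the final message is reconstructed, since the snippet is truncated), together with the flag that Spark 1.4+ provides to install an equivalent hook automatically, assuming a StreamingContext `ssc`:

    import org.apache.spark.SparkConf

    // Option 1: an explicit shutdown hook, as in the question.
    sys.ShutdownHookThread {
      println("Gracefully stopping Spark Streaming Application")
      ssc.stop(stopSparkContext = true, stopGracefully = true)
      println("Application stopped gracefully") // reconstructed message
    }

    // Option 2 (Spark 1.4+): let Spark register the graceful-stop hook itself.
    val conf = new SparkConf().set("spark.streaming.stopGracefullyOnShutdown", "true")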

Is it possible to merged delayed batches in streaming?

2015-09-23 Thread Bin Wang
I'm using Spark Streaming and there may be some delays between batches. I'd like to know whether it is possible to merge delayed batches into one batch for processing. For example, the interval is set to 5 min but the first batch takes 1 hour, so there are many batches delayed. At the end of processing

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
07.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E > > Thanks > Best Regards > > On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbi...@gmail.com> wrote: > >> And here is another question. If I load the DStream from database every >> time I start the job, will the dat

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
the values preloaded from DB > 2. By cleaning the checkpoint in between upgrades, data is loaded > only once > > Hope this helps, > -adrian > > From: Bin Wang > Date: Thursday, September 17, 2015 at 11:27 AM > To: Akhil Das > Cc: user > Subject: Re: How t

Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
torage (like a db or > zookeeper etc) to keep the state (the indexes etc) and then when you deploy > new code they can be easily recovered. > > Thanks > Best Regards > > On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbi...@gmail.com> wrote: > >> I'd like to know if th

How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
I'd like to know if there is a way to recover a DStream from a checkpoint. Because I store state in the DStream, I'd like the state to be recovered when I restart the application and deploy new code.
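
The standard recovery pattern is StreamingContext.getOrCreate; a sketch with a hypothetical checkpoint path, noting the caveat discussed in the replies that checkpoints generally cannot be restored across upgraded application code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app" // hypothetical

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("recoverable-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // Define the whole DStream graph here; on restart it is rebuilt from
      // the checkpoint, so stateful ops like updateStateByKey recover state.
      ssc
    }

    // Recovers from checkpointDir if present, otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()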

Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
And here is another question. If I load the DStream from the database every time I start the job, will the data be loaded when the job fails and auto-restarts? If so, both the checkpoint data and the database data are loaded; won't this be a problem? Bin Wang <wbi...@gmail.com> wrote on Wed, Sep 16, 2015 at

How to clear Kafka offset in Spark streaming?

2015-09-14 Thread Bin Wang
Hi, I'm using Spark Streaming with Kafka and I need to clear the offset and recompute everything. I deleted the checkpoint directory in HDFS and reset the Kafka offset with "kafka-run-class kafka.tools.ImportZkOffsets". I can confirm the offset is set to 0 in Kafka: ~ > kafka-run-class

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
receiver for stream 0: Stopped by driver Tathagata Das <t...@databricks.com> wrote on Sun, Sep 13, 2015 at 4:05 PM: > Maybe the driver got restarted. See the log4j logs of the driver before it > restarted. > > On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang <wbi...@gmail.com> wrote: > &

Data lost in spark streaming

2015-09-11 Thread Bin Wang
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data it has received. But today the historical data in the DStream seems to have been lost suddenly, and the application UI also lost the streaming processing time and all the related data. Could anyone give some hints on debugging this? Thanks.

Re: Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Bin Wang
16, 2015 1:33 PM, Bin Wang wbi...@gmail.com wrote: If I write code like this:

    val rdd = input.map(_.value)
    val f1 = rdd.filter(_ == 1)
    val f2 = rdd.filter(_ == 2)
    ...

Then the DAG of the execution may be this:

          - Filter - ...
    Map <
          - Filter - ...

But the two filters

Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Bin Wang
If I write code like this:

    val rdd = input.map(_.value)
    val f1 = rdd.filter(_ == 1)
    val f2 = rdd.filter(_ == 2)
    ...

Then the DAG of the execution may be this:

          - Filter - ...
    Map <
          - Filter - ...

But the two filters are applied to the same RDD, which means it could be done by
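
Spark's RDD API does not fuse the two filters into one scan by itself; each action on f1 and f2 recomputes the parent unless it is cached. A common workaround, sketched here rather than taken from the thread:

    // Cache the shared parent so the map runs once; each filter then reads
    // the cached blocks instead of rescanning the input.
    val rdd = input.map(_.value).cache()
    val f1 = rdd.filter(_ == 1)
    val f2 = rdd.filter(_ == 2)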

Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
I'm using Spark Streaming with Kafka and submit it to a YARN cluster in yarn-cluster mode, but it hangs at SparkContext.start(). The Kafka config is right, since some events show up in the Streaming tab of the web UI. The attached file is a screenshot of the Jobs tab of the web UI. The code in the

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
Thanks for the help. I set --executor-cores and it works now. I had been using --total-executor-cores and didn't realize it changed. Tathagata Das t...@databricks.com wrote on Fri, Jul 10, 2015 at 3:11 AM: 1. There will be a long-running job with the description start(), as that is the job that runs the receivers.

Re: How to submit streaming application and exit

2015-07-08 Thread Bin Wang
of the streaming app. On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wbi...@gmail.com wrote: I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it from a client node and have spark-submit exit after the application is running. Is it possible

How to submit streaming application and exit

2015-07-07 Thread Bin Wang
I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it from a client node and have spark-submit exit after the application is running. Is it possible?

Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-25 Thread Bin Wang
am having a hard time adding it to the path. This is the final spark-submit command I have, but I still get the class-not-found error. Can anyone help me with this?

    #!/bin/bash
    export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
    /bin/bash $SPARK_HOME/bin/spark-submit \
      --master yarn-client

Re: Specify Python interpreter

2015-05-12 Thread Bin Wang
Hi Felix and Tomoas, Thanks a lot for your information. I figured out the environment variable PYSPARK_PYTHON is the secret key. My current approach is to start iPython notebook on the namenode:

    export PYSPARK_PYTHON=/opt/local/anaconda/bin/ipython
    /opt/local/anaconda/bin/ipython notebook

Specify Python interpreter

2015-05-11 Thread Bin Wang
at the top of my Python code and use spark-submit to distribute it to the cluster. However, since I am using iPython notebook, this is not available as an option. Best, Bin

Spark on top of YARN Compression in iPython notebook

2015-05-10 Thread Bin Wang
application running on top of YARN interactively in the iPython notebook. Here is the code that I have written:

    import sys
    import os
    from pyspark import SparkContext, SparkConf
    sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python')
    sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin

Anaconda iPython notebook working with CDH Spark

2014-12-28 Thread Bin Wang
notebook environment. Best regards, Bin

[GraphX] Is it normal to shuffle write 15GB while the data is only 30MB?

2014-08-08 Thread Bin
Hi All, I am running a customized label propagation using Pregel. After a few iterations, the program becomes slow and wastes a lot of time in mapPartitions (at GraphImpl.scala:184 or VertexRDD.scala:318, or VertexRDD.scala:323). And the amount of shuffle write reaches 15GB, while the size of

Re:[GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-07 Thread Bin
solutions? Thanks a lot! Best, Bin On 2014-08-06 04:54:39, Bin wubin_phi...@126.com wrote: Hi All, Finally I found that the problem occurred when I called the GraphX lib: Exception in thread main java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions

Can't zip RDDs with unequal numbers of partitions

2014-08-05 Thread Bin
how come the partitions were unequal, and how I can control the number of partitions of these RDDs. Can someone give me some advice on this problem? Thanks very much! Best, Bin

Re:Re: Re:Re: [GraphX] The best way to construct a graph

2014-08-01 Thread Bin
Thanks for the advice. But since I am not the administrator of our Spark cluster, I can't do this. Is there any better solution based on the current Spark? At 2014-08-01 02:38:15, shijiaxin shijiaxin...@gmail.com wrote: Have you tried to write another similar function like edgeListFile in the

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Bin
-program-hangs-at-job-finished-toarray-workers-throw-java-util-concurren. I also toArray my data, which was the cause in his case. However, how come it runs OK locally but not on the cluster? The memory of each worker is over 60GB, and my run command is: $SPARK_HOME/bin/spark-class

Re: Spark and HBase

2014-04-08 Thread Bin Wang
and the stats functions Spark has already implemented are still on the roadmap. I am not sure whether it will be good, but it might be something interesting to check out. /usr/bin On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi everybody, these days I looked a bit

Re: Missing Spark URL after staring the master

2014-03-04 Thread Bin Wang
... Thanks, Bin On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: I have one on the Cloudera VM: http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM Which version are you trying to set up on Cloudera.. also which Cloudera version are you using

Spark Streaming Maven Build

2014-03-04 Thread Bin Wang
? assembly-plugin? ..etc)
2. mvn install, or mvn clean install, or mvn install compile assembly:single?
3. After you have a jar file, how do you execute the jar file instead of using bin/run-example...

To answer those people who might ask what you have done (Here is a derivative from