Accelerating Spark SQL / Dataframe using GPUs & Alluxio

2021-04-23 Thread Bin Fan
authors, join our free online meetup <https://go.alluxio.io/community-alluxio-day-2021> next Tuesday morning (April 27) Pacific time. Best, - Bin Fan

Evaluating Apache Spark with Data Orchestration using TPC-DS

2021-04-08 Thread Bin Fan
reach out to me Best regards - Bin Fan

Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

2021-02-18 Thread Bin Fan
Hi everyone! I am sharing this article about running Spark / Presto workloads on AWS: Bursting On-Premise Datalake Analytics and AI Workloads on AWS <https://bit.ly/3qA1Tom>, published on the AWS blog. Hope you enjoy it. Feel free to discuss with me here <https://alluxio.io/slack>. - Bin

Spark in hybrid cloud in AWS & GCP

2020-12-07 Thread Bin Fan
n-summit-2020/>. The summit's speaker lineup spans creators and committers of Alluxio, Spark, Presto, Tensorflow, and K8s, as well as data engineers and software engineers building cloud-native data and AI platforms at Amazon, Alibaba, Comcast, Facebook, Google, ING Bank, Microsoft, Tencent, and more! - Bin Fan

Building High-performance Lake for Spark using OSS, Hudi, Alluxio

2020-11-23 Thread Bin Fan
Hi Spark Users, Check out this blog on Building High-performance Data Lake using Apache Hudi, Spark and Alluxio at T3Go <https://bit.ly/373RYPi> Cheers - Bin Fan

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Bin Fan
Have you tried deploying Alluxio as a caching layer on top of S3, giving Spark an HDFS-like interface? See this article: https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/ On Wed, May 27, 2020 at 6:52 PM Dark Crusader wrote: > Hi Randy, > > Ye
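
A minimal sketch of what this looks like from the Spark side, assuming the S3 bucket has already been mounted into the Alluxio namespace; the master address, port, and path below are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("alluxio-s3-read").getOrCreate()

    // Read through the Alluxio namespace instead of s3a://. This assumes the
    // Alluxio client jar is on the Spark classpath and the bucket is mounted.
    val df = spark.read.parquet("alluxio://alluxio-master:19998/s3-mount/events/")
    df.count()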

Re: What is directory "/path/_spark_metadata" for?

2019-11-11 Thread Bin Fan
as warnings or as errors in the Alluxio master log? It would be helpful to post the stack trace if it is available. My hypothesis is that Spark in your case was testing creating such a directory. -Bin On Wed, Aug 28, 2019 at 1:59 AM Mark Zhao wrote: > Hey, > > When running Spark on Alluxio

How to avoid spark executor fetching jars in Local mode

2019-10-11 Thread Bin Chen
] o.a.s.e.Executor - Adding file:/tmp/spark-0365e48c-1747-4370-978f-7cd142ef0375/userFiles-3309dc5e-b6d0-4b76-a9aa-8e0a226ddab9/xxx.jar to class loader Thanks Chen Bin

Re: Low cache hit ratio when running Spark on Alluxio

2019-09-19 Thread Bin Fan
need more detailed instructions, feel free to join the Alluxio community channel https://slackin.alluxio.io <https://www.alluxio.io/slack> - Bin Fan

Re: Can I set the Alluxio WriteType in Spark applications?

2019-09-19 Thread Bin Fan
ROUGH' \--conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH' \... Hope it helps - Bin On Tue, Sep 17, 2019 at 7:53 AM Mark Zhao wrote: > Hi, > > If Spark applications write data into alluxio, can WriteType be configured? > > Thanks, > Mark > >
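
For reference, a sketch of setting the same property programmatically through SparkConf instead of on the spark-submit command line; the property name and CACHE_THROUGH value are taken from the snippet above:

    import org.apache.spark.SparkConf

    // Pass the Alluxio write type to the executor and driver JVMs as a system
    // property; CACHE_THROUGH writes to Alluxio storage and the under store.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-Dalluxio.user.file.writetype.default=CACHE_THROUGH")
      .set("spark.driver.extraJavaOptions",
        "-Dalluxio.user.file.writetype.default=CACHE_THROUGH")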

Re: How to fix ClosedChannelException

2019-05-16 Thread Bin Fan
/en/reference/Properties-List.html#alluxio.user.network.netty.timeout> in your Spark jobs. Check out how to run Spark with customized Alluxio properties <https://docs.alluxio.io/os/user/stable/en/compute/Spark.html?utm_source=spark&utm_medium=mailinglist> . - Bin On Thu, May 9, 2019 at 4:

Re: How to configure alluxio cluster with spark in yarn

2019-05-16 Thread Bin Fan
Spark/YARN cluster. Here is the documentation <https://docs.alluxio.io/os/user/1.8/en/deploy/Running-Alluxio-On-Yarn.html?utm_source=spark> about deploying Alluxio with YARN. - Bin On Thu, May 9, 2019 at 4:19 AM u9g wrote: > Hey, > > I want to speed up the Spark task runn

Re: cache table vs. parquet table performance

2019-04-17 Thread Bin Fan
separate service (ideally colocated with Spark servers), of course, but it also enables data sharing across Spark jobs. - Bin On Tue, Jan 15, 2019 at 10:29 AM Tomas Bartalos wrote: > Hello, > > I'm using spark-thrift server and I'm searching for the best performing > solution to

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
g/docs/1.8/en/basic/Web-Interface.html#master-metrics> . If you see a lower hit ratio, increase the Alluxio storage size, and vice versa. Hope this helps, - Bin On Thu, Apr 4, 2019 at 9:29 PM Bin Fan wrote: > Hi Andy, > > It really depends on your workloads. I would suggest allocating 20% of

Re: How shall I configure the Spark executor memory size and the Alluxio worker memory size on a machine?

2019-04-04 Thread Bin Fan
n/advanced/Alluxio-Storage-Management.html#configuring-alluxio-storage ) - Bin On Thu, Mar 21, 2019 at 8:26 AM u9g wrote: > Hey, > > We have a cluster of 10 nodes, each of which has 128GB of memory. We are > about to run Spark and Alluxio on the cluster. We wonder how shall > a

Re: Questions about caching

2018-12-24 Thread Bin Fan
unning df.write.parquet(alluxioFilePath) and your dataframes are stored in Alluxio as parquet files, and you can share them with more users. One advantage of Alluxio here is that you can manually free the cached data from the memory tier, or set a TTL on the cached data if you'd like more control over it.

Re: off heap to alluxio/tachyon in Spark 2

2016-09-19 Thread Bin Fan
) with Alluxio may benefit performance: http://www.alluxio.com/2016/08/effective-spark-rdds-with-alluxio/ - Bin On Mon, Sep 19, 2016 at 7:56 AM, aka.fe2s wrote: > Hi folks, > > What has happened with Tachyon / Alluxio in Spark 2? The docs no longer > mention it. > > -- > Oleksiy Dyagilev >

PySpark read from HBase

2016-08-12 Thread Bin Wang
ectory? Or can it read from HBase in parallel? I don't see many examples out there, so any help or guidance will be appreciated. Also, we are using Cloudera Hadoop, so there might be a slight delay with the latest Spark release. Best regards, Bin

Re: Question About OFF_HEAP Caching

2016-07-18 Thread Bin Fan
Here is one blog illustrating how to use Spark on Alluxio for this purpose. Hope it will help: http://www.alluxio.com/2016/04/getting-started-with-alluxio-and-spark/ On Mon, Jul 18, 2016 at 6:36 AM, Gene Pang wrote: > Hi, > > If you want to use Alluxio with Spark 2.x, it is recommended to write

Re: Possible to broadcast a function?

2016-06-29 Thread Bin Fan
-Alluxio.html - Bin On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal wrote: > Have you looked at Alluxio? (earlier tachyon) > > Best Regards, > Sonal > Founder, Nube Technologies <http://www.nubetech.co> > Reifier at Strata Hadoop World > <https://www.youtube.com/watch?v=eD3LkpPQ

Re: How to close connection in mapPartitions?

2015-10-22 Thread Bin Wang
BTW, "lines" is a DStream. Bin Wang 于2015年10月23日周五 下午2:16写道: > I use mapPartitions to open connections to Redis, I write it like this: > > val seqs = lines.mapPartitions { lines => > val cache = new RedisCache(redisUrl, redisPort) > val result = lines.

How to close connection in mapPartitions?

2015-10-22 Thread Bin Wang
I use mapPartitions to open connections to Redis, and I write it like this: val seqs = lines.mapPartitions { lines => val cache = new RedisCache(redisUrl, redisPort) val result = lines.map(line => Parser.parseBody(line, cache)) cache.redisPool.close result } But it see
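
The usual gotcha here is that the map over the iterator is lazy, so the pool can be closed before any record is parsed. A sketch of one common fix, reusing the hypothetical RedisCache, Parser, and connection parameters from the question:

    val seqs = lines.mapPartitions { iter =>
      val cache = new RedisCache(redisUrl, redisPort)
      // Materialize the partition first so every record is parsed while the
      // Redis connection is still open, then close the pool.
      val result = iter.map(line => Parser.parseBody(line, cache)).toList
      cache.redisPool.close()
      result.iterator
    }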

Re: Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
the SparkConf > "spark.streaming.stopGracefullyOnShutdown" to "true" > > Note to self, document this in the programming guide. > > On Wed, Sep 23, 2015 at 3:33 AM, Bin Wang wrote: > >> I'd like the spark application to be stoppe
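
A minimal sketch of the setting mentioned above, assuming an ordinary SparkConf-based setup and a hypothetical app name:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-streaming-app")
      // Ask Spark to stop the StreamingContext gracefully on shutdown (e.g. on
      // SIGTERM) instead of relying on a hand-rolled shutdown hook.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")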

Does Spark auto-invoke StreamingContext.stop when it receives a kill signal?

2015-09-23 Thread Bin Wang
I'd like the Spark application to be stopped gracefully when it receives a kill signal, so I added this code: sys.ShutdownHookThread { println("Gracefully stopping Spark Streaming Application") ssc.stop(stopSparkContext = true, stopGracefully = true) println("Application stopped")

Is it possible to merge delayed batches in streaming?

2015-09-23 Thread Bin Wang
I'm using Spark Streaming and there may be some delays between batches. I'd like to know whether it is possible to merge delayed batches into one batch for processing? For example, the interval is set to 5 min but the first batch takes 1 hour, so many batches are delayed. In the end of processing fo

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
m DB > 2. By cleaning the checkpoint in between upgrades, data is loaded > only once > > Hope this helps, > -adrian > > From: Bin Wang > Date: Thursday, September 17, 2015 at 11:27 AM > To: Akhil Das > Cc: user > Subject: Re: How to recover DStream from

Re: How to recover DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
j...@mail.gmail.com%3E > > Thanks > Best Regards > > On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang wrote: > >> And here is another question. If I load the DStream from database every >> time I start the job, will the data be loaded when the job is failed and >> auto

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
And here is another question. If I load the DStream from the database every time I start the job, will the data be loaded when the job fails and auto-restarts? If so, both the checkpoint data and the database data are loaded; won't this be a problem? Bin Wang wrote on Wed, Sep 16, 2015 at 8:40 PM:

Re: How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
keeper etc) to keep the state (the indexes etc) and then when you deploy > new code they can be easily recovered. > > Thanks > Best Regards > > On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang wrote: > >> I'd like to know if there is a way to recovery dstream from checkpoint

How to recover DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
I'd like to know if there is a way to recover a DStream from a checkpoint. Because I store state in the DStream, I'd like the state to be recovered when I restart the application and deploy new code.
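
A sketch of the checkpoint-recovery pattern typically used for this, with a hypothetical checkpoint path and a placeholder where the DStreams are built; note that deploying changed code generally invalidates an existing checkpoint:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("my-streaming-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define DStreams and stateful operations (updateStateByKey, etc.) here
      ssc
    }

    // Rebuild the DStream graph and its state from the checkpoint if one exists,
    // otherwise create a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()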

Re: How to clear Kafka offset in Spark streaming?

2015-09-14 Thread Bin Wang
I think I've found the reason. It seems that the smallest offset is not 0, and I should not set the offset to 0. Bin Wang wrote on Mon, Sep 14, 2015 at 2:46 PM: > Hi, > > I'm using Spark Streaming with Kafka and I need to clear the offset and > recompute everything. I deleted checkp
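
For reference, a sketch of letting the consumer fall back to the earliest offset Kafka still retains instead of forcing offset 0; the parameter names follow the 0.8-era consumer used by Spark Streaming at the time, and the broker addresses are hypothetical:

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      // Start from the smallest offset still available in the topic rather than
      // a hard-coded 0, which may already have been removed by log retention.
      "auto.offset.reset" -> "smallest"
    )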

How to clear Kafka offset in Spark streaming?

2015-09-13 Thread Bin Wang
Hi, I'm using Spark Streaming with Kafka and I need to clear the offset and recompute everything. I deleted the checkpoint directory in HDFS and reset the Kafka offset with "kafka-run-class kafka.tools.ImportZkOffsets". I can confirm the offset is set to 0 in Kafka: ~ > kafka-run-class kafka.tools.Consu

Re: Data lost in spark streaming

2015-09-13 Thread Bin Wang
tered receiver for stream 0: Stopped by driver Tathagata Das wrote on Sun, Sep 13, 2015 at 4:05 PM: > Maybe the driver got restarted. See the log4j logs of the driver before it > restarted. > > On Thu, Sep 10, 2015 at 11:32 PM, Bin Wang wrote: > >> I'm using spark streaming 1.4.0 and h

Data lost in spark streaming

2015-09-10 Thread Bin Wang
I'm using Spark Streaming 1.4.0 and have a DStream that holds all the data it has received. But today the historical data in the DStream seems to have been lost suddenly, and the application UI also lost the streaming processing time and all the related data. Could anyone give some hints on how to debug this? Thanks.

Re: Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Bin Wang
l 16, 2015 1:33 PM, "Bin Wang" wrote: > >> If I write code like this: >> >> val rdd = input.map(_.value) >> val f1 = rdd.filter(_ == 1) >> val f2 = rdd.filter(_ == 2) >> ... >> >> Then the DAG of the execution may be this: >> >>

Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Bin Wang
If I write code like this: val rdd = input.map(_.value) val f1 = rdd.filter(_ == 1) val f2 = rdd.filter(_ == 2) ... Then the DAG of the execution may be this: -> Filter -> ... Map -> Filter -> ... But the two filters operate on the same RDD, which means it could be done by
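
In the RDD API the two filters are not merged into one pass; each action on f1 or f2 recomputes the shared map unless the parent is cached. A minimal sketch of the usual workarounds, reusing the names from the question and assuming input is an existing RDD:

    val rdd = input.map(_.value)

    // Option 1: cache the shared parent so the map is computed only once even
    // though f1 and f2 are used in separate actions.
    rdd.cache()
    val f1 = rdd.filter(_ == 1)
    val f2 = rdd.filter(_ == 2)

    // Option 2: compute both results in a single pass, e.g. count each value once.
    val counts = rdd.filter(v => v == 1 || v == 2).map(v => (v, 1L)).reduceByKey(_ + _)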

Re: Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
Thanks for the help. I set --executor-cores and it works now. I had used --total-executor-cores and didn't realize it had changed. Tathagata Das wrote on Fri, Jul 10, 2015 at 3:11 AM: > 1. There will be a long-running job with the description "start()", as that is > the job that is running the receivers. It will never e
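
For context, a receiver-based streaming job needs more cores than receivers, otherwise the receivers occupy all task slots and no batch is ever processed. A sketch of the corresponding settings, with hypothetical core and executor counts (on YARN these map to the --executor-cores and --num-executors flags):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-on-yarn")
      // Give each executor enough cores to run its receiver *and* process batches;
      // a single core per executor would be consumed entirely by the receiver.
      .set("spark.executor.cores", "4")         // equivalent to --executor-cores 4
      .set("spark.executor.instances", "2")     // equivalent to --num-executors 2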

Spark Streaming Hangs on Start

2015-07-09 Thread Bin Wang
I'm using Spark Streaming with Kafka and submit it to a YARN cluster in "yarn-cluster" mode, but it hangs at SparkContext.start(). The Kafka config is right, since it can show some events in the "Streaming" tab of the web UI. The attached file is a screenshot of the "Jobs" tab of the web UI. The code in th

Re: How to submit streaming application and exit

2015-07-08 Thread Bin Wang
for > the lifetime of the streaming app. > > On Wed, Jul 8, 2015 at 1:13 PM, Bin Wang wrote: > >> I'm writing a streaming application and want to use spark-submit to >> submit it to a YARN cluster. I'd like to submit it in a client node and >> exit spar

How to submit streaming application and exit

2015-07-07 Thread Bin Wang
I'm writing a streaming application and want to use spark-submit to submit it to a YARN cluster. I'd like to submit it from a client node and have spark-submit exit once the application is running. Is that possible?

Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-25 Thread Bin Wang
am having a hard time adding it to the path. This is the final spark-submit command I have, but I still get the class-not-found error. Can anyone help me with this? #!/bin/bash export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark /bin/bash $SPARK_HOME/bin/spark-submit \ --master yarn-client

Re: Specify Python interpreter

2015-05-12 Thread Bin Wang
Hi Felix and Tomoas, Thanks a lot for your information. I figured out the environment variable PYSPARK_PYTHON is the secret key. My current approach is to start iPython notebook on the namenode, export PYSPARK_PYTHON=/opt/local/anaconda/bin/ipython /opt/local/anaconda/bin/ipython notebook

Specify Python interpreter

2015-05-11 Thread Bin Wang
"#!/opt/local/anaconda" at the top of my Python code and use spark-submit to distribute it to the cluster. However, since I am using iPython notebook, this is not available as an option. Best, Bin

Spark on top of YARN Compression in iPython notebook

2015-05-10 Thread Bin Wang
application running on top of YARN interactively in the iPython notebook: Here is the code that I have written: import sys import os from pyspark import SparkContext, SparkConf sys.path.append('/home/hadoop/myuser/spark-1.3.1-bin-hadoop2.4/python') sys.path.append('/home/hadoop/myuser/

Using Pandas/Scikit Learning in Pyspark

2015-05-08 Thread Bin Wang
stalled on every node. Should I install Anaconda Python on all of them? If so, what is the modern way of managing the Python ecosystem on the cluster? I am a big fan of Python so please guide me. Best regards, Bin

Anaconda iPython notebook working with CDH Spark

2014-12-28 Thread Bin Wang
Python notebook environment. Best regards, Bin

Running time bottleneck on a few worker

2014-08-12 Thread Bin
it related to the partition strategy? For now, I used the default partition strategy. Looking for advice! Thanks very much! Best, Bin

[GraphX] Is it normal to shuffle write 15GB while the data is only 30MB?

2014-08-08 Thread Bin
Hi All, I am running a customized label propagation using Pregel. After a few iterations, the program becomes slow and wastes a lot of time in mapPartitions (at GraphImpl.scala:184 or VertexRDD.scala:318, or VertexRDD.scala:323). And the amount of shuffle write reaches 15GB, while the size of

Re:[GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-07 Thread Bin
ere any other better solutions? Thanks a lot! Best, Bin On 2014-08-06 04:54:39, "Bin" wrote: Hi All, Finally I found that the problem occurred when I called the graphx lib: " Exception in thread "main" java.lang.IllegalArgumentException: Can't zip R

[GraphX] Can't zip RDDs with unequal numbers of partitions

2014-08-06 Thread Bin
Graph.triplets.foreach(tri=>println()) " Any advice? Thanks a lot! Best, Bin

Can't zip RDDs with unequal numbers of partitions

2014-08-05 Thread Bin
ut I couldn't think of a better way. I am confused about how the partitions became unequal, and how I can control the number of partitions of these RDDs. Can someone give me some advice on this problem? Thanks very much! Best, Bin

[GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Bin
. Looking for advice! Thanks a lot! Best, Bin

Re:Re: Re:Re: [GraphX] The best way to construct a graph

2014-07-31 Thread Bin
Thanks for the advice. But since I am not the administrator of our Spark cluster, I can't do this. Is there any better solution based on the current Spark? At 2014-08-01 02:38:15, "shijiaxin" wrote: >Have you tried to write another similar function like edgeListFile in the >same file, and then

Re:Re: [GraphX] The best way to construct a graph

2014-07-31 Thread Bin
It seems that I cannot specify the weights. I have also tried to imitate GraphLoader.edgeListFile, but I can't call the methods and classes used in GraphLoader.edgeListFile. Have you successfully done this? At 2014-08-01 12:47:08, "shijiaxin" wrote: >I think you can try GraphLoader.edgeListFil

Re:Re: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Bin
Hi Haiyang, Thanks, that really was the reason. Best, Bin On 2014-07-31 08:05:34, "Haiyang Fu" wrote: Have you tried increasing the driver memory? On Thu, Jul 31, 2014 at 3:54 PM, Bin wrote: Hi All, The data size of my task is about 30mb. It runs smoothly in local mode. Howev

[GraphX] The best way to construct a graph

2014-07-31 Thread Bin
edgeRDD, respectively. Then create the graph using Graph(vertices, edges). I wonder whether there is a better way to do this? Looking for advice! Thanks very much! Best, Bin
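
A sketch of one way to build such a graph while keeping an edge weight, assuming sc is the SparkContext and a hypothetical whitespace-separated edge list file:

    import org.apache.spark.graphx.{Edge, Graph}

    // Each line is "srcId dstId weight"; keep the weight as the edge attribute.
    val edges = sc.textFile("hdfs:///graph/edges.txt").map { line =>
      val Array(src, dst, w) = line.split("\\s+")
      Edge(src.toLong, dst.toLong, w.toDouble)
    }

    // Derive the vertex set from the edges; every vertex gets the default
    // attribute 1.0, which can be overwritten later with joinVertices.
    val graph = Graph.fromEdges(edges, 1.0)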

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

2014-07-31 Thread Bin
-program-hangs-at-job-finished-toarray-workers-throw-java-util-concurren. I also toArray my data, which was the cause in his case. However, why does it run OK in local mode but not in the cluster? The memory of each worker is over 60g, and my run command is: "$SPARK_HOME/bin/spark-

[GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Bin
Hi All, I wonder how to access a vertex via its vertexId? I need to get a vertex's attributes after running a graph algorithm. Thanks very much! Best, Bin
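
A minimal sketch of reading one vertex's attribute back after an algorithm has run, assuming graph is the result graph and the vertex id is known:

    import org.apache.spark.graphx.VertexId

    val targetId: VertexId = 42L   // hypothetical vertex id

    // graph.vertices is an RDD[(VertexId, VD)], so a filter + collect is enough
    // for a few lookups; for many point lookups, collect the vertices locally.
    val attrs = graph.vertices.filter { case (id, _) => id == targetId }.collect()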

Re: Scala vs Python performance differences

2014-04-14 Thread Bin Wang
At least, Spark Streaming doesn't support Python at this moment, right? On Mon, Apr 14, 2014 at 6:48 PM, Andrew Ash wrote: > Hi Spark users, > > I've always done all my Spark work in Scala, but occasionally people ask > about Python and its performance impact vs the same algorithm > implementat

Re: Spark and HBase

2014-04-08 Thread Bin Wang
uster with HBase preconfigured and give it a try. Sorry, I cannot provide a more detailed explanation or help. On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier wrote: > Thanks for the quick reply Bin. Phenix is something I'm going to try for > sure but it seems somehow useless if I can

Re: Spark and HBase

2014-04-08 Thread Bin Wang
group and the "stats" functions spark has already implemented are still on the roadmap. I am not sure whether it will be good but might be something interesting to check out. /usr/bin On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier wrote: > Hi to everybody, > > in the

Spark Streaming Maven Build

2014-03-04 Thread Bin Wang
? assembly-plugin?..etc) 2. mvn install or mvn clean install or mvn install compile assembly:single? 3. after you have a jar file, then how do you execute the jar file instead of using bin/run-example... To answer those people who might ask what you have done (Here is a derivative from the

Re: Missing Spark URL after starting the master

2014-03-04 Thread Bin Wang
uster... Thanks, Bin On Tue, Mar 4, 2014 at 10:59 AM, Mayur Rustagi wrote: > I have on cloudera vm > http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM > which version are you trying to setup on cloudera.. also which cloudera > version are you using... &g

Re: Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
that you have done since you've already made it! Bin On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski wrote: > I should add that in this setup you really do not need to look for the > printout of the master node's IP - you set it yourself a priori. If anyone > is interested

Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
Hi there, I have a CDH cluster set up, and I tried using the Spark parcel that comes with Cloudera Manager, but it turned out it doesn't even have the run-example shell command in the bin folder. Then I removed it from the cluster and cloned incubator-spark onto the name node of my cluster