Re: How to fix ClosedChannelException

2019-05-16 Thread Bin Fan
Hi, this *java.nio.channels.ClosedChannelException* is often caused by a connection timeout between your Spark executors and Alluxio workers. One simple and quick fix is to increase the timeout value to be larger alluxio.user.network.netty.timeout
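
A minimal sketch of how the property named above could be handed to Spark, assuming the usual approach of passing Alluxio client properties as JVM system properties through `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`. The object and method names are hypothetical, and the value `10min` is only a placeholder, since the snippet above is cut off before any value is given.

```scala
// Sketch only: builds the Spark settings that would carry the Alluxio
// client timeout to the driver and executor JVMs.
object AlluxioTimeoutConf {
  // Property name taken from the message above.
  val TimeoutKey = "alluxio.user.network.netty.timeout"

  // Render the property as a JVM system-property flag.
  def javaOption(timeout: String): String = s"-D$TimeoutKey=$timeout"

  // Spark settings for both driver and executors (assumed delivery mechanism).
  def sparkConfEntries(timeout: String): Map[String, String] = Map(
    "spark.driver.extraJavaOptions"   -> javaOption(timeout),
    "spark.executor.extraJavaOptions" -> javaOption(timeout)
  )
}

object AlluxioTimeoutDemo extends App {
  // "10min" is a placeholder value, not a recommendation.
  AlluxioTimeoutConf.sparkConfEntries("10min").foreach { case (k, v) =>
    println(s"--conf $k=$v")
  }
}
```

Each printed line can be appended to a spark-submit command; the same pairs could also be set on a SparkConf before the session is created. If other JVM options are already configured, they would need to be merged into the same setting rather than overwritten.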

Re: How to configure alluxio cluster with spark in yarn

2019-05-16 Thread Bin Fan
Hi Andy, assuming you are running Spark with YARN, I would recommend deploying Alluxio in the same YARN cluster if you are looking for the best performance. Alluxio can also be deployed separately as a standalone service, but in that case, you may need to transfer data from the Alluxio cluster to your

Is there a difference between --proxy-user or HADOOP_USER_NAME in a non-Kerberized YARN cluster?

2019-05-16 Thread Jeff Evans
Let's suppose we're dealing with a non-secured (i.e. not Kerberized) YARN cluster. When I invoke spark-submit, is there a practical difference between specifying --proxy-user=foo (supposing impersonation is properly set up) or setting the environment variable HADOOP_USER_NAME=foo? Thanks for any

Re: how to get spark-sql lineage

2019-05-16 Thread Arun Mahadevan
You can check out https://github.com/hortonworks-spark/spark-atlas-connector/ On Wed, 15 May 2019 at 19:44, lk_spark wrote: > hi,all: > When I use spark , if I run some SQL to do ETL how can I get > lineage info. I found that , CDH spark have some config about lineage : >

GraphX parameters tuning

2019-05-16 Thread muaz-32
Hi everyone. I am doing my master's thesis on the topic of automatic parameter tuning of graph processing frameworks. Now, we are aiming to optimize GraphX jobs. I have an initial list of parameters which we would like to tune: spark.memory.fraction spark.executor.memory spark.shuffle.compress
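
One way to picture a tuning run over the three parameters visible above is a plain grid of candidate settings, each rendered as spark-submit `--conf` arguments. This is only a sketch: the candidate values and the names `TuningGrid`, `grid` and `toSubmitArgs` are hypothetical, and exhaustive grid search is a stand-in for illustration, not the method proposed in the thread.

```scala
// Sketch only: enumerate combinations of Spark settings for a tuning run.
object TuningGrid {
  // Parameter names come from the message above; the values are made up.
  val candidates: Seq[(String, Seq[String])] = Seq(
    "spark.memory.fraction"  -> Seq("0.4", "0.6"),
    "spark.executor.memory"  -> Seq("2g", "4g"),
    "spark.shuffle.compress" -> Seq("true", "false")
  )

  // Cartesian product of all candidate values, one Map per configuration.
  def grid(params: Seq[(String, Seq[String])]): Seq[Map[String, String]] =
    params.foldLeft(Seq(Map.empty[String, String])) { case (acc, (key, values)) =>
      for (partial <- acc; v <- values) yield partial + (key -> v)
    }

  // Render one configuration as spark-submit arguments, sorted by key.
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}

object TuningGridDemo extends App {
  TuningGrid.grid(TuningGrid.candidates).foreach { conf =>
    println(TuningGrid.toSubmitArgs(conf).mkString(" "))
  }
}
```

The grid grows multiplicatively with every added parameter, so a longer parameter list would likely call for sampling or a smarter search rather than full enumeration.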

Re: Spark job gets hung on cloudera cluster

2019-05-16 Thread Akshay Bhardwaj
One of the reasons that any job running on YARN (Spark, MR, Hive, etc.) can get stuck is a data unavailability issue with HDFS. This can arise if either the Namenode is not reachable or if the particular data block is unavailable due to node failures. Can you check if your YARN service

GC overhead while read a table partition from HIVE

2019-05-16 Thread Shivam Sharma
Hi All, I am getting GC overhead while reading a table from Hive in Spark like: spark.sql("SELECT * FROM some.table where date='2019-05-14' LIMIT 10").show() So when I run the above command in spark-shell, it starts processing *1780 tasks* and goes OOM at a specific partition. 1.

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Russell Spitzer
You are looking at the diagram without looking at the underlying request. The behavior of state collection is dependent on the request and the output mode of the query. In the example you cite val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", )

Re: how to get spark-sql lineage

2019-05-16 Thread Gabor Somogyi
Hi, spark.lineage.enabled is Cloudera specific and doesn't work with vanilla Spark. BR, G On Thu, May 16, 2019 at 4:44 AM lk_spark wrote: > hi,all: > When I use spark , if I run some SQL to do ETL how can I get > lineage info. I found that , CDH spark have some config about lineage :

Re: Spark job gets hung on cloudera cluster

2019-05-16 Thread Rishi Shah
on yarn On Thu, May 16, 2019 at 1:36 AM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi Rishi, > > Are you running spark on YARN or spark's master-slave cluster? > > Akshay Bhardwaj > +91-97111-33849 > > > On Thu, May 16, 2019 at 7:15 AM Rishi Shah > wrote: > >> Any one please? >>

Re: Databricks - number of executors, shuffle.partitions etc

2019-05-16 Thread Rishi Shah
Thanks Ayan, I wasn't aware of such user group specifically for databricks. Thanks for the input, much appreciated! On Wed, May 15, 2019 at 10:07 PM ayan guha wrote: > Well its a databricks question so better be asked in their forum. > > You can set up cluster level params when you create new

[Structured Streaming]: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Sheel Pancholi
Tagging mail to hopefully get a quicker response On Thu 16 May, 2019, 3:08 PM Sheel Pancholi, wrote: > Hello, > > Along with what I sent before, I want to add that I went over the > documentation at > https://github.com/apache/spark/blob/master/docs/structured-streaming-programming-guide.md > >

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Sheel Pancholi
Hello, Along with what I sent before, I want to add that I went over the documentation at https://github.com/apache/spark/blob/master/docs/structured-streaming-programming-guide.md Here is an excerpt: [image: Model] >

Re: Are Spark Dataframes mutable in Structured Streaming?

2019-05-16 Thread Sheel Pancholi
Hello Russell, Thanks for clarifying. I went over the Catalyst Optimizer Deep Dive video at https://www.youtube.com/watch?v=RmUn5vHlevc and that, along with your explanation, made me realize that the DataFrame is the new DStream in Structured Streaming. If my understanding is correct, request