Re: cache table vs. parquet table performance

2019-04-17 Thread Bin Fan
Hi Tomas, one option is to cache your table as Parquet files in Alluxio (which can serve as an in-memory distributed caching layer for Spark in your case). The code in Spark will look like: df.write.parquet("alluxio://master:19998/data.parquet") followed by df = …
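A minimal PySpark sketch of that pattern, assuming an Alluxio master reachable at master:19998; the application name, placeholder DataFrame, and path are illustrative rather than taken from the original message:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-parquet-cache").getOrCreate()
    df = spark.range(1000)  # placeholder DataFrame; substitute the real table

    # Write the table once as Parquet into Alluxio, which keeps the files
    # in its in-memory tier close to the Spark executors.
    df.write.parquet("alluxio://master:19998/data.parquet")

    # Later jobs (or later stages) read the same path back; the read is
    # served from Alluxio's cache instead of the slower backing store.
    df = spark.read.parquet("alluxio://master:19998/data.parquet")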

Re: Boto3 library send to pyspark

2019-04-17 Thread Gourav Sengupta
Hi, I am not sure about a different environment, but it is something that you can look into as well. Why would you need different environments for your job? Regards, Gourav. On Wed, Apr 17, 2019 at 11:43 AM Gorka Bravo Martinez <gorka.bravo.marti...@cern.ch> wrote: …

Re: Spark SQL API taking longer time than DF API.

2019-04-17 Thread Yeikel
Please share the results of df.explain() [1] for both. That should give us some clues about what the differences are. [1] https://github.com/apache/spark/blob/e1c90d66bbea5b4cb97226610701b0389b734651/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L499
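For reference, a hedged sketch of how the two plans can be compared side by side; the input path, view name, and column are placeholders, not details from the original thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/table")   # placeholder input
    df.createOrReplaceTempView("t")

    # The same aggregation expressed through both APIs; if the two are
    # truly equivalent, the physical plans printed below should be
    # essentially the same apart from column naming.
    spark.sql("SELECT col, COUNT(*) FROM t GROUP BY col").explain()
    df.groupBy("col").count().explain()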

Re: Parallelize Join Problem

2019-04-17 Thread asma zgolli
How can I figure out if the data is skewed? Are there some statistics I can check? On Wed, Apr 17, 2019 at 20:12, Yeikel wrote: …
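One common check, sketched here with placeholder names (the input path and join_key column are illustrative, not from the thread): count the rows per join key and see whether a few keys dominate.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    left_df = spark.read.parquet("/path/to/left_table")  # placeholder for one side of the join

    # Rows per join key; if a handful of keys hold most of the rows, the
    # join is skewed and those tasks will run far longer than the rest.
    (left_df
        .groupBy("join_key")
        .count()
        .orderBy("count", ascending=False)
        .show(20))

The Spark UI tells a similar story: in a skewed join, a few tasks in the shuffle stage show much larger input sizes and durations than the median.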

Re: Spark job running for long time

2019-04-17 Thread Yeikel
Can you share the output of df.explain()?

Re: Parallelize Join Problem

2019-04-17 Thread Yeikel
It is hard to tell, but your data may be skewed.

Re: Spark job running for long time

2019-04-17 Thread rajat kumar
Hi, thanks for the response! We are doing 12 left outer joins. I also see GC colored red in the Spark UI, so it seems GC is taking time as well. We have tried using Kryo serialization and tried giving more memory to the executors as well as the driver, but it didn't work. On Wed, 17 Apr 2019, 23:35 Yeikel wrote: …
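For context, a sketch of what those two attempts (Kryo plus larger executor memory) typically look like in a PySpark job. The values are illustrative, not the poster's actual settings; the shuffle-partition knob is an added assumption, not something reported in the thread; and driver memory generally has to be set on spark-submit rather than in code.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        # Kryo serialization for shuffled and cached data.
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Larger executors; the driver heap must instead be set via
        # spark-submit --driver-memory, since the driver JVM is already running here.
        .config("spark.executor.memory", "8g")
        # More shuffle partitions spread the sort work across more,
        # smaller tasks, which can reduce per-task spilling to disk.
        .config("spark.sql.shuffle.partitions", "800")
        .getOrCreate())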

Re: Spark job running for long time

2019-04-17 Thread Yeikel
We need more information about your job to be able to help you. Please share some snippets or the overall idea of what you are doing.

Spark job running for long time

2019-04-17 Thread rajat kumar
Hi all, one of my containers has been running for a long time. The logs show "Thread 240 spilling sort data of 10.4 GB to disk", and this is happening every minute. Thanks, Rajat

Re: Boto3 library send to pyspark

2019-04-17 Thread Sebastian Schere
Unsubscribe. On Wed, 17 Apr 2019 at 07:43 Gorka Bravo Martinez <gorka.bravo.marti...@cern.ch> wrote: …

RE: Boto3 library send to pyspark

2019-04-17 Thread Gorka Bravo Martinez
Hi Gourav, do you mean setting a different Python environment while running pyspark? Cheers, Gorka.

Re: Boto3 library send to pyspark

2019-04-17 Thread Gourav Sengupta
Hi, there is addPyFile, and then there is the Python environment itself; try searching for how to use Python package managers like Canopy and conda. Regards, Gourav. On Wed, Apr 17, 2019 at 8:50 AM Gorka Bravo Martinez <gorka.bravo.marti...@cern.ch> wrote: …
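A minimal sketch of the addPyFile route, with an illustrative archive name and path (not from the thread): the library and its pure-Python dependencies are zipped so that the package directories sit at the archive root, and the zip is shipped to the executors.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Ship the bundled library to every executor and put it on the Python
    # path; addPyFile accepts .py and .zip files, so whole packages can be
    # sent this way, not just single modules.
    sc.addPyFile("/path/to/boto3_bundle.zip")

    import boto3  # imported after addPyFile so the archive is on sys.path

One caveat: libraries that read data files from their install directory (botocore does this for its service definitions) may not work straight out of a zip, which is where the Python-environment route Gourav mentions, for example a packed conda environment shipped to the cluster, tends to be the more robust option.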

Boto3 library send to pyspark

2019-04-17 Thread Gorka Bravo Martinez
Hi all, I would like to ship the boto/boto3 library while running pyspark in YARN client mode; how can I do this? I am aware that sc.addFile() can add a .py file; is it the same for a library? Cheers, Gorka.