Hi Tomas,
One option is to cache your table as Parquet files in Alluxio (which can
serve as an in-memory distributed caching layer for Spark in your case).
The Spark code would look something like:
> df.write.parquet("alluxio://master:19998/data.parquet")
> df = spark.read.parquet("alluxio://master:19998/data.parquet")
Hi,
I am not sure about using a different environment, but it is something you
can look into as well. Why would you need different environments for your job?
Regards,
Gourav
On Wed, Apr 17, 2019 at 11:43 AM Gorka Bravo Martinez <
gorka.bravo.marti...@cern.ch> wrote:
> Hi Gourav,
>
> you mean by setting a different python environment while running pyspark?
Please share the results of df.explain() [1] for both. That should give us
some clues about what the differences are.
[1]https://github.com/apache/spark/blob/e1c90d66bbea5b4cb97226610701b0389b734651/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L499
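For example, something along these lines (a minimal sketch; df_fast and
df_slow are placeholders for your two versions):

    # print the parsed, analyzed, optimized and physical plans for each
    df_fast.explain(True)
    df_slow.explain(True)

Comparing the physical plans should show where the two runs diverge.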
How can I figure out if the data is skewed? Are there some statistics I
can check?
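For example, would counting rows per join key be a reasonable first check?
Something like this (the column name is just a guess):

    # rows per key; a few very large groups would suggest skew
    df.groupBy("key").count().orderBy("count", ascending=False).show(20)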
On Wed, Apr 17, 2019 at 20:12, Yeikel wrote:
> It is hard to tell, but your data may be skewed.
Can you share the output of df.explain()?
It is hard to tell, but your data may be skewed.
Hi,
Thanks for the response!
We are doing 12 left outer joins. I also see GC colored red in the Spark
UI, so GC seems to be taking time as well.
We have tried Kryo serialization, and tried giving more memory to both the
executors and the driver, but it didn't work.
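Would broadcasting the smaller tables help in our case? A sketch of what I
mean (the DataFrame names here are made up, not our real code):

    from pyspark.sql.functions import broadcast

    # fact_df and dim_tables stand in for our real inputs
    result = fact_df
    for dim_df, key in dim_tables:
        # broadcasting the small side avoids shuffling the big table per join
        result = result.join(broadcast(dim_df), on=key, how="left_outer")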
On Wed, 17 Apr 2019, 23:35 Yeikel wrote:
We need more information about your job to be able to help you. Please share
some snippets or the overall idea of what you are doing.
Hi All,
One of my containers has been running for a long time.
The logs show "Thread 240 spilling sort data of 10.4 GB to disk", and this
message appears every minute.
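Would raising the shuffle partition count reduce the per-task sort size?
E.g. something like this (the value is only a guess, not what we run):

    from pyspark.sql import SparkSession

    # more shuffle partitions -> smaller sorts per task, less spilling
    spark = (SparkSession.builder
             .config("spark.sql.shuffle.partitions", "2000")
             .getOrCreate())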
Thanks
Rajat
On Wed, 17 Apr 2019 at 07:43 Gorka Bravo Martinez <
gorka.bravo.marti...@cern.ch> wrote:
> Hi Gourav,
>
> you mean by setting a different python environment while running pyspark?
>
> Cheers, Gorka.
Hi Gourav,
you mean by setting a different python environment while running pyspark?
Cheers, Gorka.
From: Gourav Sengupta [gourav.sengu...@gmail.com]
Sent: 17 April 2019 10:06
To: Gorka Bravo Martinez
Cc: user@spark.apache.org
Subject: Re: Boto3 library
Hi,
there is addPyFile, and then there is the Python environment itself; try
searching for Python package managers like Canopy and conda.
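For the addPyFile route, a rough sketch (the zip path here is made up; note
that packages which load bundled data files, like botocore's JSON models,
may not always import cleanly from a zip):

    # ship a zipped copy of the package to every executor
    sc.addPyFile("/path/to/boto3_deps.zip")

    # afterwards, code running in tasks can import it
    import boto3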
Regards,
Gourav
On Wed, Apr 17, 2019 at 8:50 AM Gorka Bravo Martinez <
gorka.bravo.marti...@cern.ch> wrote:
> Hi all,
>
> I would like to send the boto/boto3 library while running pyspark in yarn
> client mode; how is it possible?
Hi all,
I would like to send the boto/boto3 library while running pyspark in yarn
client mode; how is it possible?
I am aware sc.addFile() can add a .py file; is it the same for a whole library?
Cheers, Gorka.