Hi Akash,
Glad to know that repartition helped!
The overall number of tasks actually depends on the kind of operations you are
performing and also on how the DataFrame is partitioned.
I can't comment on the former, but I can provide some pointers on the latter.
Default value of spark.sql.shuffle.partitions is 200.
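If it helps, that setting can be changed either at session build time or at runtime. A minimal sketch (the value 24 is purely illustrative, as are the app name and master):

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionTuning extends App {
  // Lower the shuffle partition count from the default of 200 to
  // something closer to 2-3x the cluster's total core count, so each
  // task gets a meaningful slice of data. 24 is an arbitrary example.
  val spark = SparkSession.builder()
    .appName("shuffle-partition-tuning")
    .master("local[*]") // illustrative; drop when submitting to a cluster
    .config("spark.sql.shuffle.partitions", "24")
    .getOrCreate()

  // The same setting can also be changed per job at runtime:
  spark.conf.set("spark.sql.shuffle.partitions", "24")
}
```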
Hi Srinath,
Thanks for such an elaborate reply. How do I reduce the number of overall
tasks?
I found that simply repartitioning the csv file into 8 parts and
converting it to parquet with snappy compression helped not only in even
distribution of the tasks across all nodes, but also helped in bringi
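For anyone following along, the repartition-and-convert step described above might look roughly like this (the partition count of 8 and the snappy codec come from the mail; the paths and everything else are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet extends App {
  val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

  // Hypothetical input path; header/inferSchema options are assumptions.
  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/input.csv")

  // Snappy is the default parquet codec in recent Spark versions;
  // set explicitly here for clarity. Output path is hypothetical.
  df.repartition(8)
    .write
    .option("compression", "snappy")
    .parquet("/data/input_parquet")
}
```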
Hi Aakash,
Can you check the logs for Executor ID 0? It was restarted on worker
192.168.49.39, perhaps due to an OOM or something.
Also, I observed that the number of tasks is high and unevenly distributed
across the workers.
Check if there are too many partitions in the RDD and tune it using
spark.sql.shuffle.partitions.
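A quick way to check the partition count before tuning it (a sketch; the DataFrame here is a stand-in for the real data):

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck extends App {
  val spark = SparkSession.builder().appName("partition-check").getOrCreate()

  val df = spark.range(1000000L).toDF("id") // stand-in for the real data

  println(s"partitions: ${df.rdd.getNumPartitions}")

  // If the count is far above 2-3x the cluster's total cores, reduce it:
  // coalesce avoids a full shuffle, repartition performs one.
  val tuned = df.coalesce(24) // illustrative target
}
```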
Yes, but when I increased my executor memory, the Spark job halted
after running a few steps, even though the executor wasn't dying.
Data - 60,000 data-points, 230 columns (60 MB data).
Any input on why it behaves like that?
On Tue, Jun 12, 2018 at 8:15 AM, Vamshi Talla wrote:
Aakash,
Like Jorn suggested, did you increase your test data set? If so, did you also
update your executor-memory setting? It seems like you might be exceeding the
executor memory threshold.
Thanks
Vamshi Talla
Sent from my iPhone
On Jun 11, 2018, at 8:54 AM, Aakash Basu wrote:
Hi Jorn/Others,
Thanks for your help. Now the data is being distributed in a proper way, but
the challenge is that after a certain point I'm getting this error, after
which everything stops moving ahead -
2018-06-11 18:14:56 ERROR TaskSchedulerImpl:70 - Lost executor 0 on
192.168.49.39: Remote RPC cli
If it is in the kB range then Spark will always schedule it to one node. As soon as it
gets bigger you will see usage of more nodes.
Hence, increase your test dataset.
> On 11. Jun 2018, at 12:22, Aakash Basu wrote:
>
> Jorn - The code is a series of feature engineering and model tuning
> operations
Try:
--num-executors 3 --executor-cores 4 --executor-memory 2G --conf
spark.scheduler.mode=FAIR
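Besides passing spark.scheduler.mode=FAIR on spark-submit as above, FAIR mode can be set in code, and concurrent jobs can be assigned to scheduler pools per thread. A sketch (the pool name "etl" is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object FairScheduling extends App {
  // FAIR mode lets concurrent jobs within one application share
  // executors instead of queueing FIFO behind each other.
  val spark = SparkSession.builder()
    .appName("fair-scheduling")
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()

  // Jobs submitted from this thread run in the named pool:
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")
}
```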
On Mon, Jun 11, 2018 at 2:43 PM, Aakash Basu
wrote:
> Hi,
>
> I have submitted a job on a *4 node cluster*, where I see most of the
> operations happening at one of the worker nodes and the other two are s
What is your code? Maybe it does an operation which is bound to a single
host, or your data volume is too small for multiple hosts.
> On 11. Jun 2018, at 11:13, Aakash Basu wrote:
Hi,
I have submitted a job on a *4 node cluster*, where I see most of the
operations happening at one of the worker nodes while the other two are simply
chilling out.
Picture below puts light on that -
How to properly distribute the load?
My cluster conf (4 node cluster [1 driver; 3 slaves]) -
*Cores
Thanks for your reply.
It is 64GB per node. We will try using UseParallelGC.
From: CPC [mailto:acha...@gmail.com]
Sent: Thursday, April 26, 2018 11:44 PM
To: vincent gromakowski
Cc: Pallavi Singh ; user
Subject: Re: Spark Optimization
I would recommend UseParallelGC since this is a batch job. Parallelization
should be 2-3x the number of cores. Also, if those are physical machines I would
recommend an MTU of 9000 on the network. Is it 128 GB per node or 64 GB per node?
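UseParallelGC is a JVM flag rather than a Spark setting, so it is passed through the executor's extra Java options. A sketch (it would usually be given on spark-submit instead; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object GcTuning extends App {
  // Equivalent to:
  //   --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
  // on spark-submit. Must be set before executors launch.
  val spark = SparkSession.builder()
    .appName("gc-tuning")
    .config("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")
    .getOrCreate()
}
```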
On Thu, Apr 26, 2018, 7:40 PM vincent gromakowski <
vincent.gromakow...@gmail.com> wr
Ideal parallelization is 2-3x the number of cores. But it depends on the number
of partitions of your source and the operations you use (shuffle or not). It
can be worth paying the extra cost of an initial repartition to match your
cluster, but it clearly depends on your DAG.
Optimizing spark apps depend
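As a worked example of the 2-3x guideline above (the cluster shape here is invented):

```scala
object ParallelismRuleOfThumb extends App {
  val executors = 3        // illustrative cluster shape
  val coresPerExecutor = 4
  val totalCores = executors * coresPerExecutor // 12

  // The 2-3x guideline gives a target partition range of 24-36:
  val lower = totalCores * 2
  val upper = totalCores * 3
  println(s"target partitions: $lower-$upper")

  // Paying the one-off cost of an initial repartition up front:
  // val balanced = df.repartition(upper)
}
```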
Hi Team,
We are currently working on a POC based on Spark and Scala.
We have to read 18 million records from a parquet file and perform the 25
user-defined aggregations based on grouping keys.
We have used the Spark high-level DataFrame API for the aggregation. On a cluster of
two nodes we could finish end t
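A grouped multi-aggregation of that shape might look roughly like this with the DataFrame API (the path, column names, and grouping keys are all invented; only a few of the 25 aggregations are shown):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationPoc extends App {
  val spark = SparkSession.builder().appName("agg-poc").getOrCreate()

  // Hypothetical schema standing in for the 18M-row parquet source:
  val df = spark.read.parquet("/data/records")

  val result = df
    .groupBy("key1", "key2") // illustrative grouping keys
    .agg(
      sum("amount").as("total_amount"),
      avg("amount").as("avg_amount"),
      countDistinct("user_id").as("distinct_users")
      // ...remaining aggregations would be listed here
    )
}
```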
Why those two stages in apache spark are computing same thing?
<http://stackoverflow.com/questions/40192302/why-those-two-stages-in-apache-spark-are-computing-same-thing>
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-optimization-tp28034.htm
Hi,
I have a query regarding spark stage optimization. I have asked the
question in more detail at Stackoverflow, please find the following link:
http://stackoverflow.com/questions/40192302/why-is-that-two-stages-in-apache-spark-are-computing-same-thing
that shows those capabilities, which you can
find here: https://ibm.app.box.com/s/vyaedlyb444a4zna1215c7puhxliqxdg
There is a blog post which gives more details on the functionality here:
www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/
If you want to guarantee that the side effects happen, you should use foreach or
foreachPartition. A `take`, for example, might only evaluate a subset of
the partitions until it finds enough results.
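The contrast above can be sketched as follows (a minimal local example; the data is invented):

```scala
import org.apache.spark.sql.SparkSession

object SideEffects extends App {
  val spark = SparkSession.builder().appName("side-effects").getOrCreate()
  val rdd = spark.sparkContext.parallelize(1 to 100, 4)

  // Guaranteed to touch every partition, so side effects always run:
  rdd.foreachPartition { it => it.foreach(x => println(x)) }

  // NOT guaranteed: take(5) may satisfy itself from the first
  // partition and never evaluate the rest, skipping side effects
  // placed in upstream transformations for those partitions.
  val firstFive = rdd.take(5)
}
```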
On Wed, Aug 12, 2015 at 7:06 AM, Eugene Morozov wrote:
Hi!
I’d like to complete an action (store / print something) inside of a transformation (map
or mapPartitions). This approach has some flaws, but there is a question: might
it happen that Spark will optimise (RDD or DataFrame) processing so that my
mapPartitions simply won’t happen?
--
Eugene Morozov
send map output locations for shuffle 0 to sp...@spark-s4.test.org:34546
Best regards,
Morbious