Hi,
In all of my (PySpark) Spark jobs that become somewhat more involved, I am
experiencing the issue that some stages take a very long time to
complete, and some never finish at all. This clearly correlates with the
size of my input data. Looking at the stage details for one such stage,
I am wondering where Spark spends all this time. Take this table of the
stage's task metrics as an example:
Metric                      Min          25th pct     Median       75th pct     Max
Duration                    1.4 min      1.5 min      1.7 min      1.9 min      2.3 min
Scheduler Delay             1 ms         3 ms         4 ms         5 ms         23 ms
Task Deserialization Time   1 ms         2 ms         3 ms         8 ms         22 ms
GC Time                     0 ms         0 ms         0 ms         0 ms         0 ms
Result Serialization Time   0 ms         0 ms         0 ms         0 ms         1 ms
Getting Result Time         0 ms         0 ms         0 ms         0 ms         0 ms
Input Size / Records        23.9 KB / 1  24.0 KB / 1  24.1 KB / 1  24.1 KB / 1  24.3 KB / 1
Why is the overall duration almost 2 minutes per task? Where is all this
time spent when no progress of the stage is visible? The progress bar
simply shows 0 succeeded tasks for a very long time before it sometimes
slowly starts to advance.
Also, the stage shown above is named `javaToPython at null:-1`, which I
find very uninformative. I don't even know exactly which action is
responsible for this stage. Does anyone experience similar issues or
have any advice for me?
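In case it clarifies what I am after: I would be happy to at least tag the
jobs myself so the UI shows something readable, along the lines of the
minimal sketch below (assuming SparkContext.setJobGroup behaves as
documented; the group ids and descriptions are made up), but that still
doesn't tell me which action produced the `javaToPython` stage:

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-naming-example")
    rdd = sc.parallelize(range(1000), numSlices=8)

    # Label the jobs triggered by the actions below so the Spark UI shows a
    # readable description instead of only the call-site-derived stage name.
    sc.setJobGroup("step-1", "count rows after parallelize")
    print(rdd.count())

    sc.setJobGroup("step-2", "sum of squares")
    print(rdd.map(lambda x: x * x).sum())

    sc.stop()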
Thanks!