Hi,
In all of my (PySpark) Spark jobs that become somewhat more involved, I am
experiencing the issue that some stages take a very long time to
complete, and some never finish at all. This clearly correlates with the
size of my input data. Looking at the stage details for one such stage,
I am wondering where Spark spends all this time. Take this table of the
stage's task metrics as an example:
Metric                      Min          25th pct     Median       75th pct     Max
Duration                    1.4 min      1.5 min      1.7 min      1.9 min      2.3 min
Scheduler Delay             1 ms         3 ms         4 ms         5 ms         23 ms
Task Deserialization Time   1 ms         2 ms         3 ms         8 ms         22 ms
GC Time                     0 ms         0 ms         0 ms         0 ms         0 ms
Result Serialization Time   0 ms         0 ms         0 ms         0 ms         1 ms
Getting Result Time         0 ms         0 ms         0 ms         0 ms         0 ms
Input Size / Records        23.9 KB / 1  24.0 KB / 1  24.1 KB / 1  24.1 KB / 1  24.3 KB / 1
Why is the overall duration almost 2 minutes per task? Where is all this
time spent when no progress of the stage is visible? The progress bar
simply shows 0 succeeded tasks for a very long time before it sometimes
slowly starts to advance.
Also, the stage shown above is named `javaToPython at null:-1`, which I
find very uninformative. I don't even know exactly which action is
responsible for this stage. Does anyone experience similar issues or
have any advice for me?
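In case it clarifies what I am after: I would be happy to at least tag the
jobs myself so the UI shows something readable, along the lines of the
minimal sketch below (assuming SparkContext.setJobGroup behaves as
documented; the group ids and descriptions are made up), but that still
doesn't tell me which action produced the `javaToPython` stage:

    from pyspark import SparkContext

    sc = SparkContext(appName="stage-naming-example")
    rdd = sc.parallelize(range(1000), numSlices=8)

    # Label the jobs triggered by the actions below so the Spark UI shows a
    # readable description instead of only the call-site-derived stage name.
    sc.setJobGroup("step-1", "count rows after parallelize")
    print(rdd.count())

    sc.setJobGroup("step-2", "sum of squares")
    print(rdd.map(lambda x: x * x).sum())

    sc.stop()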
Thanks!