Would you mind posting the code?

On 2 Jun 2015 00:53, "Karlson" <ksonsp...@siberie.de> wrote:
> Hi,
>
> In all (PySpark) Spark jobs that become somewhat more involved, I am
> experiencing the issue that some stages take a very long time to complete,
> and sometimes don't complete at all. This clearly correlates with the size
> of my input data. Looking at the details for one such stage, I am wondering
> where Spark spends all this time. Take this table of the stage's task
> metrics for example:
>
> Metric                      Min          25th percentile  Median       75th percentile  Max
> Duration                    1.4 min      1.5 min          1.7 min      1.9 min          2.3 min
> Scheduler Delay             1 ms         3 ms             4 ms         5 ms             23 ms
> Task Deserialization Time   1 ms         2 ms             3 ms         8 ms             22 ms
> GC Time                     0 ms         0 ms             0 ms         0 ms             0 ms
> Result Serialization Time   0 ms         0 ms             0 ms        0 ms             1 ms
> Getting Result Time         0 ms         0 ms             0 ms         0 ms             0 ms
> Input Size / Records        23.9 KB / 1  24.0 KB / 1      24.1 KB / 1  24.1 KB / 1      24.3 KB / 1
>
> Why is the overall duration almost 2 min? Where is all this time spent
> when no progress of the stage is visible? The progress bar simply displays
> 0 succeeded tasks for a very long time before sometimes slowly progressing.
>
> Also, the name of the stage displayed above is `javaToPython at null:-1`,
> which I find very uninformative. I don't even know which action is
> responsible for this stage. Does anyone experience similar issues or have
> any advice for me?
>
> Thanks!
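Given that scheduler delay, task deserialization, GC, and result serialization are all near zero while each task still runs for nearly two minutes, the remaining time is most likely spent inside the Python workers executing user code, or shuttling data across the JVM-Python boundary. A minimal sketch of one way to check this, assuming an RDD-based job: wrap the partition function with a timer whose output lands in the executor logs. The names `timed` and `process_partition` and the toy data are hypothetical.

    import sys
    import time

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-timing-demo")

    def timed(f):
        # Wrap a partition function so each task reports its wall-clock
        # time to stderr, which ends up in that executor's log.
        def wrapper(iterator):
            start = time.time()
            result = list(f(iterator))  # materialise to time the whole partition
            sys.stderr.write("partition: %d records in %.1f s\n"
                             % (len(result), time.time() - start))
            return iter(result)
        return wrapper

    def process_partition(iterator):
        # Hypothetical stand-in for the job's real per-record work.
        for record in iterator:
            yield record * 2

    rdd = sc.parallelize(range(1000), 8)
    print(rdd.mapPartitions(timed(process_partition)).count())

Materialising each partition into a list makes the measurement cover the whole partition, at the cost of holding it in memory; for very large partitions, timing smaller per-record batches would be safer.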
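As for the stage name: in PySpark, accessing a DataFrame's `.rdd` triggers a JVM-to-Python conversion whose call site sits inside Py4J rather than in user code, and that conversion is what typically surfaces as a `javaToPython` stage with an unhelpful call site like `null:-1`. Below is a small sketch, assuming Spark 1.3+, with hypothetical names and toy data, of reproducing such a stage and labelling the surrounding job with `setJobGroup` so it is identifiable in the web UI.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="stage-naming-demo")
    sqlContext = SQLContext(sc)

    # Hypothetical toy DataFrame standing in for the job's real input.
    df = sqlContext.createDataFrame([Row(key=i, value=i * i) for i in range(100)])

    # Label the next action; the group id and description appear in the
    # web UI, making the otherwise anonymous stage attributable.
    sc.setJobGroup("df-to-rdd-count", "convert DataFrame to RDD and count it")

    # df.rdd converts JVM rows into Python Row objects; in PySpark this
    # conversion is what typically shows up as a `javaToPython` stage.
    print(df.rdd.count())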