I have pretty much the same "symptoms": the computation itself is pretty fast, but most of my job's run time is spent in javaToPython steps (~15 min). I'm using Spark 1.5.0-rc1 with DataFrames and ML Pipelines. Any insight into what exactly these steps are?
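To put a number on the gap in the task metrics table quoted below, here is a small Python sketch. The values are copied from the "Max" column of that table; the variable names are mine and are not part of any Spark API. Summing the per-metric maxima gives an upper bound, since the maxima need not all come from the same task.

```python
# Values copied from the "Max" column of the task metrics table quoted below.
# Names are illustrative only; nothing here is Spark API.
max_duration_ms = 138_000  # Duration, Max: 2.3 min = 138 s

# Overheads that the stage details page breaks out, in milliseconds.
accounted_ms = {
    "scheduler_delay": 23,
    "task_deserialization": 22,
    "gc": 0,
    "result_serialization": 1,
    "getting_result": 0,
}

# Upper bound on the time explained by the listed metrics.
accounted_total_ms = sum(accounted_ms.values())

# Whatever is left is not attributed to any of the listed metrics.
unaccounted_ms = max_duration_ms - accounted_total_ms
unaccounted_share = unaccounted_ms / max_duration_ms

print(f"accounted for: {accounted_total_ms} ms")
print(f"unaccounted:   {unaccounted_ms / 1000:.1f} s ({unaccounted_share:.2%})")
```

So the listed overheads explain well under a tenth of a second of a task that runs for minutes. Presumably the rest is spent inside the task body itself, which these metrics do not break down any further, but that is only my reading of the numbers.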
2015-06-02 9:18 GMT+02:00 Karlson <ksonsp...@siberie.de>:

> Hi, the code is some hundreds lines of Python. I can try to compose a
> minimal example as soon as I find the time, though. Any ideas until then?
>
>> Would you mind posting the code?
>> On 2 Jun 2015 00:53, "Karlson" <ksonsp...@siberie.de> wrote:
>>
>>> Hi,
>>>
>>> In all (pyspark) Spark jobs, that become somewhat more involved, I am
>>> experiencing the issue that some stages take a very long time to complete
>>> and sometimes don't at all. This clearly correlates with the size of my
>>> input data. Looking at the stage details for one such stage, I am
>>> wondering where Spark spends all this time. Take this table of the
>>> stages task metrics for example:
>>>
>>> Metric                      Min              25th percentile  Median           75th percentile  Max
>>> Duration                    1.4 min          1.5 min          1.7 min          1.9 min          2.3 min
>>> Scheduler Delay             1 ms             3 ms             4 ms             5 ms             23 ms
>>> Task Deserialization Time   1 ms             2 ms             3 ms             8 ms             22 ms
>>> GC Time                     0 ms             0 ms             0 ms             0 ms             0 ms
>>> Result Serialization Time   0 ms             0 ms             0 ms             0 ms             1 ms
>>> Getting Result Time         0 ms             0 ms             0 ms             0 ms             0 ms
>>> Input Size / Records        23.9 KB / 1      24.0 KB / 1      24.1 KB / 1      24.1 KB / 1      24.3 KB / 1
>>>
>>> Why is the overall duration almost 2min? Where is all this time spent,
>>> when no progress of the stages is visible? The progress bar simply
>>> displays 0 succeeded tasks for a very long time before sometimes slowly
>>> progressing.
>>>
>>> Also, the name of the stage displayed above is `javaToPython at null:-1`,
>>> which I find very uninformative. I don't even know which action exactly
>>> is responsible for this stage. Does anyone experience similar issues or
>>> have any advice for me?
>>>
>>> Thanks!
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org

--
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94