Re: Spark stages very slow to complete

2015-08-25 Thread Olivier Girardot
I have pretty much the same symptoms: the computation itself is pretty
fast, but most of the run time (~15 min) is spent in JavaToPython steps.
I'm using Spark 1.5.0-rc1 with DataFrames and ML Pipelines.
Any insights into what exactly these steps are?

2015-06-02 9:18 GMT+02:00 Karlson ksonsp...@siberie.de:

 Hi, the code is a few hundred lines of Python. I can try to put together
 a minimal example as soon as I find the time, though. Any ideas until then?


 Would you mind posting the code?
 On 2 Jun 2015 00:53, Karlson ksonsp...@siberie.de wrote:

 Hi,

 In all of my PySpark jobs that become somewhat more involved, I am
 experiencing the issue that some stages take a very long time to complete
 and sometimes do not complete at all. This clearly correlates with the
 size of my input data. Looking at the stage details for one such stage,
 I am wondering where Spark spends all this time. Take this table of the
 stage's task metrics, for example:

 Metric                     Min          25th percentile  Median       75th percentile  Max
 Duration                   1.4 min      1.5 min          1.7 min      1.9 min          2.3 min
 Scheduler Delay            1 ms         3 ms             4 ms         5 ms             23 ms
 Task Deserialization Time  1 ms         2 ms             3 ms         8 ms             22 ms
 GC Time                    0 ms         0 ms             0 ms         0 ms             0 ms
 Result Serialization Time  0 ms         0 ms             0 ms         0 ms             1 ms
 Getting Result Time        0 ms         0 ms             0 ms         0 ms             0 ms
 Input Size / Records       23.9 KB / 1  24.0 KB / 1      24.1 KB / 1  24.1 KB / 1      24.3 KB / 1

 Why is the overall duration almost 2 min? Where is all this time spent
 when no progress of the stage is visible? The progress bar simply
 displays 0 succeeded tasks for a very long time before, sometimes,
 slowly progressing.
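
 To put the numbers in the table side by side, the listed overheads can be
 summed and compared with the task duration. The Python sketch below uses
 only the median column of the table; the variable names are made up for
 illustration, and adding per-metric medians gives only a rough estimate,
 since they need not come from the same task. Under those assumptions the
 listed overheads are a negligible share of the duration, which suggests
 the time is spent inside the task's own execution, a part these metrics
 do not break down further.

```python
# Median values copied from the task metrics table.
# All names here are illustrative, not part of any Spark API.
median_duration_s = 1.7 * 60  # median Duration: 1.7 min

median_overheads_ms = {
    "scheduler_delay": 4,
    "task_deserialization": 3,
    "gc": 0,
    "result_serialization": 0,
    "getting_result": 0,
}

# Rough total of the overheads the UI reports separately, in seconds.
overhead_s = sum(median_overheads_ms.values()) / 1000.0

# Fraction of the median task duration explained by those overheads.
overhead_share = overhead_s / median_duration_s

print(f"overheads: {overhead_s:.3f} s of {median_duration_s:.0f} s")
print(f"share of duration: {overhead_share:.5%}")
```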

 Also, the name of the stage shown above is `javaToPython at null:-1`,
 which I find very uninformative. I don't even know which action exactly
 is responsible for this stage. Does anyone experience similar issues or
 have any advice for me?

 Thanks!

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94


Re: Spark stages very slow to complete

2015-06-02 Thread Karlson
Hi, the code is a few hundred lines of Python. I can try to put together
a minimal example as soon as I find the time, though. Any ideas until
then?






Re: Spark stages very slow to complete

2015-06-01 Thread ayan guha
Would you mind posting the code?