Hi Maciej,

Thanks for the info you provided.

I ran the same example on 1.6 and on the current branch, and recorded the
time spent preparing the executed plan in each iteration.

Current branch:

292 ms
95 ms
57 ms
34 ms
128 ms
120 ms
63 ms
106 ms
179 ms
159 ms
235 ms
260 ms
334 ms
464 ms
547 ms                             
719 ms
942 ms
1130 ms
1928 ms
1751 ms
2159 ms                            
2767 ms
3333 ms
4175 ms
5106 ms
6269 ms
7683 ms
9210 ms
10931 ms
13237 ms
15651 ms
19222 ms
23841 ms
26135 ms
31299 ms
38437 ms
47392 ms
51420 ms
60285 ms
69840 ms
74294 ms

1.6:

3 ms
4 ms
10 ms
4 ms
17 ms
8 ms
12 ms
21 ms
15 ms
15 ms
19 ms
23 ms
28 ms
28 ms
58 ms
39 ms
43 ms
61 ms
56 ms
60 ms
81 ms
73 ms
100 ms
91 ms
96 ms
116 ms
111 ms
140 ms
127 ms
142 ms
148 ms
165 ms
171 ms
198 ms
200 ms
233 ms
237 ms
253 ms
256 ms
271 ms
292 ms
452 ms

Although both take more time with each iteration because the query plan
keeps growing, the current branch clearly takes much more time than the 1.6
branch. The optimizer and query planning in the current branch are much more
complicated than in 1.6.
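
For anyone who wants to take a similar measurement, here is a minimal sketch
for spark-shell on 2.x. It is not necessarily how the numbers above were
collected: `planningTimeMs` is a hypothetical helper, the loop only mimics
the `withColumn` part of the example rather than the full Pipeline, and it
assumes the `spark` session that the shell provides. Accessing
`queryExecution.executedPlan` forces optimization and physical planning
without running a job.

```scala
// Sketch only: assumes a spark-shell session where `spark` is already defined.
import org.apache.spark.sql.DataFrame
import spark.implicits._

// Hypothetical helper: wall-clock time (ms) spent preparing the executed plan.
def planningTimeMs(df: DataFrame): Long = {
  val start = System.nanoTime()
  val plan = df.queryExecution.executedPlan // lazy val; forces planning, runs no job
  (System.nanoTime() - start) / 1000000L
}

val base = Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")

// Grow the plan one column at a time and print the planning cost of each step.
(1 to 40).foldLeft(base) { (df, i) =>
  val next = df.withColumn(s"x$i", $"x0")
  println(s"$i: ${planningTimeMs(next)} ms")
  next
}
```

Running the same snippet on both versions (with `sqlContext.implicits._` on
1.6) should give a rough per-iteration comparison of the planning cost.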


zero323 wrote
> Hi Liang-Chi,
> 
> Thank you for your answer and PR, but I think I wasn't specific
> enough. In hindsight I should have illustrated this better. What really
> troubles me here is the pattern of growing delays. Difference between
> 1.6.3 (roughly 20s runtime since the first job):
> 
> 
> 1.6 timeline
> 
> vs 2.1.0 (45 minutes or so in a bad case):
> 
> 2.1.0 timeline
> 
> The code is just an example and it is intentionally dumb. You can easily
> mask this with caching, or by using significantly larger data sets. So I
> guess the question I am really interested in is: what changed between
> 1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
> current master) to cause this and, more importantly, is it a feature or is
> it a bug? I admit, I chose a lazy path here, and didn't spend much time
> (yet) trying to dig deeper.
> 
> I can see slightly higher memory usage and slightly more intensive GC
> activity, but nothing I would really blame for this behavior, and the duration
> of individual jobs is comparable, slightly in favor of 2.x. Neither
> StringIndexer nor OneHotEncoder changed much in 2.x. They used RDDs for
> fitting in 1.6 and, as far as I can tell, they still do that in 2.x. And
> the problem doesn't look that related to the data processing part in the
> first place.
> 
> 
> On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
>> Hi Maciej,
>>
>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>
>>
>> Liang-Chi Hsieh wrote
>>> Hi Maciej,
>>>
>>> Basically the fitting algorithm in Pipeline is an iterative operation.
>>> Running an iterative algorithm on a Dataset produces RDD lineages and
>>> query plans that grow fast. Without cache and checkpoint, it gets slower
>>> as the number of iterations increases.
>>>
>>> I think that is why a Pipeline with many stages takes much longer to
>>> finish. As I think it is not uncommon to have many stages in a Pipeline,
>>> we should improve this. I will submit a PR for this.
>>>
>>> zero323 wrote
>>>> Hi everyone,
>>>>
>>>> While experimenting with ML pipelines I experience a significant
>>>> performance regression when switching from 1.6.x to 2.x.
>>>>
>>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
>>>>
>>>> val df = (1 to 40).foldLeft(
>>>>   Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
>>>> )((df, i) => df.withColumn(s"x$i", $"x0"))
>>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>>   .setInputCol(c)
>>>>   .setOutputCol(s"${c}_indexed")
>>>>   .setHandleInvalid("skip"))
>>>>
>>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>>   .setInputCol(indexer.getOutputCol)
>>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>>   .setDropLast(true))
>>>>
>>>> val assembler = new VectorAssembler()
>>>>   .setInputCols(encoders.map(_.getOutputCol))
>>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>>
>>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>>
>>>> Task execution time is comparable and the executors are idle most of the
>>>> time, so it looks like a problem with the optimizer. Is it a known
>>>> issue? Are there any changes I've missed that could lead to this
>>>> behavior?
>>>>
>>>> -- 
>>>> Best,
>>>> Maciej
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: 
>>>> dev-unsubscribe@.apache
>>
>>
>>
>>
>> -----
>> Liang-Chi Hsieh | @viirya 
>> Spark Technology Center 
>> http://www.spark.tc/ 
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
> 
> -- 
> Maciej Szymkiewicz
> 
> 
> 
> nM15AWH.png (19K)
> <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/0/nM15AWH.png>
> KHZa7hL.png (26K)
> <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/1/KHZa7hL.png>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20829.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

