Hi Liang-Chi,

Thank you for your answer and the PR, but I think I wasn't specific
enough. In hindsight I should have illustrated this better. What really
troubles me here is the pattern of growing delays. Compare 1.6.3
(roughly 20 seconds of total runtime from the first job):


[image: 1.6 timeline]

vs 2.1.0 (45 minutes or so in a bad case):

[image: 2.1.0 timeline]

The code is just an example, and it is intentionally dumb. You can easily
mask this with caching or with significantly larger data sets. So the
question I am really interested in is: what changed between 1.6.3 and 2.x
(the behavior is more or less consistent across 2.0, 2.1 and current
master) to cause this, and more importantly, is it a feature or a bug? I
admit I chose the lazy path here and haven't spent much time (yet) trying
to dig deeper.
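
For completeness, this is roughly what I mean by masking it with caching:
fit the stages by hand (essentially what Pipeline.fit does) and
materialize the intermediate DataFrame every few stages. Just an untested
sketch - the cacheEvery interval is arbitrary and df / stages are as in
the example quoted below:

import org.apache.spark.ml.{Estimator, PipelineStage, Transformer}
import org.apache.spark.sql.DataFrame

def fitWithCaching(stages: Seq[PipelineStage],
                   df: DataFrame,
                   cacheEvery: Int = 10): Seq[Transformer] = {
  var current = df
  stages.zipWithIndex.map { case (stage, i) =>
    // Fit estimators, pass plain transformers through - same as Pipeline.fit
    val transformer = stage match {
      case e: Estimator[_] => e.fit(current).asInstanceOf[Transformer]
      case t: Transformer  => t
    }
    current = transformer.transform(current)
    // Periodically cache and materialize the intermediate result;
    // this is the kind of thing that masks the slowdown
    if ((i + 1) % cacheEvery == 0) {
      current = current.cache()
      current.count()
    }
    transformer
  }
}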

I can see slightly higher memory usage and somewhat more intensive GC
activity, but nothing I would really blame for this behavior, and the
duration of individual jobs is comparable, if anything slightly in favor
of 2.x. Neither StringIndexer nor OneHotEncoder changed much in 2.x; they
used RDDs for fitting in 1.6 and, as far as I can tell, they still do in
2.x. So the problem doesn't look related to the data processing part in
the first place.
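
If the time indeed goes into analysis and optimization rather than into
running jobs, something along these lines should make it visible (again
only a sketch, spark-shell style, reusing df and indexers from the
example quoted below): apply the indexers one at a time and watch how big
the analyzed plan gets after each stage.

indexers.foldLeft(df) { (acc, indexer) =>
  val next = indexer.fit(acc).transform(acc)
  // Count the nodes of the analyzed logical plan after each stage; on 2.x
  // I would expect this (and the per-stage wall time) to keep growing
  val planNodes = next.queryExecution.analyzed.collect { case n => n }.size
  println(s"${indexer.getOutputCol}: $planNodes plan nodes")
  next
}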


On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
> Hi Maciej,
>
> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>
>
> Liang-Chi Hsieh wrote
>> Hi Maciej,
>>
>> Basically, fitting a Pipeline is an iterative operation. Running an
>> iterative algorithm on a Dataset produces RDD lineages and query plans
>> that grow quickly. Without cache and checkpoint, it gets slower as the
>> number of iterations increases.
>>
>> I think that is why a Pipeline with many stages takes much longer to
>> finish. Since it is not uncommon to have many stages in a Pipeline, we
>> should improve this. I will submit a PR for this.
>> zero323 wrote
>>> Hi everyone,
>>>
>>> While experimenting with ML pipelines I am seeing a significant
>>> performance regression when switching from 1.6.x to 2.x.
>>>
>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
>>> import spark.implicits._  // already in scope when run from spark-shell
>>>
>>> val df = (1 to 40).foldLeft(
>>>   Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
>>> )((df, i) => df.withColumn(s"x$i", $"x0"))
>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>   .setInputCol(c)
>>>   .setOutputCol(s"${c}_indexed")
>>>   .setHandleInvalid("skip"))
>>>
>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>   .setInputCol(indexer.getOutputCol)
>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>   .setDropLast(true))
>>>
>>> val assembler = new VectorAssembler()
>>>   .setInputCols(encoders.map(_.getOutputCol))
>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>
>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>
>>> Task execution times are comparable and the executors are idle most of
>>> the time, so it looks like a problem with the optimizer. Is this a known
>>> issue? Are there any changes I've missed that could lead to this
>>> behavior?
>>>
>>> -- 
>>> Best,
>>> Maciej
>>>
>>>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya 
> Spark Technology Center 
> http://www.spark.tc/ 

-- 
Maciej Szymkiewicz
