Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

Nick Pentreath Wed, 01 Feb 2017 23:39:04 -0800

Hi Maciej

If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames*, then
that seems to point to some other underlying issue as the root cause.


Even though adding checkpointing should help, we should still understand why
the behavior differs between 1.6 and 2.0.
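
For reference, here is a minimal sketch of how one might check whether a
checkpoint actually shortens the query plan for a DataFrame built like the one
in the original report. It is only an illustration under assumptions, not the
change proposed in the PR: it assumes Spark 2.1 or later (where
Dataset.checkpoint() is available), a local master, and a temporary checkpoint
directory. The names PlanGrowthSketch, planSize and measure are made up for
this example.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PlanGrowthSketch {
  // Number of operators in the analyzed logical plan of a DataFrame.
  def planSize(df: DataFrame): Int =
    df.queryExecution.analyzed.collect { case node => node }.size

  // Builds a DataFrame with `n` chained withColumn calls and returns the
  // plan size before and after an eager checkpoint.
  def measure(spark: SparkSession, n: Int): (Int, Int) = {
    import spark.implicits._

    // Assumption: a throwaway local directory is acceptable for checkpoints.
    if (spark.sparkContext.getCheckpointDir.isEmpty) {
      spark.sparkContext.setCheckpointDir(
        Files.createTempDirectory("spark-checkpoint").toString)
    }

    val base = Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
    val grown = (1 to n).foldLeft(base)((df, i) => df.withColumn(s"x$i", col("x0")))

    // checkpoint() materializes the data and returns a Dataset whose plan
    // no longer carries the chain of projections that produced it.
    val truncated = grown.checkpoint()
    (planSize(grown), planSize(truncated))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("plan-growth-sketch")
      .getOrCreate()

    val (before, after) = measure(spark, 40)
    println(s"plan nodes before checkpoint: $before, after: $after")
    spark.stop()
  }
}
```

Comparing the two numbers on 1.6 is not possible with this exact code, since
Dataset.checkpoint() does not exist there, but the planSize part alone could
help show how quickly the plan grows with the number of columns.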


On Thu, 2 Feb 2017 at 08:22 Liang-Chi Hsieh wrote:

>
> Hi Maciej,
>
> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>
>
> Liang-Chi Hsieh wrote
> > Hi Maciej,
> >
> > Basically, the fitting algorithm in Pipeline is an iterative operation.
> > Running an iterative algorithm on a Dataset produces RDD lineages and
> > query plans that grow quickly. Without caching and checkpointing, it gets
> > slower as the number of iterations increases.
> >
> > I think this is why a Pipeline with many stages takes much longer to
> > finish. Since I think it is not uncommon to have many stages in a
> > Pipeline, we should improve this. I will submit a PR for this.
> >
> > zero323 wrote
> >> Hi everyone,
> >>
> >> While experimenting with ML pipelines I noticed a significant
> >> performance regression when switching from 1.6.x to 2.x.
> >>
> >> import org.apache.spark.ml.{Pipeline, PipelineStage}
> >> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
> >>
> >> // Three rows, with the string column x0 copied into x1..x40.
> >> val df = (1 to 40).foldLeft(
> >>   Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
> >> )((df, i) => df.withColumn(s"x$i", $"x0"))
> >>
> >> val indexers = df.columns.tail.map(c => new StringIndexer()
> >>   .setInputCol(c)
> >>   .setOutputCol(s"${c}_indexed")
> >>   .setHandleInvalid("skip"))
> >>
> >> val encoders = indexers.map(indexer => new OneHotEncoder()
> >>   .setInputCol(indexer.getOutputCol)
> >>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
> >>   .setDropLast(true))
> >>
> >> val assembler = new VectorAssembler()
> >>   .setInputCols(encoders.map(_.getOutputCol))
> >> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
> >>
> >> new Pipeline().setStages(stages).fit(df).transform(df).show
> >>
> >> Task execution time is comparable and executors are idle most of the
> >> time, so it looks like a problem with the optimizer. Is it a known
> >> issue? Are there any changes I've missed that could lead to this
> >> behavior?
> >>
> >> --
> >> Best,
> >> Maciej
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
