Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527

Benchmark against the multi-column one-hot encoder.

- Multi-Col, Multiple Run: the first commit; runs a separate `treeAggregate` per column.
- Multi-Col, Single Run: runs one `treeAggregate` over all columns, as suggested at https://github.com/apache/spark/pull/19527#discussion_r145457081.

Fitting (average seconds over 10 runs):

numColumns | Multi-Col, Multiple Run | Multi-Col, Single Run
-- | -- | --
1 | 0.11003638430000003 | 0.12968824099999998
100 | 3.6879334635000007 | 0.36438897839999995
1000 | 90.3695017947 | 2.4687475008

Transforming (average seconds over 10 runs):

numColumns | Multi-Col, Multiple Run | Multi-Col, Single Run
-- | -- | --
1 | 0.14080461019999999 | 0.1434849307
100 | 0.3636357813 | 0.41459606969999996
1000 | 3.1933874685 | 2.8026313985

Benchmark code:

```scala
import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 10000
val m = 1000

// Generate n rows of m random integer columns in [0, 1000).
val rows = sc.parallelize(1 to n).map(i => Row(Array.fill(m)(random.nextInt(1000)): _*))
val struct = new StructType(Array.range(0, m, 1).map(i => StructField(s"c$i", IntegerType, true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()

val inputCols = Array.range(0, m, 1).map(i => s"c$i")
val outputCols = Array.range(0, m, 1).map(i => s"c${i}_encoded")
val encoder = new OneHotEncoderEstimator()
  .setInputCols(inputCols)
  .setOutputCols(outputCols)

var durationFitting = 0.0
var durationTransforming = 0.0
for (i <- 0 until 10) {
  val startFitting = System.nanoTime()
  val model = encoder.fit(df)
  val endFitting = System.nanoTime()
  durationFitting += (endFitting - startFitting) / 1e9

  val startTransforming = System.nanoTime()
  model.transform(df).count
  val endTransforming = System.nanoTime()
  durationTransforming += (endTransforming - startTransforming) / 1e9
}
println(s"fitting: ${durationFitting / 10}")
println(s"transforming: ${durationTransforming / 10}")
```
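To illustrate why the single-run variant scales so much better for fitting, here is a minimal, Spark-free sketch (not the actual encoder implementation; `maxPerColumnMultiPass` and `maxPerColumnSinglePass` are hypothetical names). The "Multiple Run" approach scans the data once per column, so the total work grows with the number of columns times the data size, while the "Single Run" approach folds all per-column statistics (here, the maximum category index) in one pass, mirroring a single `treeAggregate` over all columns:

```scala
// Sketch of the two aggregation strategies, assuming fitting reduces to
// finding the largest category index seen in each column.
object AggregationSketch {
  type DataRow = Array[Int]

  // "Multiple Run": one full scan of the data per column (m scans total).
  def maxPerColumnMultiPass(rows: Seq[DataRow], numCols: Int): Array[Int] =
    Array.tabulate(numCols)(c => rows.map(_(c)).max)

  // "Single Run": all column maxima accumulated in a single fold,
  // analogous to one treeAggregate whose accumulator covers every column.
  def maxPerColumnSinglePass(rows: Seq[DataRow], numCols: Int): Array[Int] =
    rows.foldLeft(Array.fill(numCols)(Int.MinValue)) { (acc, row) =>
      var c = 0
      while (c < numCols) { acc(c) = math.max(acc(c), row(c)); c += 1 }
      acc
    }
}
```

Both return the same result; the single-pass version simply avoids re-reading (and, in the distributed case, re-shuffling and re-scheduling) the dataset once per column, which is consistent with the roughly flat Single Run fitting times above.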