Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/20229#discussion_r161153997 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -230,16 +231,17 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String) val encodedTerms = resolvedFormula.terms.map { case Seq(term) if dataset.schema(term).dataType == StringType => val encodedCol = tmpColumn("onehot") - var encoder = new OneHotEncoder() - .setInputCol(indexed(term)) - .setOutputCol(encodedCol) // Formula w/o intercept, one of the categories in the first category feature is // being used as reference category, we will not drop any category for that feature. if (!hasIntercept && !keepReferenceCategory) { - encoder = encoder.setDropLast(false) + encoderStages += new OneHotEncoderEstimator(uid) + .setInputCols(Array(indexed(term))) + .setOutputCols(Array(encodedCol)) + .setDropLast(false) --- End diff -- There is at most 1 encoder with `dropLast(false)`, the next line sets `keepReferenceCategory = true` to ensure we won't take this code path for the remaining columns.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org