Github user MrBago commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20229#discussion_r161153997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala ---
    @@ -230,16 +231,17 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
         val encodedTerms = resolvedFormula.terms.map {
           case Seq(term) if dataset.schema(term).dataType == StringType =>
             val encodedCol = tmpColumn("onehot")
    -        var encoder = new OneHotEncoder()
    -          .setInputCol(indexed(term))
    -          .setOutputCol(encodedCol)
             // Formula w/o intercept, one of the categories in the first category feature is
             // being used as reference category, we will not drop any category for that feature.
             if (!hasIntercept && !keepReferenceCategory) {
    -          encoder = encoder.setDropLast(false)
    +          encoderStages += new OneHotEncoderEstimator(uid)
    +            .setInputCols(Array(indexed(term)))
    +            .setOutputCols(Array(encodedCol))
    +            .setDropLast(false)
    --- End diff --
    
    There is at most one encoder with `setDropLast(false)`: the next line sets
    `keepReferenceCategory = true`, which ensures we won't take this code path
    for the remaining columns.
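
    To make the control flow concrete, here is a minimal standalone sketch in
    plain Scala with no Spark dependency. `DropLastSketch`, `EncoderSpec`, and
    `planEncoders` are hypothetical names made up for illustration and are not
    part of RFormula. The sketch assumes every other string column keeps the
    encoder default of `dropLast = true`.

```scala
object DropLastSketch {
  // Hypothetical stand-in for an encoder's configuration; not a Spark class.
  final case class EncoderSpec(inputCol: String, dropLast: Boolean)

  // Walks the string terms in order. The guard can only succeed once,
  // because the flag is flipped inside the branch it protects.
  def planEncoders(stringTerms: List[String], hasIntercept: Boolean): List[EncoderSpec] = {
    var keepReferenceCategory = false
    stringTerms.map { term =>
      if (!hasIntercept && !keepReferenceCategory) {
        // First categorical feature of a no-intercept formula: keep every category.
        keepReferenceCategory = true
        EncoderSpec(term, dropLast = false)
      } else {
        // Assumed default for all remaining columns.
        EncoderSpec(term, dropLast = true)
      }
    }
  }
}
```

    With `hasIntercept = false` and terms `a`, `b`, `c`, only `a` gets
    `dropLast = false`. With `hasIntercept = true`, no column does.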


---
