[GitHub] [spark] MaxGekk opened a new pull request #28388: [SPARK-31553][SQL] Revert "[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection"

GitBox Tue, 28 Apr 2020 00:18:54 -0700


MaxGekk opened a new pull request #28388:
URL: https://github.com/apache/spark/pull/28388



   ### What changes were proposed in this pull request?
   This reverts commit 5631a96367d2576e1e0f95d7ae529468da8f5fa8.
   
   ### Why are the changes needed?
   PR https://github.com/apache/spark/pull/25754 introduced a bug in `isInCollection`. For example, when the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (the default):
   ```scala
   val set = (0 to 20).map(_.toString).toSet
   val data = Seq("1").toDF("x")
   data.select($"x".isInCollection(set).as("isInCollection")).show()
   ```
   The function must return `true` because "1" is in the set "0" ... "20", but it returns `false`:
   ```
   +--------------+
   |isInCollection|
   +--------------+
   |         false|
   +--------------+
   ```
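   
   For running the reproduction outside `spark-shell`, a minimal standalone sketch could look like the following. The object name `IsInCollectionCheck`, the local master setting, and setting the threshold explicitly are assumptions made for illustration and are not part of this patch. On a build that contains the bug the assertion is expected to fail; with the revert applied it should pass.
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col

   // Hypothetical standalone check; not part of the patch itself.
   object IsInCollectionCheck {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]") // assumption: a local run is sufficient here
         .appName("isInCollection-check")
         .getOrCreate()
       import spark.implicits._

       try {
         // Pin the threshold to the value quoted in the description.
         spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "10")

         // 21 string values, i.e. more elements than the threshold.
         val values = (0 to 20).map(_.toString).toSet
         val data = Seq("1").toDF("x")

         val result = data
           .select(col("x").isInCollection(values).as("isInCollection"))
           .head()
           .getBoolean(0)

         // "1" is a member of the collection, so the expected result is true.
         assert(result, "isInCollection returned false for a value that is in the collection")
       } finally {
         spark.stop()
       }
     }
   }
   ```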
   
   ### Does this PR introduce any user-facing change?
   Yes. After the revert, `isInCollection` returns `true` for the example above instead of `false`.
   
   ### How was this patch tested?
   ```
   $ ./build/sbt "test:testOnly *ColumnExpressionSuite"
   ```
   


