[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22408 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219039607

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23))"),

--- End diff --

@cloud-fan Yes, it should :-) I think I had changed this test case to verify the fix to tightestCommonType and pushed it by mistake. Sorry about that.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219039187

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |

--- End diff --

Ah then it's fine, we don't need to change anything here.
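The behavior change quoted in the table above can be modeled outside Spark. The following is a hedged, self-contained Python sketch, not Spark's implementation: `array_contains_v23` mimics the 2.3-era rule of casting the search value to the array's element type (which can silently lose the fractional part), while `array_contains_v24` mimics the safer 2.4-style rule of promoting both sides to a common wider type. The function names and the use of `float` as the common type are illustrative assumptions.

```python
# Hypothetical model of the coercion bug under discussion -- not Spark code.

def array_contains_v23(arr, needle):
    """Spark 2.3-style: cast the needle to the array's element type (lossy)."""
    elem_type = type(arr[0])          # e.g. int for array(1)
    return elem_type(needle) in arr   # int(1.34) == 1 -> spurious match

def array_contains_v24(arr, needle):
    """Spark 2.4-style: promote both sides to a common wider type (double here)."""
    return any(float(x) == float(needle) for x in arr)

# Models SELECT array_contains(array(1), 1.34D);
print(array_contains_v23([1], 1.34))  # True  -- the wrong 2.3 answer
print(array_contains_v24([1], 1.34))  # False -- the corrected 2.4 answer
```

The sketch shows why the promotion direction matters: casting the needle down to `int` manufactures a match that does not exist in the data, while widening both sides preserves the comparison.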
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219038798

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23))"),

--- End diff --

Hmm? Shouldn't this fail because of the bug?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219038350

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),

--- End diff --

This query doesn't read any data from `df`, so the two result rows are always the same. Can we use `OneRowRelation` here?
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037732

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |

--- End diff --

@cloud-fan Yeah, Presto gives an error. Please refer to my earlier comment showing the Presto output. Did you want anything changed in the description?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037301

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |

--- End diff --

If Presto doesn't do it, we should follow it.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037194

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |

--- End diff --

We can promote `int` to `string`, but I'm not sure that's a common behavior in other databases.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037086

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |

--- End diff --

Remove `both`.
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219028470

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), 1.34); |

--- End diff --

@cloud-fan OK.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219019453

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), 1.34); |

--- End diff --

Can we remove this example, or pick a different one that can work? Basically it should work if we fix the bug with https://github.com/apache/spark/pull/22448
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r218295350

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), 1.34); | true | AnalysisException is thrown since integer type can not be promoted to decimal type in a loss-less manner. | |

--- End diff --

I left a few comments. Please send a PR, thanks!
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r218227611

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), 1.34); | true | AnalysisException is thrown since integer type can not be promoted to decimal type in a loss-less manner. | |

--- End diff --

@cloud-fan I have taken a stab at fixing this at https://github.com/dilipbiswal/spark/tree/find_tightest May I request you to please take a look? I have run all the tests and they look OK. If you are okay, I will open a PR for this, and once that gets in, we can look at this PR.
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r218125861

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), 1.34); | true | AnalysisException is thrown since integer type can not be promoted to decimal type in a loss-less manner. | |

--- End diff --

@cloud-fan Yes. Actually, I think the logic to handle int and decimal promotion can be improved. I like Presto's behaviour here:

    int, decimal(3, 2)  => decimal(12, 2)
    int, decimal(4, 3)  => decimal(13, 3)
    int, decimal(11, 3) => decimal(14, 3)
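The loss-less int/decimal widening quoted above can be sketched with a small formula: a 32-bit integer needs up to 10 decimal digits, so the common type of `int` and `decimal(p, s)` must keep at least `max(p - s, 10)` integer digits plus the original `s` fraction digits. This is a hedged Python illustration (not Presto's or Spark's actual code), and the formula is an assumption: it reproduces the first two Presto results quoted above, though not necessarily every case.

```python
# Illustrative sketch of loss-less widening between int and decimal(p, s).
# Not taken from Presto or Spark source; the formula is an assumption.

INT_DIGITS = 10  # decimal digits needed to hold any 32-bit integer

def widen_int_decimal(p, s):
    """Common type of int and decimal(p, s), as a (precision, scale) pair:
    keep enough integer digits for both sides, plus the decimal's scale."""
    return (max(p - s, INT_DIGITS) + s, s)

print(widen_int_decimal(3, 2))  # (12, 2) -- matches the quoted Presto output
print(widen_int_decimal(4, 3))  # (13, 3) -- matches the quoted Presto output
```

The point of the rule is that neither side loses information: every int fits in 10 integer digits, and the decimal keeps all of its original scale.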
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r218052897

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---

@@ -1331,23 +1331,27 @@ case class ArrayContains(left: Expression, right: Expression)
   @transient private lazy val ordering: Ordering[Any] = TypeUtils.getInterpretedOrdering(right.dataType)

-  override def inputTypes: Seq[AbstractDataType] = right.dataType match {
-    case NullType => Seq.empty
-    case _ => left.dataType match {
-      case n @ ArrayType(element, _) => Seq(n, element)
+  override def inputTypes: Seq[AbstractDataType] = {
+    (left.dataType, right.dataType) match {
+      case (_, NullType) => Seq.empty
+      case (ArrayType(e1, hasNull), e2) =>
+        TypeCoercion.findTightestCommonType(e1, e2) match {

--- End diff --

I think we have a bug in `findTightestCommonType`. For an int and a decimal, `findTightestCommonType` can return a value if the decimal's precision is bigger than int's, but it can't return a value if the decimal's precision is small.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r218043239

--- Diff: docs/sql-programming-guide.md ---

@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), 1.34); | true | AnalysisException is thrown since integer type can not be promoted to decimal type in a loss-less manner. | |

--- End diff --

Hmm, I think we can convert int to `DecimalType(10, 0)` without losing precision?
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r217603789

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,44 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 1.23)")

--- End diff --

@ueshin Sure. We could use `findWiderCommonType`. My thinking was that since we are injecting this cast implicitly, we should pick the safest cast so we don't see data-dependent surprises. Users can always specify an explicit cast and take responsibility for the result :-) However, I don't have a strong opinion. I will change it to use `findWiderCommonType`.
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r217601459

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,44 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 1.23)")

--- End diff --

Hmm, if we can do `array_contains(array(1), '1')` in 2.3 as https://github.com/apache/spark/pull/22408#issuecomment-421223501, we should use `findWiderCommonType`?
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r217593872

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,44 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 1.23)")

--- End diff --

Good point. Yes, it can do a lossy conversion. It seems `BinaryComparison` uses the wider DecimalType, so we could follow that behavior and use `findWiderTypeWithoutStringPromotion`. cc @gatorsmile @cloud-fan
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r217287140

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,44 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 1.23)")

--- End diff --

@ueshin FYI, I just pushed the changes addressing your comments. I have kept `findTightestCommonType` as is for now. I will change it based on your input and adjust the tests accordingly.
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r217279564

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,44 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 1.23)")

--- End diff --

@ueshin Thank you. I also thought of using `findWiderTypeWithoutStringPromotion` but later changed it to `findTightestCommonType`. One question I had: can `findWiderTypeWithoutStringPromotion` do a lossy conversion? From the class description, it seems only `findTightestCommonType` does absolutely safe casting. Please let me know what you think.
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r217277655

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,44 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 1.23)")

--- End diff --

As for the behavior, we might be better off using `findWiderTypeWithoutStringPromotion` instead of `findTightestCommonType`. And as for checking the error message, just those you added.
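The trade-off between the two coercion strategies discussed here can be sketched as a toy model. This is a hedged Python illustration only, not Spark's `TypeCoercion` code: the type names, the lossless-pair table, and the numeric ranking are all simplifying assumptions. "Tightest" admits only promotions where neither side can lose information, while "wider without string promotion" also allows lossy numeric widening but still refuses to coerce between numbers and strings.

```python
# Toy model of the two strategies discussed above -- not Spark's TypeCoercion.

# Pairs where one type can hold every value of the other without loss
# (e.g. a double represents every 32-bit int exactly).
LOSSLESS = {frozenset(("int", "long")): "long",
            frozenset(("int", "double")): "double"}

NUMERIC_ORDER = ["int", "long", "decimal", "double"]  # simplified ranking

def find_tightest_common_type(t1, t2):
    """Return a common type only when the promotion cannot lose information."""
    if t1 == t2:
        return t1
    return LOSSLESS.get(frozenset((t1, t2)))  # None when any loss is possible

def find_wider_type_without_string_promotion(t1, t2):
    """Allow lossy numeric widening, but never promote across string/number."""
    if "string" in (t1, t2):
        return t1 if t1 == t2 else None
    if t1 in NUMERIC_ORDER and t2 in NUMERIC_ORDER:
        return max(t1, t2, key=NUMERIC_ORDER.index)  # may lose precision
    return None

print(find_tightest_common_type("int", "decimal"))                 # None
print(find_wider_type_without_string_promotion("int", "decimal"))  # decimal
print(find_wider_type_without_string_promotion("int", "string"))   # None
```

Under this model, `array_contains(array(1), 1.23)` fails analysis with the tightest strategy (no safe common type for int and a small-precision decimal) but type-checks under the wider strategy, which matches the direction of the discussion above.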
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r217274643

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---

@@ -735,6 +735,44 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    intercept[AnalysisException] {
+      df.selectExpr("array_contains(array(1), 1.23)")

--- End diff --

@ueshin Did you want me to check the error message for all the cases, or just a couple of them?
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22408#discussion_r217274487

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -1331,26 +1331,31 @@ case class ArrayContains(left: Expression, right: Expression)
       @transient private lazy val ordering: Ordering[Any] =
         TypeUtils.getInterpretedOrdering(right.dataType)

    -  override def inputTypes: Seq[AbstractDataType] = right.dataType match {
    -    case NullType => Seq.empty
    -    case _ => left.dataType match {
    -      case n @ ArrayType(element, _) => Seq(n, element)
    +  override def inputTypes: Seq[AbstractDataType] = {
    +    (left.dataType, right.dataType) match {
    +      case (_, NullType) => Seq.empty
    +      case (ArrayType(e1, hasNull), e2) =>
    +        TypeCoercion.findTightestCommonType(e1, e2) match {
    +          case Some(dt) => Seq(ArrayType(dt, hasNull), dt)
    +          case _ => Seq.empty
    +        }
           case _ => Seq.empty
         }
       }

       override def checkInputDataTypes(): TypeCheckResult = {
    -    if (right.dataType == NullType) {
    -      TypeCheckResult.TypeCheckFailure("Null typed values cannot be used as arguments")
    -    } else if (!left.dataType.isInstanceOf[ArrayType]
    -        || !left.dataType.asInstanceOf[ArrayType].elementType.sameType(right.dataType)) {
    -      TypeCheckResult.TypeCheckFailure(
    -        "Arguments must be an array followed by a value of same type as the array members")
    -    } else {
    -      TypeUtils.checkForOrderingExpr(right.dataType, s"function $prettyName")
    +    (left.dataType, right.dataType) match {
    +      case (_, NullType) =>
    +        TypeCheckResult.TypeCheckFailure("Null typed values cannot be used as arguments")
    +      case (ArrayType(e1, _), e2) if e1.sameType(e2) =>
    +        TypeUtils.checkForOrderingExpr(right.dataType, s"function $prettyName")
    +      case _ => TypeCheckResult.TypeCheckFailure(s"Input to function $prettyName should have " +
    +        s"been ${ArrayType.simpleString} followed by a value with same element type, but it's " +
    +        s"[${left.dataType.catalogString}, ${right.dataType.catalogString}].")
         }
       }
    +
    --- End diff --

    @ueshin Sure. Sorry, I didn't notice it.
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22408#discussion_r217274391

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
    @@ -735,6 +735,44 @@ (same hunk as quoted above, ending with)
    +    intercept[AnalysisException] {
    +      df.selectExpr("array_contains(array(1), 1.23)")
    --- End diff --

    @ueshin This fails because 1.23 is parsed as decimal(3, 2), and `findTightestCommonType` cannot find a common type between integer and decimal(3, 2). I specifically added this case to draw your attention to it, to see if we are okay with this behaviour :-). If we had used `1.23D`, a common type would have been found.
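The parsing behavior dilipbiswal describes can be sketched with a hypothetical helper (`literal_type` below is purely illustrative, not a Spark API): an exact numeric literal such as `1.23` is typed from its digit counts as `decimal(precision, scale)`, while a `D` suffix forces a double:

```python
# Illustrative model of how a SQL parser types a numeric literal.
# This is a hypothetical helper, not Spark's parser.
def literal_type(text):
    if text.upper().endswith("D"):
        return "double"          # explicit double suffix, e.g. "1.23D"
    if "." in text:
        digits = text.replace("-", "").replace(".", "")
        frac = text.split(".")[1]
        # precision = total significant digits, scale = fractional digits
        return f"decimal({len(digits)},{len(frac)})"
    return "int"
```

So `1.23` becomes `decimal(3,2)`, for which `findTightestCommonType` has no common type with `int`, whereas `1.23D` is simply a `double` and widens cleanly.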
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22408#discussion_r217272407

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
    @@ -735,6 +735,44 @@ (same hunk as quoted above, ending with)
    +    intercept[AnalysisException] {
    +      df.selectExpr("array_contains(array(1), 1.23)")
    --- End diff --

    Also, could you check the error message?
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22408#discussion_r217269231

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -1331,26 +1331,31 @@ (same collectionOperations.scala hunk as quoted above, ending with the added blank line)
    +
    --- End diff --

    nit: Can you revert this change?
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22408#discussion_r217272398

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
    @@ -735,6 +735,44 @@ (same hunk as quoted above, ending with)
    +    intercept[AnalysisException] {
    +      df.selectExpr("array_contains(array(1), 1.23)")
    --- End diff --

    Why does this fail?
GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/22408

    [SPARK-25417][SQL] ArrayContains function may return incorrect result when right expression is implicitly down casted

    ## What changes were proposed in this pull request?
    In ArrayContains, we currently cast the right-hand-side expression to match the element type of the left-hand-side array. This may down-cast the value and return a wrong or questionable result.

    Example:
    ```SQL
    spark-sql> select array_contains(array(1), 1.34);
    true
    ```
    ```SQL
    spark-sql> select array_contains(array(1), 'foo');
    null
    ```

    We should safely coerce both the left- and right-hand-side expressions instead.

    ## How was this patch tested?
    Added tests in DataFrameFunctionsSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark SPARK-25417

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22408.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22408

----
commit 2c6e8dc16c3f7418374dcabea92f5dcf966c5469
Author: Dilip Biswal
Date:   2018-09-12T19:06:14Z

    [SPARK-25417] ArrayContains function may return incorrect result when right expression is implicitly down casted
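The bug in the PR description comes down to which side gets cast. A minimal sketch (plain Python, not Spark's implementation) contrasts the old lossy down-cast with the safer common-type comparison the patch proposes:

```python
# Pre-fix behavior (lossy): cast the searched value down to the array's element
# type. int(1.34) == 1, so array_contains(array(1), 1.34) wrongly returns true.
def array_contains_naive(arr, value, elem_cast):
    return any(x == elem_cast(value) for x in arr)


# Safer behavior: promote BOTH sides to a common wider type before comparing,
# so no information is lost in the cast.
def array_contains_safe(arr, value, common_cast):
    return any(common_cast(x) == common_cast(value) for x in arr)
```

With the array `[1]` and the value `1.34`, the naive version compares `1 == int(1.34)` and answers true, while the safe version compares `1.0 == 1.34` and answers false, matching the Spark 2.4 behavior described in the migration guide.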