[GitHub] spark pull request #21704: [SPARK-24734][SQL] Fix containsNull of Concat for...
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r201919906

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
    @@ -48,7 +48,8 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
       override def dataType: ArrayType = {
    --- End diff --

Because the data type of `CreateArray` itself is not the merged type but `ArrayType`.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r201902635

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
    @@ -48,7 +48,8 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
       override def dataType: ArrayType = {
    --- End diff --

Why doesn't `CreateArray` extend `ComplexTypeMergingExpression`?
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r201902242

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
    @@ -610,27 +636,27 @@ object TypeCoercion {
         // Coalesce should return the first non-null value, which could be any column
         // from the list. So we need to make sure the return type is deterministic and
         // compatible with every child column.
    -    case c @ Coalesce(es) if !haveSameType(es) =>
    +    case c @ Coalesce(es) if !c.areInputTypesForMergingEqual =>
           val types = es.map(_.dataType)
           findWiderCommonType(types) match {
    -        case Some(finalDataType) => Coalesce(es.map(Cast(_, finalDataType)))
    +        case Some(finalDataType) => Coalesce(es.map(castIfNotSameType(_, finalDataType)))
             case None => c
           }

         // When finding wider type for `Greatest` and `Least`, we should handle decimal types even if
         // we need to truncate, but we should not promote one side to string if the other side is
         // string.
    -    case g @ Greatest(children) if !haveSameType(children) =>
    +    case g @ Greatest(children) if !g.areInputTypesForMergingEqual =>
    --- End diff --

These changes are not just about `Concat`; can you update the PR title?
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r201900374

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
    @@ -533,15 +559,15 @@ object TypeCoercion {
         case a @ CreateArray(children) if !haveSameType(children) =>
           val types = children.map(_.dataType)
           findWiderCommonType(types) match {
    -        case Some(finalDataType) => CreateArray(children.map(Cast(_, finalDataType)))
    +        case Some(finalDataType) => CreateArray(children.map(castIfNotSameType(_, finalDataType)))
    --- End diff --

Currently the optimizer doesn't remove the cast when the only difference from the `finalDataType` is the nullability of nested types.
Github user mn-mikke commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200265965

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    --- End diff --

Please ignore the part of my previous comment regarding the `flatten` function. The output data type of `concat`, etc. will be the same regardless of what resolves the `null` flags.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200216296

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

+1 for @mn-mikke's idea
Github user mn-mikke commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200134825

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    --- End diff --

@ueshin For `Concat`, `Coalesce`, etc. it seems to be the case, since a coercion rule is executed if there is any nullability difference at any level of nesting. But that's not the case for the `CaseWhenCoercion` rule, since the `sameType` method is used for comparison. I'm wondering: if the goal is to avoid generating extra `Cast` expressions, shouldn't other coercion rules utilize the `sameType` method as well? If the result of `concat` is subsequently used by `flatten`, wouldn't it lead to extra null-safe checks, as mentioned [here](https://github.com/apache/spark/pull/21704#discussion_r200110924)?
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200127250

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

SGTM
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200125302

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    --- End diff --

I guess the inner nullabilities are coerced during type coercion? If the inner nullabilities are different, type coercion adds casts and they will remain.
Github user mn-mikke commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200116898

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    --- End diff --

Yeah, plus `valueContainsNull` for `MapType` and `nullable` for `StructField`.
Github user mn-mikke commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200116003

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Oh, I see. In that case, it would be nice to introduce a method that resolves the output DataType and merges the `nullable`/`containsNull` flags of non-primitive types recursively for such expressions. In most cases we could encapsulate the function in a new trait. WDYT?
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200115019

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    --- End diff --

Then shall we fix the `containsNull` for the inner array too?
Github user mn-mikke commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200113984

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    --- End diff --

Yes, it should work (see [a test](https://github.com/ueshin/apache-spark/blob/30d5aed41085cfb82e3da43099adf972b953ece8/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala#L951) for it). Did we miss anything?
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200111312

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    --- End diff --

Do we support arrays of arrays in `concat`?
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200110924

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

> what about changing SimplifyCasts rule to start replacing Cast with a new dummy cast expression that will hold only a target data type and won't perform any casting?

This can work, but my point is that we should not add a cast just to change `containsNull`, as it may not match the underlying child expression and would generate unnecessary null-check code.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200060536

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Yeah, `Coalesce` should be fixed, and `Least` and `Greatest` are also suspicious.
Github user mn-mikke commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200054252

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

So far, we've also identified `CaseWhen` and `If`, discussed [here](https://github.com/apache/spark/pull/21687). I've just noticed that `Coalesce` also looks suspicious. What is the key purpose of `SimplifyCasts`: to remove an extra expression node, or to avoid casts between identical types? If the latter, what about changing the `SimplifyCasts` rule to replace `Cast` with a new dummy cast expression that holds only a target data type and doesn't perform any casting?
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200053804

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Currently we've found `CaseWhen` and `If`, as at #21687. At a glance, there seem to be no other such expressions among the collection functions.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200042880

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

I think `SimplifyCasts` is valid, but the concat type-coercion rule is sub-optimal. We should not force all the children to have the same `containsNull`, which may add unnecessary null checks inside `Concat`. BTW, are there more expressions like `Concat` whose `containsNull` we need to fix?
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200026633

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Hmm, that also makes sense. I'm not sure we can remove the simplification, though. cc @gatorsmile @cloud-fan
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200025173

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Aha, I see. But I have a hunch that `SimplifyCasts` shouldn't simplify array casts in some cases, e.g., this concat case. Since we basically cannot change semantics in the optimization phase, I feel a little weird about this simplification. Anyway, I'm OK with your approach because I can't find a better and simpler way to solve this in the analysis phase... Thanks!
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r200024783

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Added a test to show the wrong nullability.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r19346

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Actually, `Concat` for array type has a type-coercion rule that adds casts to make all children the same type, but we also have the `SimplifyCasts` optimization to remove unnecessary casts, which might remove casts from arrays that don't contain null to arrays that do ([optimizer/expressions.scala#L611](https://github.com/apache/spark/blob/d87a8c6c0d1a4db5c9444781160a65562f8ea738/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L611)). E.g., `concat(array(1,2,3), array(4,null,6))` might produce a wrong data type during execution.
Github user maropu commented on a diff in the pull request:
https://github.com/apache/spark/pull/21704#discussion_r14021

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -2007,7 +2007,14 @@ case class Concat(children: Seq[Expression]) extends Expression {
    -  override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)
    +  override def dataType: DataType = {
    +    val dataTypes = children.map(_.dataType)
    +    dataTypes.headOption.map {
    +      case ArrayType(et, _) =>
    +        ArrayType(et, dataTypes.exists(_.asInstanceOf[ArrayType].containsNull))
    +      case dt => dt
    +    }.getOrElse(StringType)
    +  }
    --- End diff --

Can't we handle this case in type coercion (the analysis phase)?
GitHub user ueshin opened a pull request:
https://github.com/apache/spark/pull/21704

[SPARK-24734][SQL] Fix containsNull of Concat for array type.

## What changes were proposed in this pull request?

Currently `Concat` for array type uses the data type of the first child as its own data type, but the children might include an array containing nulls. We should be aware of the nullabilities of all children.

## How was this patch tested?

Modified and added some tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-24734/concat_containsnull

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21704.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21704

commit d87a8c6c0d1a4db5c9444781160a65562f8ea738
Author: Takuya UESHIN
Date: 2018-07-03T11:21:06Z

    Fix containsNull of Concat for array type.
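One way to observe the behavior described in this PR is to inspect the result schema of a concat over arrays whose elements differ in nullability. This is a Spark-shell sketch, not standalone code: it assumes a running `SparkSession` named `spark`, and the exact schema output depends on the Spark version.

```scala
// Spark-shell sketch (assumes a SparkSession `spark`); illustrative only.
// Before this fix, the reported element nullability of the concat result could
// come from the first child alone, even though the second array adds a null.
import org.apache.spark.sql.functions._

val df = spark.range(1).select(
  concat(
    array(lit(1), lit(2), lit(3)),        // elements non-nullable
    array(lit(4), lit(null), lit(6))      // contains a null element
  ).as("c"))

// With this fix, the element's containsNull in the printed schema
// reflects the nullability of ALL children, not just the first.
df.printSchema()
```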