[ https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves updated SPARK-39976:
----------------------------------
    Labels:   (was: corr)

> NULL check in ArrayIntersect adds extraneous null from first param
> ------------------------------------------------------------------
>
>                 Key: SPARK-39976
>                 URL: https://issues.apache.org/jira/browse/SPARK-39976
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Navin Kumar
>            Priority: Blocker
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a
> {{NULL}} value and the second one does not, an extraneous {{NULL}} is present
> in the output. This also means {{array_intersect(a, b) !=
> array_intersect(b, a)}}, which is incorrect: set intersection should be
> commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> data = [(a, b)]
> >>> df = spark.sparkContext.parallelize(data).toDF(["a","b"])
> >>> df.show()
> +---------+------------+
> |        a|           b|
> +---------+------------+
> |[1, 2, 3]|[3, null, 5]|
> +---------+------------+
> >>> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |                  [3]|
> +---------------------+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |            [3, null]|
> +---------------------+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}} and the final
> output is correct: {{[3]}}. In the second case, {{b}} does contain a {{NULL}}
> and is now the first parameter, so the extraneous {{null}} appears in the
> output.
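For reference, the commutative semantics the report expects can be sketched in plain Python. This is only an illustration of the expected behavior, not Spark's ArrayIntersect implementation; the helper name is hypothetical. An element (including NULL) should appear in the result only when it occurs in both inputs, and at most once:

```python
def array_intersect(a, b):
    """Sketch of commutative array_intersect semantics: an element
    (including None/NULL) is kept only if it occurs in BOTH inputs,
    and each distinct element appears at most once."""
    b_set = set(b)   # None is hashable in Python, so NULL needs no special case here
    seen = set()
    out = []
    for x in a:
        if x in b_set and x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Both argument orders agree: no extraneous None leaks from the first argument.
print(array_intersect([1, 2, 3], [3, None, 5]))  # [3]
print(array_intersect([3, None, 5], [1, 2, 3]))  # [3]
```

Under these semantics the two orders can differ only in element ordering, never in membership, which is what the reporter argues a set intersection must guarantee.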
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
>
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
>
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
>
> scala> df.write.parquet("/tmp/simple.parquet")
>
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
>
> scala> df.show()
> +---------------+------------+
> |              a|           b|
> +---------------+------------+
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---------------+------------+
>
> scala> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |            [null, 4]|
> +---------------------+
>
> scala> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |                  [4]|
> +---------------------+
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org