Dan Osipov created SPARK-26366:
----------------------------------

             Summary: Except with transform regression
                 Key: SPARK-26366
                 URL: https://issues.apache.org/jira/browse/SPARK-26366
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.3.2
            Reporter: Dan Osipov


There appears to be a regression between Spark 2.2 and 2.3. Below is the code 
to reproduce it:

 
{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._


val inputDF = spark.sqlContext.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row("0", "john", "smith", "j...@smith.com"),
    Row("1", "jane", "doe", "j...@doe.com"),
    Row("2", "apache", "spark", "sp...@apache.org"),
    Row("3", "foo", "bar", null)
  )),
  StructType(List(
    StructField("id", StringType, nullable=true),
    StructField("first_name", StringType, nullable=true),
    StructField("last_name", StringType, nullable=true),
    StructField("email", StringType, nullable=true)
  ))
)

val exceptDF = inputDF.transform( toProcessDF =>
  toProcessDF.filter(
      (
        col("first_name").isin(Seq("john", "jane"): _*)
          and col("last_name").isin(Seq("smith", "doe"): _*)
      )
      or col("email").isin(List(): _*)
  )
)

inputDF.except(exceptDF).show()
{code}
Output with Spark 2.2:
{noformat}
+---+----------+---------+----------------+
| id|first_name|last_name| email|
+---+----------+---------+----------------+
| 2| apache| spark|sp...@apache.org|
| 3| foo| bar| null|
+---+----------+---------+----------------+{noformat}
Output with Spark 2.3:
{noformat}
+---+----------+---------+----------------+
| id|first_name|last_name| email|
+---+----------+---------+----------------+
| 2| apache| spark|sp...@apache.org|
+---+----------+---------+----------------+{noformat}
Note, changing the last line to 
{code:java}
inputDF.except(exceptDF.cache()).show()
{code}
produces identical output for both Spark 2.3 and 2.2

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to