[jira] [Updated] (SPARK-32693) Compare two dataframes with same schema except nullable property

Takeshi Yamamuro (Jira) Wed, 02 Sep 2020 21:11:18 -0700


     [ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Takeshi Yamamuro updated SPARK-32693:
-------------------------------------
    Fix Version/s:     (was: 3.0.1)
                   3.0.2

> Compare two dataframes with same schema except nullable property
> ----------------------------------------------------------------
>
>                 Key: SPARK-32693
>                 URL: https://issues.apache.org/jira/browse/SPARK-32693
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.1.0
>            Reporter: david bernuau
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> My aim is to compare two dataframes with nearly identical schemas: the same 
> number of fields, with the same names, types and metadata. The only 
> difference is that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> val session = SparkSession.builder().getOrCreate()
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> val schema1 = StructType(Array(StructField("a", IntegerType, false), 
> StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> val schema2 = StructType(Array(StructField("a", IntegerType), 
> StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> println(s"Number of records for first case : ${df1.except(df2).count()}")
> val schema3 = StructType(
>   Array(
>     StructField("a", IntegerType, false),
>     StructField("b", IntegerType, false),
>     StructField("c", IntegerType, false),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType, false),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType, false)
>       ))))
>     ))))
>   )
> )
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 
> 1))))), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> val schema4 = StructType(
>   Array(
>     StructField("a", IntegerType),
>     StructField("b", IntegerType),
>     StructField("c", IntegerType),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType)
>       ))))
>     ))))
>   )
> )
> val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, 
> None, Some(1)))))))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
>  * Consider two dataframes (df1 and df2) that have exactly the same 
> schema, except that the fields are not nullable in the first dataframe and 
> are nullable in the second. In that case, df1.except(df2).count() works well.
>  * Now consider two other dataframes (df3 and df4) that have the same 
> schema (with fields nullable on one side and not on the other). If these two 
> dataframes contain nested fields, then df3.except(df4).count() throws the 
> following exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types
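
For readers on an affected version, one possible workaround is to make both schemas agree on nullability before calling except. The sketch below is not taken from the issue or its fix: NullableNormalizer, asNullable and withNullableSchema are hypothetical names, and it assumes that rebuilding each dataframe from its RDD with a fully nullable schema is acceptable for the comparison.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark's public API.
object NullableNormalizer {

  // Recursively mark every struct field, array element and map value as nullable.
  def asNullable(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = asNullable(f.dataType), nullable = true)
      })
    case ArrayType(elementType, _) =>
      ArrayType(asNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(asNullable(keyType), asNullable(valueType), valueContainsNull = true)
    case other => other
  }

  // Rebuild the dataframe with the relaxed schema; the rows themselves are unchanged.
  def withNullableSchema(session: SparkSession, df: DataFrame): DataFrame =
    session.createDataFrame(df.rdd, asNullable(df.schema).asInstanceOf[StructType])
}
```

With the names from the report, the comparison would then read NullableNormalizer.withNullableSchema(session, df3).except(NullableNormalizer.withNullableSchema(session, df4)).count(). Relaxing to nullable, rather than tightening to non-nullable, is the safe direction because it never asserts something about the data that might not hold.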



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
