[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-32693:
-------------------------------------
    Fix Version/s:     (was: 3.0.1)
                   3.0.2

> Compare two dataframes with same schema except nullable property
> ----------------------------------------------------------------
>
>                 Key: SPARK-32693
>                 URL: https://issues.apache.org/jira/browse/SPARK-32693
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4, 3.1.0
>            Reporter: david bernuau
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 2.4.7, 3.1.0, 3.0.2
>
>
> My aim is to compare two dataframes with very close schemas: same number of
> fields, with the same names, types and metadata. The only difference comes
> from the fact that a given field might be nullable in one dataframe and not
> in the other.
> Here is the code that I used:
> {code:java}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
>
> val session = SparkSession.builder().getOrCreate()
>
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
>
> // First case: flat schemas, non-nullable fields (df1) vs. nullable fields (df2)
> val schema1 = StructType(Array(
>   StructField("a", IntegerType, false),
>   StructField("b", IntegerType, false),
>   StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
>
> val schema2 = StructType(Array(
>   StructField("a", IntegerType),
>   StructField("b", IntegerType),
>   StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
>
> println(s"Number of records for first case : ${df1.except(df2).count()}")
>
> // Second case: nested schemas, non-nullable fields (df3) vs. nullable fields (df4)
> val schema3 = StructType(
>   Array(
>     StructField("a", IntegerType, false),
>     StructField("b", IntegerType, false),
>     StructField("c", IntegerType, false),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType, false),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType, false)
>       ))))
>     ))))
>   )
> )
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
>
> val schema4 = StructType(
>   Array(
>     StructField("a", IntegerType),
>     StructField("b", IntegerType),
>     StructField("c", IntegerType),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType)
>       ))))
>     ))))
>   )
> )
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
>
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
> * Consider two dataframes (df1 and df2) that have exactly the same schema,
>   except that the fields are not nullable in the first dataframe and are
>   nullable in the second. Then df1.except(df2).count() works well.
> * Now consider two other dataframes (df3 and df4) that have the same
>   schema (with fields nullable on one side and not on the other).
>   If these two dataframes contain nested fields, then this time the action
>   df3.except(df4).count throws the following exception:
> java.lang.IllegalArgumentException: requirement failed: Join keys from two
> sides should have same types
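
Until a release containing the fix is available, one possible workaround is to give both dataframes the same nullability before calling except. The sketch below is an illustration only, intended for spark-shell: asNullable and withNullableSchema are hypothetical helper names, not part of Spark's public API, and it assumes that rebuilding each dataframe from its RDD with a relaxed schema is acceptable for the data sizes involved.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper (not a Spark API): recursively mark every struct field,
// array element and map value as nullable, including nested levels.
def asNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f =>
      f.copy(dataType = asNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(asNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(asNullable(keyType), asNullable(valueType), valueContainsNull = true)
  case other => other
}

// Hypothetical helper: rebuild the dataframe on top of the relaxed schema.
def withNullableSchema(session: SparkSession, df: DataFrame): DataFrame = {
  val relaxed = asNullable(df.schema).asInstanceOf[StructType]
  session.createDataFrame(df.rdd, relaxed)
}
{code}

With the names from the report, the comparison would then be written as withNullableSchema(session, df3).except(withNullableSchema(session, df4)).count(). Both sides then carry identical types at every nesting level, so the type check on the join keys should no longer see a mismatch. Relaxing to nullable (rather than tightening to non-nullable) is the safe direction, because it never makes a claim about the data that might be false.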