dataframe df1: schema: StructType(StructField(x,IntegerType,true)) explain: == Physical Plan == MapPartitions <function1>, obj#135: object, [if (input[0, object].isNullAt) null else input[0, object].get AS x#128] +- MapPartitions <function1>, createexternalrow(if (isnull(x#9)) null else x#9), [input[0, object] AS obj#135] +- WholeStageCodegen : +- Project [_1#8 AS x#9] : +- Scan ExistingRDD[_1#8] show: +---+ | x| +---+ | 2| | 3| +---+
dataframe df2: schema: StructType(StructField(x,IntegerType,true), StructField(y,StringType,true)) explain: == Physical Plan == MapPartitions <function1>, createexternalrow(x#2, if (isnull(y#3)) null else y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, object].get, true) AS y#131] +- WholeStageCodegen : +- Project [_1#0 AS x#2,_2#1 AS y#3] : +- Scan ExistingRDD[_1#0,_2#1] show: +---+---+ | x| y| +---+---+ | 1| 1| | 2| 2| | 3| 3| +---+---+ i run: df1.join(df2, Seq("x")).show i get: java.lang.UnsupportedOperationException: No size estimation available for objects. at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41) at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323) at org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87) now sure what changed, this ran about a week ago without issues (in our internal unit tests). it is fully reproducible, however when i tried to minimize the issue i could not reproduce it by just creating data frames in the repl with the same contents, so it probably has something to do with way these are created (from Row objects and StructTypes). best, koert