dataframe df1:
schema:
StructType(StructField(x,IntegerType,true))
explain:
== Physical Plan ==
MapPartitions <function1>, obj#135: object, [if (input[0, object].isNullAt)
null else input[0, object].get AS x#128]
+- MapPartitions <function1>, createexternalrow(if (isnull(x#9)) null else
x#9), [input[0, object] AS obj#135]
   +- WholeStageCodegen
      :  +- Project [_1#8 AS x#9]
      :     +- Scan ExistingRDD[_1#8]
show:
+---+
|  x|
+---+
|  2|
|  3|
+---+


dataframe df2:
schema:
StructType(StructField(x,IntegerType,true), StructField(y,StringType,true))
explain:
== Physical Plan ==
MapPartitions <function1>, createexternalrow(x#2, if (isnull(y#3)) null
else y#3.toString), [if (input[0, object].isNullAt) null else input[0,
object].get AS x#130,if (input[0, object].isNullAt) null else
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType,
fromString, input[0, object].get, true) AS y#131]
+- WholeStageCodegen
   :  +- Project [_1#0 AS x#2,_2#1 AS y#3]
   :     +- Scan ExistingRDD[_1#0,_2#1]
show:
+---+---+
|  x|  y|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
+---+---+


i run:
df1.join(df2, Seq("x")).show

i get:
java.lang.UnsupportedOperationException: No size estimation available for
objects.
at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
at
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
at
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.immutable.List.map(List.scala:285)
at
org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
at
org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87)

now sure what changed, this ran about a week ago without issues (in our
internal unit tests). it is fully reproducible, however when i tried to
minimize the issue i could not reproduce it by just creating data frames in
the repl with the same contents, so it probably has something to do with
way these are created (from Row objects and StructTypes).

best, koert

Reply via email to