Hello, I'm seeing strange behavior where count() on a DataFrame fails with the assertion error shown below, while collect() on the same DataFrame works fine. Everything here is from spark-shell; solrRDD.queryShards() returns a JavaRDD. A stripped-down sketch of the conversion comes first, followed by the full transcript.
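Condensed, the steps are as follows (this is just the transcript below boiled down; solrRDD, sc, query and sqlContext are already defined in my session, so treat them as givens):

import org.apache.spark.sql.RowFactory

// 1. Query Solr; queryShards returns a JavaRDD[SolrDocument], .rdd unwraps it to an RDD
val rdd = solrRDD.queryShards(sc, query, "_version_", 2).rdd

// 2. Derive a StructType (9 fields) describing the query results
val schema = solrRDD.getQuerySchema(query)

// 3. Turn each SolrDocument into a Row by pulling the first value of every schema field
val rows = rdd.map(doc => RowFactory.create(schema.fieldNames.map(f => doc.getFirstValue(f))))

// 4. Build the DataFrame; collect() on it works, count() fails
val df = sqlContext.createDataFrame(rows, schema)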
scala> val rdd = solrRDD.queryShards(sc, query, "_version_", 2).rdd
rdd: org.apache.spark.rdd.RDD[org.apache.solr.common.SolrDocument] = MapPartitionsRDD[3] at flatMap at SolrRDD.java:335

scala> val schema = solrRDD.getQuerySchema(query)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(ApplicationType,StringType,true), StructField(Language,StringType,true), StructField(MfgCode,StringType,true), StructField(OpSystemCode,StringType,true), StructField(ProductCode,StringType,true), StructField(ProductName,StringType,true), StructField(ProductVersion,StringType,true), StructField(_version_,LongType,true), StructField(id,StringType,true))

scala> val rows = rdd.map(doc => RowFactory.create(schema.fieldNames.map(f => doc.getFirstValue(f)))) // Convert RDD[SolrDocument] to RDD[Row]

scala> val df = sqlContext.createDataFrame(rows, schema)

scala> val data = df.collect
data: Array[org.apache.spark.sql.Row] = Array([[Ljava.lang.Object;@2135773a], [[Ljava.lang.Object;@3d2691de], [[Ljava.lang.Object;@2f32a52f], [[Ljava.lang.Object;@25fac8de])

scala> df.count
15/08/26 14:53:28 WARN TaskSetManager: Lost task 1.3 in stage 6.0 (TID 42, 172.19.110.1): java.lang.AssertionError: assertion failed: Row column number mismatch, expected 9 columns, but got 1.
Row content: [[Ljava.lang.Object;@1d962eb2]
        at scala.Predef$.assert(Predef.scala:179)
        at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:140)
        at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
        at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)

Any idea what is wrong here?

Srikanth