Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew
On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust <mich...@databricks.com> wrote:

> > It looks like currently the .count() on parquet is handled incredibly
> > inefficiently and all the columns are materialized. But if I select just
> > the relevant column and then count, then the column-oriented storage of
> > Parquet really shines.
> >
> > There ought to be a potential optimization here such that a .count() on a
> > SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
> > that's expensive. I don't think .count() is handled specially in
> > SchemaRDDs, but it seems ripe for optimization.
>
> Yeah, you are right. Thanks for pointing this out!
>
> If you call .count(), that is just the native Spark count, which is not
> aware of the potential optimizations. We could just override count() in a
> SchemaRDD to be something like
> "groupBy()(Count(Literal(1))).collect().head.getInt(0)".
>
> Here is a JIRA: SPARK-1822 - SchemaRDD.count() should use the
> optimizer. <https://issues.apache.org/jira/browse/SPARK-1822>
>
> Michael
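
For anyone following the thread, the workaround Andrew describes can be sketched roughly like this against the Spark 1.0-era SchemaRDD API. This is only an illustration, not code from the ticket: the Parquet path and the column name `userId` are hypothetical placeholders, and it assumes an already-configured SparkContext.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Assumes a local Spark 1.0-era setup; app name and master are placeholders.
val sc = new SparkContext("local", "parquet-count-demo")
val sqlContext = new SQLContext(sc)
import sqlContext._  // brings the SchemaRDD DSL (select, 'symbol columns) into scope

// Hypothetical Parquet file; any path with a known schema works.
val logs = sqlContext.parquetFile("hdfs:///data/logs.parquet")

// Slow in 1.0: SchemaRDD.count() falls through to the generic RDD count(),
// which re-assembles full rows and so materializes every Parquet column.
val slow = logs.count()

// Faster: project a single column first, so Parquet's columnar layout
// lets Spark scan only that one column's data.
val fast = logs.select('userId).count()
```

The override Michael proposes would make the first form behave like the second automatically, by routing `count()` through the optimizer as an aggregate (`groupBy()(Count(Literal(1)))`) instead of the generic RDD path.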