Thanks for filing -- I'm keeping my eye out for updates on that ticket. Cheers! Andrew
On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust <mich...@databricks.com> wrote:

> > It looks like currently the .count() on parquet is handled incredibly
> > inefficiently and all the columns are materialized. But if I select just
> > the relevant column and then count, then the column-oriented storage of
> > Parquet really shines.
> >
> > There ought to be a potential optimization here such that a .count() on a
> > SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
> > that's expensive. I don't think .count() is handled specially in
> > SchemaRDDs, but it seems ripe for optimization.
>
> Yeah, you are right. Thanks for pointing this out!
>
> If you call .count(), that is just the native Spark count, which is not
> aware of the potential optimizations. We could just override count() in a
> SchemaRDD to be something like
> "groupBy()(Count(Literal(1))).collect().head.getInt(0)".
>
> Here is a JIRA: SPARK-1822 - SchemaRDD.count() should use the
> optimizer. <https://issues.apache.org/jira/browse/SPARK-1822>
>
> Michael
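
For anyone following the thread, the workaround Andrew describes can be sketched roughly like this against the Spark 1.0-era SchemaRDD API. This is only an illustration, not code from the ticket: the Parquet path and the column name `userId` are hypothetical placeholders, and it assumes an already-configured SparkContext.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Assumes a local Spark 1.0-era setup; app name and master are placeholders.
val sc = new SparkContext("local", "parquet-count-demo")
val sqlContext = new SQLContext(sc)
import sqlContext._  // brings the SchemaRDD DSL (select, 'symbol columns) into scope

// Hypothetical Parquet file; any path with a known schema works.
val logs = sqlContext.parquetFile("hdfs:///data/logs.parquet")

// Slow in 1.0: SchemaRDD.count() falls through to the generic RDD count(),
// which re-assembles full rows and so materializes every Parquet column.
val slow = logs.count()

// Faster: project a single column first, so Parquet's columnar layout
// lets Spark scan only that one column's data.
val fast = logs.select('userId).count()
```

The override Michael proposes would make the first form behave like the second automatically, by routing `count()` through the optimizer as an aggregate (`groupBy()(Count(Literal(1)))`) instead of the generic RDD path.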