>
> It looks like .count() on Parquet is currently handled quite
> inefficiently: all of the columns are materialized.  But if I select just
> the relevant column and then count, the column-oriented storage of
> Parquet really shines.
>
> There ought to be an optimization here such that a .count() on a
> SchemaRDD backed by Parquet doesn't require re-assembling the rows, as
> that's expensive.  I don't think .count() is handled specially in
> SchemaRDDs, but it seems ripe for optimization.
>

Yeah, you are right.  Thanks for pointing this out!

If you call .count(), you get the native Spark count, which is not
aware of these potential optimizations.  We could override count() in
SchemaRDD to be something like
"groupBy()(Count(Literal(1))).collect().head.getInt(0)"
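
For illustration, a rough sketch of what that override might look like
inside SchemaRDD (hedged: Count and Literal here are Catalyst
expressions, and whether the aggregate value comes back as an Int or a
Long depends on how Count ends up being typed):

```scala
import org.apache.spark.sql.catalyst.expressions.{Count, Literal}

// Sketch only: route count() through the Catalyst optimizer instead of
// the native RDD count.  The aggregate references no actual columns, so
// a column-pruning Parquet scan never has to re-assemble full rows.
override def count(): Long =
  groupBy()(Count(Literal(1))).collect().head.getInt(0)
```

The key point is that expressing the count as a relational aggregate
lets the planner see that no columns are needed, which the native RDD
count() can never do.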

Here is a JIRA: SPARK-1822 - SchemaRDD.count() should use the
optimizer <https://issues.apache.org/jira/browse/SPARK-1822>

Michael
