Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
Thanks for filing -- I'm keeping my eye out for updates on that ticket.

Cheers!
Andrew

On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust wrote:
> > It looks like currently the .count() on parquet is handled incredibly
> > inefficiently and all the columns are materialized. But if I select

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Michael Armbrust
> It looks like currently the .count() on parquet is handled incredibly
> inefficiently and all the columns are materialized. But if I select just
> that relevant column and then count, then the column-oriented storage of
> Parquet really shines.
>
> There ought to be a potential optimization he
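
The observation quoted above, that counting after selecting a single column is far cheaper than counting with every column materialized, can be illustrated with a toy sketch. This is not Spark, Catalyst, or Parquet code: `ToyColumnStore`, `count_all_columns`, `count_one_column`, and `compare_counts` are hypothetical names invented for the illustration, and "cells read" stands in for I/O cost under the assumption that each column is stored separately.

```python
# Toy column store: each column is held separately, as in a columnar format.
# All names and data here are illustrative, not Spark or Parquet APIs.

class ToyColumnStore:
    def __init__(self, columns):
        self.columns = columns   # dict: column name -> list of values
        self.cells_read = 0      # running total of values touched

    def read_column(self, name):
        """Return one column and record how many values were read."""
        values = self.columns[name]
        self.cells_read += len(values)
        return values

    def count_all_columns(self):
        """Count rows by materializing every column into row tuples."""
        rows = list(zip(*(self.read_column(n) for n in self.columns)))
        return len(rows)

    def count_one_column(self, name):
        """Count rows by reading only a single column."""
        return len(self.read_column(name))


def compare_counts(num_rows=1000, num_cols=20):
    """Return (count, cells read) for the full scan and the one-column scan."""
    columns = {f"c{i}": list(range(num_rows)) for i in range(num_cols)}
    full = ToyColumnStore(columns)
    pruned = ToyColumnStore(columns)
    return (full.count_all_columns(), full.cells_read,
            pruned.count_one_column("c0"), pruned.cells_read)
```

Both paths return the same row count; in this toy model the full scan touches `num_rows * num_cols` values while the one-column scan touches only `num_rows`. Real Parquet scans involve compression, row groups, and metadata that this sketch ignores.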

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-13 Thread Andrew Ash
These numbers were run on git commit 756c96 (a few days after the 1.0.0-rc3 tag). Do you have a link to the patch that avoids scanning all columns for count(*) or count(1)? I'd like to give it a shot.

Andrew

On Mon, May 12, 2014 at 11:41 PM, Reynold Xin wrote:
> Thanks for the experiments an

Re: Preliminary Parquet numbers and including .count() in Catalyst

2014-05-12 Thread Reynold Xin
Thanks for the experiments and analysis! I think Michael already submitted a patch that avoids scanning all columns for count(*) or count(1).

On Mon, May 12, 2014 at 9:46 PM, Andrew Ash wrote:
> Hi Spark devs,
>
> First of all, huge congrats on the parquet integration with SparkSQL! This
> is

Preliminary Parquet numbers and including .count() in Catalyst

2014-05-12 Thread Andrew Ash
Hi Spark devs,

First of all, huge congrats on the parquet integration with SparkSQL! This is an incredible direction forward and something I can see being very broadly useful. I was doing some preliminary tests to see how it works with one of my workflows, and wanted to share some numbers that p