Thanks for filing -- I'll keep an eye out for updates on that ticket.
Cheers!
Andrew
On Tue, May 13, 2014 at 2:40 PM, Michael Armbrust wrote:
> It looks like currently the .count() on parquet is handled incredibly
> inefficiently and all the columns are materialized. But if I select just
> that relevant column and then count, then the column-oriented storage of
> Parquet really shines.
>
> There ought to be a potential optimization he
These numbers were run on git commit 756c96 (a few days after the 1.0.0-rc3
tag). Do you have a link to the patch that avoids scanning all columns for
count(*) or count(1)? I'd like to give it a shot.
Andrew
On Mon, May 12, 2014 at 11:41 PM, Reynold Xin wrote:
Thanks for the experiments and analysis!
I think Michael already submitted a patch that avoids scanning all columns
for count(*) or count(1).
On Mon, May 12, 2014 at 9:46 PM, Andrew Ash wrote:
Hi Spark devs,
First of all, huge congrats on the parquet integration with SparkSQL! This
is an incredible direction forward and something I can see being very
broadly useful.
I was doing some preliminary tests to see how it works with one of my
workflows, and wanted to share some numbers that p