What query did you run? Parquet supports predicate and column pushdown, i.e. if your query only needs to read 3 columns, then only those 3 will be read from disk. Note that the pruning happens per query: column pruning applies when you reference specific columns, not when you do SELECT *.
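For example, a minimal sketch against the Spark 1.x SQLContext API used below (the file path, table name, and column names are placeholders, not from the original thread):

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)

// Load the Parquet file; this only reads the schema/footer, not the data.
val parquetFile = sqlContext.parquetFile("people.parquet")
parquetFile.registerTempTable("people")

// Only the columns referenced in the query are read from disk;
// the remaining columns are pruned at the Parquet level.
val result = sqlContext.sql("SELECT name, age, city FROM people")
```

If the query instead does `SELECT *`, all columns are materialized, which would explain a table with 10 columns taking much longer than one with 3.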
On Mon, Jan 12, 2015 at 10:20 PM, Ajay Srivastava < a_k_srivast...@yahoo.com.invalid> wrote:

> Hi,
> I am trying to read a parquet file using -
>
> val parquetFile = sqlContext.parquetFile("people.parquet")
>
> There is no way to specify that I am interested in reading only some
> columns from disk. For example, if the parquet file has 10 columns, I want
> to read only 3 columns from disk.
>
> We have done an experiment -
> Table1 - Parquet file containing 10 columns
> Table2 - Parquet file containing only the 3 columns which were used in
> the query
>
> The time taken by the query on table1 and table2 shows a huge difference.
> The query on Table1 takes more than double the time taken on table2, which
> makes me think that Spark is reading all the columns from disk in the case
> of table1 when it needs only 3 columns.
>
> How should I make sure that it reads only 3 of the 10 columns from disk?
>
>
> Regards,
> Ajay