Creating RDD from only few columns of a Parquet file

Ajay Srivastava Mon, 12 Jan 2015 22:21:07 -0800

Hi,

I am trying to read a parquet file using -

val parquetFile = sqlContext.parquetFile("people.parquet")


There is no way to specify that I am interested in reading only some columns 
from disk. For example, if the parquet file has 10 columns, I want to read 
only 3 of them from disk.

We have done an experiment -
Table1 - Parquet file containing 10 columns
Table2 - Parquet file containing only the 3 columns which were used in the query

The time taken by the query on Table1 and Table2 shows a huge difference. The 
query on Table1 takes more than double the time taken on Table2, which makes me 
think that Spark is reading all the columns from disk in the case of Table1 
when it needs only 3 columns.
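
A comparison of this kind can be sketched roughly as below. This is only an
illustration, assuming the Spark 1.2 SQLContext API; the file paths, the table
names and the column names c1, c2, c3 are placeholders, not the actual ones.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetColumnTiming {
  // Runs the block once and returns the elapsed wall-clock time in milliseconds.
  def timeMs(block: => Unit): Long = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetColumnTiming"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical paths: table1 has 10 columns, table2 only c1, c2, c3.
    sqlContext.parquetFile("table1.parquet").registerTempTable("table1")
    sqlContext.parquetFile("table2.parquet").registerTempTable("table2")

    // Same projection on both tables; foreach forces every row to be read.
    val t1 = timeMs(sqlContext.sql("SELECT c1, c2, c3 FROM table1").foreach(_ => ()))
    val t2 = timeMs(sqlContext.sql("SELECT c1, c2, c3 FROM table2").foreach(_ => ()))

    println("table1: " + t1 + " ms, table2: " + t2 + " ms")
    sc.stop()
  }
}
```

A single run like this includes start-up and caching effects, so repeating each
query a few times would give a fairer comparison.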

How can I make sure that it reads only 3 of the 10 columns from disk?


Regards,
Ajay
