Setting spark.sql.hive.convertMetastoreParquet to true has fixed this.
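For reference, a minimal sketch of where that setting goes, assuming a HiveContext and a hypothetical metastore Parquet table named "people":

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object ConvertMetastoreParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("convert-metastore-parquet"))
        val hiveContext = new HiveContext(sc)

        // Use Spark SQL's native Parquet support (which does column pruning)
        // instead of the Hive SerDe when reading metastore Parquet tables.
        hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

        // "people" is a hypothetical Hive table backed by Parquet files;
        // only the selected column should now be read from disk.
        hiveContext.sql("SELECT name FROM people").collect().foreach(println)
      }
    }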

Regards,
Ajay

     On Tuesday, January 13, 2015 11:50 AM, Ajay Srivastava 
<a_k_srivast...@yahoo.com.INVALID> wrote:

Hi,

I am trying to read a Parquet file using:

    val parquetFile = sqlContext.parquetFile("people.parquet")

There is no way to specify that I am interested in reading only some of the columns 
from disk. For example, if the Parquet file has 10 columns and I want to read 
only 3 of them from disk.

We ran an experiment:
Table1 - Parquet file containing 10 columns
Table2 - Parquet file containing only the 3 columns used in the query

The query times on Table1 and Table2 show a huge difference. The query on 
Table1 takes more than double the time taken on Table2, which makes me think that 
Spark is reading all the columns from disk for Table1 even though it needs only 
3 of them.

How should I make sure that it reads only 3 of the 10 columns from disk?
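
For illustration, a minimal sketch of selecting only the needed columns so that 
Spark SQL can push column pruning down to the Parquet reader; the column names 
name, age, and city are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ParquetColumnPruning {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-pruning"))
        val sqlContext = new SQLContext(sc)

        // Load the Parquet file and expose it as a table.
        val parquetFile = sqlContext.parquetFile("people.parquet")
        parquetFile.registerTempTable("people")

        // Selecting only the needed columns should let the Parquet reader
        // skip the other column chunks on disk instead of reading all 10.
        val threeCols = sqlContext.sql("SELECT name, age, city FROM people")
        threeCols.collect().foreach(println)
      }
    }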


Regards,
Ajay

