What query did you run? Parquet supports predicate and column pushdown, i.e. if your query only needs to read 3 columns, then only those 3 will be read from disk. Note that the pruning happens per query: column pruning applies when you reference specific columns, not when you do SELECT *.
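For example, a minimal sketch against the Spark 1.x SQLContext API used below (the file path, table name, and column names are placeholders, not from the original thread):

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)

// Load the Parquet file; this only reads the schema/footer, not the data.
val parquetFile = sqlContext.parquetFile("people.parquet")
parquetFile.registerTempTable("people")

// Only the columns referenced in the query are read from disk;
// the remaining columns are pruned at the Parquet level.
val result = sqlContext.sql("SELECT name, age, city FROM people")
```

If the query instead does `SELECT *`, all columns are materialized, which would explain a table with 10 columns taking much longer than one with 3.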
On Mon, Jan 12, 2015 at 10:20 PM, Ajay Srivastava < a_k_srivast...@yahoo.com.invalid> wrote:

> Hi,
> I am trying to read a parquet file using -
>
> val parquetFile = sqlContext.parquetFile("people.parquet")
>
> There is no way to specify that I am interested in reading only some
> columns from disk. For example, if the parquet file has 10 columns, I want
> to read only 3 columns from disk.
>
> We have done an experiment -
> Table1 - Parquet file containing 10 columns
> Table2 - Parquet file containing only the 3 columns which were used in
> the query
>
> The time taken by the query on table1 and table2 shows a huge difference.
> The query on Table1 takes more than double the time taken on table2, which
> makes me think that Spark is reading all the columns from disk in the case
> of table1 when it needs only 3 columns.
>
> How should I make sure that it reads only 3 of the 10 columns from disk?
>
>
> Regards,
> Ajay