Also, just to clarify: it doesn't read the whole table into memory unless you explicitly cache it.
From: Silvio Fiorito <silvio.fior...@granturing.com>
Date: Thursday, January 21, 2016 at 10:02 PM
To: "Balaraju.Kagidala Kagidala" <balaraju.kagid...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: General Question (Spark Hive integration)

Hi Bala,

It depends on how your Hive table is configured. If you used partitioning and you are filtering on a partition column, then it will only load the relevant partitions. If, however, you're filtering on a non-partition column, then it will have to read all the data and then filter as part of the Spark job.

Thanks,
Silvio

From: "Balaraju.Kagidala Kagidala" <balaraju.kagid...@gmail.com>
Date: Thursday, January 21, 2016 at 9:37 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: General Question (Spark Hive integration)

Hi,

I have a simple question regarding Spark Hive integration with DataFrames. When we query a table, does Spark load the whole table into memory and then apply the filter on top of it, or does it load only the data matching the filter?

For example, with the query 'select * from employee where deptno=10', does my RDD load the whole employee table into memory and then apply the filter, or will it load only the dept number 10 data?

Thanks,
Bala
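For what it's worth, the directory-level pruning described above can be sketched without Spark at all. This is a minimal, hypothetical illustration (the `employee`/`deptno` names come from the question; the helper function and file layout are made up): Hive stores a partitioned table as one directory per partition value, so a filter on the partition column lets the planner skip whole directories, while a filter on any other column forces a scan of every file.

```python
import os
import tempfile

def partition_pruned_paths(table_root, partition_col, value):
    """Return only the files under the matching partition directory.

    Sketch of partition pruning: the filter value maps directly to a
    directory name, so non-matching partitions are never even listed.
    """
    part_dir = os.path.join(table_root, f"{partition_col}={value}")
    if not os.path.isdir(part_dir):
        return []
    return sorted(os.path.join(part_dir, f) for f in os.listdir(part_dir))

# Build a toy Hive-style layout: employee/deptno=10, employee/deptno=20
root = tempfile.mkdtemp()
table = os.path.join(root, "employee")
for dept in (10, 20):
    d = os.path.join(table, f"deptno={dept}")
    os.makedirs(d)
    with open(os.path.join(d, "part-00000"), "w") as f:
        f.write(f"rows for dept {dept}\n")

# A query like 'where deptno=10' only ever touches the deptno=10 directory;
# a filter on a non-partition column would have to open both directories.
pruned = partition_pruned_paths(table, "deptno", 10)
print(pruned)
```

Running this prints a single path under `employee/deptno=10`; the `deptno=20` files are never read, which is exactly the saving partition pruning buys you.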