Also, just to clarify: it doesn't read the whole table into memory unless you explicitly cache it.
From: Silvio Fiorito <silvio.fior...@granturing.com>
Date: Thursday, January 21, 2016 at 10:02 PM
To: "Balaraju.Kagidala Kagidala" <balaraju.kagid...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: General Question (Spark Hive integration)

Hi Bala,

It depends on how your Hive table is configured. If you used partitioning and you are filtering on a partition column, then it will only load the relevant partitions. If, however, you're filtering on a non-partition column, then it will have to read all the data and then filter as part of the Spark job.

Thanks,
Silvio

From: "Balaraju.Kagidala Kagidala" <balaraju.kagid...@gmail.com>
Date: Thursday, January 21, 2016 at 9:37 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: General Question (Spark Hive integration)

Hi,

I have a simple question regarding Spark Hive integration with DataFrames. When we query a table, does Spark load the whole table into memory and then apply the filter on top of it, or does it load only the data matching the filter?

For example, with the query 'select * from employee where deptno=10', does my RDD load the whole employee table into memory and then apply the filter, or will it load only the dept number 10 data?

Thanks,
Bala
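For what it's worth, the directory-level pruning described above can be sketched without Spark at all. This is a minimal, hypothetical illustration (the `employee`/`deptno` names come from the question; the helper function and file layout are made up): Hive stores a partitioned table as one directory per partition value, so a filter on the partition column lets the planner skip whole directories, while a filter on any other column forces a scan of every file.

```python
import os
import tempfile

def partition_pruned_paths(table_root, partition_col, value):
    """Return only the files under the matching partition directory.

    Sketch of partition pruning: the filter value maps directly to a
    directory name, so non-matching partitions are never even listed.
    """
    part_dir = os.path.join(table_root, f"{partition_col}={value}")
    if not os.path.isdir(part_dir):
        return []
    return sorted(os.path.join(part_dir, f) for f in os.listdir(part_dir))

# Build a toy Hive-style layout: employee/deptno=10, employee/deptno=20
root = tempfile.mkdtemp()
table = os.path.join(root, "employee")
for dept in (10, 20):
    d = os.path.join(table, f"deptno={dept}")
    os.makedirs(d)
    with open(os.path.join(d, "part-00000"), "w") as f:
        f.write(f"rows for dept {dept}\n")

# A query like 'where deptno=10' only ever touches the deptno=10 directory;
# a filter on a non-partition column would have to open both directories.
pruned = partition_pruned_paths(table, "deptno", 10)
print(pruned)
```

Running this prints a single path under `employee/deptno=10`; the `deptno=20` files are never read, which is exactly the saving partition pruning buys you.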