There have been optimizations in this area, such as:
https://issues.apache.org/jira/browse/SPARK-8125

You can also look at the parent issue.
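The idea the question asks about — persisting the discovered partition/file listing so a restarted job can reload it instead of re-scanning — can be sketched outside Spark in plain Python. This is only an illustration of the caching idea, not a Spark API; `list_files`, `cached_listing`, and the JSON cache path are all hypothetical names for this sketch:

```python
import json
import os

def list_files(root):
    """Walk a partitioned directory tree (e.g. part=1/, part=2/, ...)
    and collect the full path of every data file found."""
    files = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            files.append(os.path.join(dirpath, name))
    return sorted(files)

def cached_listing(root, cache_path):
    """Return the file listing for `root`, reusing a JSON cache on disk
    if one exists, so a restarted process skips the expensive walk."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    listing = list_files(root)
    with open(cache_path, "w") as f:
        json.dump(listing, f)
    return listing
```

The same trade-off applies inside Spark: any cached listing goes stale if files are added or removed, so the cache has to be invalidated (or the table refreshed) after writes.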

Which Spark release are you using?

> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com> 
> wrote:
> 
> 
> Hi,
> 
> I have a Spark table (created from hiveContext) with a couple of hundred 
> partitions and a few thousand files. 
> 
> When I run a query on the table, Spark spends a lot of time (as seen in the 
> pyspark output) collecting these files from the several partitions. Only 
> after this does the query start running. 
> 
> Is there a way to store the object which has collected all these partitions 
> and files, so that every time I restart the job I can load this object 
> instead of spending 50 minutes just collecting the files before the query 
> starts running?
> 
> 
> Please do let me know if the question is not quite clear.
> 
> Regards,
> Gourav Sengupta 
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
