Hi,

I have a Spark table (created from a HiveContext) with a couple of hundred
partitions and a few thousand files.

When I run a query on the table, Spark spends a lot of time (as seen in
the pyspark output) listing these files across the several hundred
partitions. Only after this does the query actually start running.

Is there a way to persist the object that has collected all these partitions
and files, so that every time I restart the job I can load this object
instead of spending about 50 minutes just listing the files before the
query starts?
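To make the scenario concrete, here is a minimal sketch of what the job
does (the database, table, and partition column names below are just
placeholders, not my actual schema):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="partitioned-table-query")
    hiveContext = HiveContext(sc)

    # Before the SELECT below does any work, Spark first lists every
    # file under the table's partition directories. On a fresh start
    # of the job this listing alone takes ~50 minutes.
    df = hiveContext.sql(
        "SELECT * FROM my_db.my_table WHERE part_col = '2016-01-01'")

    df.count()  # only now does the actual query execute

It is this up-front file-listing step, repeated on every restart, that I
would like to save and reload.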


Please let me know if the question is not clear.

Regards,
Gourav Sengupta
