Hi,

I have a Spark table (created from hiveContext) with a couple of hundred partitions and a few thousand files.
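For context, here is a minimal sketch of the kind of setup I mean (the table name events and the partition column dt are purely illustrative, not my actual schema):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hiveContext = HiveContext(sc)

    # Before this query executes, Spark lists every file under each
    # partition directory of the table; that listing is the slow step.
    df = hiveContext.sql("SELECT count(*) FROM events WHERE dt >= '2016-01-01'")
    df.show()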
When I run a query on the table, Spark first spends a lot of time (as seen in the pyspark output) collecting these files from the several hundred partitions; only after that does the query actually start running. Is there a way to persist the object that has collected all these partitions and files, so that every time I restart the job I can load that object instead of spending around 50 minutes just collecting the files before the query starts? Please let me know in case the question is not clear.

Regards,
Gourav Sengupta