Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Jeff Zhang
oh, you are using S3. As I remember, S3 has performance issues when processing a large number of files. On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta wrote: > The HIVE table has a very large number of partitions, around 365 * 5 * 10, and > when I ask the hivemetastore to

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
Hi Jeff, sadly that does not resolve the issue. I am sure that the in-memory mapping of partitions to physical file locations can be saved and recovered in SPARK. Regards, Gourav Sengupta On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang wrote: > oh, you are using S3. As I remember, S3 has

Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
The HIVE table has a very large number of partitions, around 365 * 5 * 10, and when I ask the hivemetastore to start running queries on it (the ones with .count() or .show()), it takes around 2 hours before the job starts in SPARK. On the pyspark screen I can see that it is parsing the S3 locations
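
[Editor's note: a minimal PySpark sketch, not from the thread, of the access pattern described above, using the Spark 1.x HiveContext API. The table name "events" is a placeholder.]

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="partition-lookup-sketch")
sqlContext = HiveContext(sc)

# Resolving the table is metadata-only; no Spark job runs yet.
df = sqlContext.table("events")

# The first action forces the metastore lookup: every partition's
# S3 location has to be resolved before any tasks are scheduled,
# which is the multi-hour delay reported above.
df.count()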

hiveContext: storing lookup of partitions

2015-12-15 Thread Gourav Sengupta
Hi, I have a HIVE table with a few thousand partitions (based on date and time). It takes a long time to run the first time, and then subsequently it is fast. Is there a way to store the cache of partition lookups so that every time I start a new SPARK instance (cannot keep my personal
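
[Editor's note: a hedged sketch of the usage pattern in the question. Within one HiveContext session the partition lookup is only paid on the first action, but a fresh SPARK instance starts with an empty cache; the table name "events" is hypothetical.]

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="partition-cache-question")
hc = HiveContext(sc)

df = hc.sql("SELECT * FROM events")

df.show()   # first action in this session: slow, partition lookup happens here
df.count()  # later actions in the same session: fast, metadata already cached

# Restarting pyspark creates a new SparkContext/HiveContext, so the
# lookup above is repeated from scratch. That repeated cost is what
# the question asks how to avoid.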

Re: hiveContext: storing lookup of partitions

2015-12-15 Thread Jeff Zhang
>>> Currently it takes around 1.5 hours for me just to cache the partition information, and after that I can see that the job gets queued in the SPARK UI. I guess you mean the stage of getting the split info. I suspect it might be an issue with your cluster (or metadata store); usually it won't take