The HIVE table has a very large number of partitions, around 365 * 5 * 10 (roughly 18,250), and when I start running queries on it through the Hive metastore (the ones with .count() or .show()), it takes around 2 hours before the job starts in SPARK.
On the pyspark screen I can see that it is parsing the S3 locations for these 2 hours.

Regards,
Gourav

On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> >>> Currently it takes around 1.5 hours for me just to cache in the
> >>> partition information and after that I can see that the job gets
> >>> queued in the SPARK UI.
>
> I guess you mean the stage of getting the split info. I suspect it might
> be a problem with your cluster (or metadata store); usually splitting does
> not take such a long time.
>
> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a HIVE table with a few thousand partitions (based on date and
>> time). It takes a long time to run the first time, and subsequent runs
>> are fast.
>>
>> Is there a way to store the cache of partition lookups so that every
>> time I start a new SPARK instance (I cannot keep my personal server
>> running continuously), I can immediately restore the temp table in
>> hiveContext without asking it to go and cache the partition lookups
>> again?
>>
>> Currently it takes around 1.5 hours for me just to cache in the
>> partition information, and after that I can see that the job gets
>> queued in the SPARK UI.
>>
>> Regards,
>> Gourav
>
> --
> Best Regards
>
> Jeff Zhang
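For reference, the access pattern under discussion, plus a partition-pruning mitigation, might look roughly like the following in the Spark 1.x PySpark API that was current at the time (HiveContext). The table name (events) and partition column (dt) are hypothetical placeholders; this is a sketch, not the poster's actual code.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="partition-pruning-sketch")
    sqlContext = HiveContext(sc)

    # Touching the whole table makes Spark resolve the S3 location of
    # every partition before the first job can start -- with roughly
    # 365 * 5 * 10 = 18,250 partitions, that resolution is the slow
    # step seen on the pyspark screen.
    df = sqlContext.table("events")

    # Filtering on a partition column before calling an action lets
    # Spark prune partitions, so far fewer S3 locations are listed.
    recent = df.filter("dt >= '2015-12-01' AND dt < '2015-12-16'")
    print(recent.count())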
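On the original question of restoring the temp table without re-caching the partition lookups: HiveContext's in-memory metadata cache does not survive a restart, so one workaround sketch (continuing from the snippet above; the S3 path is hypothetical) is to materialize the needed slice as plain Parquet once, so that later Spark instances read those files directly and never touch the metastore.

    # First session: pay the partition-resolution cost once, then save
    # the pruned slice as Parquet outside the Hive metastore.
    recent.write.parquet("s3://my-bucket/spark-cache/events_recent.parquet")

    # Any later Spark instance: read the Parquet path directly and
    # re-register the temp table -- no partition lookup is involved.
    cached = sqlContext.read.parquet(
        "s3://my-bucket/spark-cache/events_recent.parquet")
    cached.registerTempTable("events_recent")
    sqlContext.sql("SELECT COUNT(*) FROM events_recent").show()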