Oh, you are using S3. As I remember, S3 has performance issues when processing a large number of files.
On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> The HIVE table has a very large number of partitions, around 365 * 5 * 10, and
> when I ask the Hive metastore to start running queries on it (the one with
> .count() or .show()), it takes around 2 hours before the job starts in
> SPARK.
>
> On the pyspark screen I can see that it is parsing the S3 locations for
> these 2 hours.
>
> Regards,
> Gourav
>
> On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> >>> Currently it takes around 1.5 hours for me just to cache in the
>> partition information and after that I can see that the job gets queued in
>> the SPARK UI.
>>
>> I guess you mean the stage of getting the split info. I suspect it might
>> be an issue with your cluster (or the metadata store); usually splitting
>> does not take that long.
>>
>> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a HIVE table with a few thousand partitions (based on date and
>>> time). It takes a long time to run the first time, and then it is fast
>>> on subsequent runs.
>>>
>>> Is there a way to store the cache of partition lookups so that every
>>> time I start a new SPARK instance (I cannot keep my personal server running
>>> continuously), I can immediately restore the temp table in hiveContext
>>> without asking it to go and cache the partition lookups again?
>>>
>>> Currently it takes around 1.5 hours for me just to cache in the
>>> partition information, and after that I can see that the job gets queued in
>>> the SPARK UI.
>>>
>>> Regards,
>>> Gourav
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Best Regards

Jeff Zhang
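[Editor's note: a minimal sketch of the kind of mitigation being discussed, for the Spark 1.x / HiveContext era of this thread. The table name `events` and the partition columns `dt` and `hr` are assumptions for illustration; the setting `spark.sql.hive.metastorePartitionPruning` is a real Spark SQL option that asks the Hive metastore for only the partitions a query needs, rather than listing every partition location (on S3) up front. This requires a running Spark cluster with a Hive metastore, so it is a sketch, not a drop-in fix.]

```python
# Sketch: prune partitions in the metastore instead of listing all of them.
# Assumes a Hive table `events` partitioned by `dt` and `hr` (hypothetical names).
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("partition-pruning-sketch")
        # Push partition predicates down to the Hive metastore, so Spark
        # only fetches/lists the partitions the query actually touches,
        # instead of scanning all ~365 * 5 * 10 S3 partition locations.
        .set("spark.sql.hive.metastorePartitionPruning", "true"))

sc = SparkContext(conf=conf)
hc = HiveContext(sc)

# Filtering on the partition columns lets the metastore prune before
# any S3 listing happens.
df = hc.sql("SELECT * FROM events WHERE dt = '2015-12-16' AND hr = 10")
df.count()
```

This does not persist partition metadata across Spark instances, but it avoids paying the full partition-listing cost for queries that only touch a few partitions.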