Re: hiveContext: storing lookup of partitions
Oh, you are using S3. As I remember, S3 has performance issues when processing a large number of files.

--
Best Regards

Jeff Zhang
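For what it's worth (not suggested anywhere in this thread, and worth verifying against your Spark version), two settings are commonly recommended for slow partition/split resolution against S3-backed Hive tables: pushing partition pruning into the Hive metastore (available from Spark 1.5), and parallelizing the input-path listing that FileInputFormat does when computing splits. A hypothetical spark-defaults.conf fragment:

```properties
# spark-defaults.conf — hypothetical mitigation, verify against your Spark version.
# Prune partitions inside the Hive metastore instead of fetching all of them:
spark.sql.hive.metastorePartitionPruning    true
# List input paths with multiple threads during split computation:
spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads    20
```

Neither setting removes the cost of listing S3 objects; they only reduce how many partitions are resolved and overlap the listing calls.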
Re: hiveContext: storing lookup of partitions
Hi Jeff,

Sadly, that does not resolve the issue. I am sure that the mapping of partitions to physical file locations can be saved and restored in SPARK.

Regards,
Gourav Sengupta
Re: hiveContext: storing lookup of partitions
The HIVE table has a very large number of partitions, around 365 * 5 * 10, and when I ask the hive metastore to start running queries on it (the one with .count() or .show()), it takes around 2 hours before the job even starts in SPARK.

On the pyspark screen I can see that it is parsing the S3 locations for these 2 hours.

Regards,
Gourav
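To put that partition count in perspective, here is a back-of-the-envelope sketch (the latency figure is an assumption for illustration, not a measurement from this thread). Each partition typically requires at least one S3 listing call at planning time, so even modest per-call latency adds up:

```python
# Partition count implied by the layout described above:
# date partitions over 5 years, 10 sub-partitions per day.
days_per_year = 365
years = 5
sub_partitions_per_day = 10

partitions = days_per_year * years * sub_partitions_per_day
print(partitions)  # 18250

# Assume (hypothetically) one S3 LIST call per partition at ~100 ms each.
# Done serially, planning alone would spend roughly this many minutes:
assumed_list_latency_s = 0.1
minutes_serial = partitions * assumed_list_latency_s / 60
print(round(minutes_serial, 1))  # 30.4
```

With retries, per-file listing inside each partition, and metastore round-trips on top, multi-hour planning times are plausible at this scale.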
hiveContext: storing lookup of partitions
Hi,

I have a HIVE table with a few thousand partitions (based on date and time). It takes a long time to run the first time, and subsequent runs are fast.

Is there a way to store the cache of partition lookups so that every time I start a new SPARK instance (I cannot keep my personal server running continuously), I can immediately restore the temp table in hiveContext without asking it to go and cache the partition lookups again?

Currently it takes around 1.5 hours just to cache the partition information, and only after that can I see the job get queued in the SPARK UI.

Regards,
Gourav
Re: hiveContext: storing lookup of partitions
>>> Currently it takes around 1.5 hours for me just to cache in the partition information and after that I can see that the job gets queued in the SPARK UI.

I guess you mean the stage of getting the split info. I suspect it might be an issue with your cluster (or metadata store); usually it won't take such a long time to compute the splits.

--
Best Regards

Jeff Zhang