The HIVE table has a very large number of partitions, around 365 * 5 * 10 (roughly 18,250), and when I start running queries on it through the Hive metastore (the ones with .count() or .show()), it takes around 2 hours before the job starts in SPARK.
On the pyspark screen I can see that it is parsing the S3 locations for these 2 hours.

Regards,
Gourav

On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> >>> Currently it takes around 1.5 hours for me just to cache in the
> >>> partition information and after that I can see that the job gets
> >>> queued in the SPARK UI.
>
> I guess you mean the stage of getting the split info. I suspect it might
> be a problem with your cluster (or metadata store); usually splitting does
> not take such a long time.
>
> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a HIVE table with a few thousand partitions (based on date and
>> time). It takes a long time to run the first time, and subsequent runs
>> are fast.
>>
>> Is there a way to store the cache of partition lookups so that every
>> time I start a new SPARK instance (I cannot keep my personal server
>> running continuously), I can immediately restore the temp table in
>> hiveContext without asking it to go and cache the partition lookups
>> again?
>>
>> Currently it takes around 1.5 hours for me just to cache in the
>> partition information, and after that I can see that the job gets
>> queued in the SPARK UI.
>>
>> Regards,
>> Gourav
>
> --
> Best Regards
>
> Jeff Zhang
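For reference, the access pattern under discussion, plus a partition-pruning mitigation, might look roughly like the following in the Spark 1.x PySpark API that was current at the time (HiveContext). The table name (events) and partition column (dt) are hypothetical placeholders; this is a sketch, not the poster's actual code.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="partition-pruning-sketch")
    sqlContext = HiveContext(sc)

    # Touching the whole table makes Spark resolve the S3 location of
    # every partition before the first job can start -- with roughly
    # 365 * 5 * 10 = 18,250 partitions, that resolution is the slow
    # step seen on the pyspark screen.
    df = sqlContext.table("events")

    # Filtering on a partition column before calling an action lets
    # Spark prune partitions, so far fewer S3 locations are listed.
    recent = df.filter("dt >= '2015-12-01' AND dt < '2015-12-16'")
    print(recent.count())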
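On the original question of restoring the temp table without re-caching the partition lookups: HiveContext's in-memory metadata cache does not survive a restart, so one workaround sketch (continuing from the snippet above; the S3 path is hypothetical) is to materialize the needed slice as plain Parquet once, so that later Spark instances read those files directly and never touch the metastore.

    # First session: pay the partition-resolution cost once, then save
    # the pruned slice as Parquet outside the Hive metastore.
    recent.write.parquet("s3://my-bucket/spark-cache/events_recent.parquet")

    # Any later Spark instance: read the Parquet path directly and
    # re-register the temp table -- no partition lookup is involved.
    cached = sqlContext.read.parquet(
        "s3://my-bucket/spark-cache/events_recent.parquet")
    cached.registerTempTable("events_recent")
    sqlContext.sql("SELECT COUNT(*) FROM events_recent").show()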