Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Jeff Zhang
Oh, you are using S3. As I remember, S3 has performance issues when
processing a large number of files.





-- 
Best Regards

Jeff Zhang


Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
Hi Jeff,

sadly that does not resolve the issue. I am sure that the mapping of
partitions to physical file locations can be saved and recovered in SPARK.


Regards,
Gourav Sengupta
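
For what it's worth, the partition definitions themselves live in the Hive
metastore, so one way to keep them across sessions is to point the metastore
at a persistent database instead of the default embedded Derby instance. A
hedged sketch of the relevant hive-site.xml entries (the host name, database
name, and credentials below are placeholders, not values from this thread):

```xml
<!-- hive-site.xml: point the metastore at a persistent MySQL database
     (host, database name, and credentials are placeholders) -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
```

This keeps partition metadata across restarts, though it does not by itself
avoid the S3 listing that happens when splits are computed.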



Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
The HIVE table has a very large number of partitions, around 365 * 5 * 10,
and when I ask the Hive metastore to start running queries on it (the ones
with .count() or .show()), it takes around 2 hours before the job starts in
SPARK.

On the pyspark screen I can see that it is parsing the S3 locations for
these 2 hours.

Regards,
Gourav



hiveContext: storing lookup of partitions

2015-12-15 Thread Gourav Sengupta
Hi,

I have a HIVE table with a few thousand partitions (based on date and time).
It takes a long time to run the first time, and subsequently it is fast.

Is there a way to store the cache of partition lookups so that every time I
start a new SPARK instance (I cannot keep my personal server running
continuously), I can immediately restore the temptable in hiveContext
without asking it to go and cache the partition lookups again?

Currently it takes around 1.5 hours for me just to cache the partition
information, and after that I can see that the job gets queued in the SPARK
UI.

Regards,
Gourav
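
As far as I know, HiveContext in Spark 1.x has no built-in way to persist
this lookup across instances, but the general idea being asked for — pay the
expensive scan once, save the result to disk, and restore it in a new
session — can be sketched in plain Python. The file path and the listing
function below are stand-ins for whatever actually enumerates the
partitions, not real Spark APIs:

```python
import json
import os

def list_partitions_slowly():
    # Stand-in for the expensive S3/metastore scan that takes ~1.5 hours.
    return ["date=2015-12-%02d/hour=%02d" % (d, h)
            for d in range(1, 4) for h in range(24)]

def load_partitions(cache_path="partition_cache.json"):
    # Restore the lookup from disk if a previous session already paid the cost.
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    partitions = list_partitions_slowly()
    with open(cache_path, "w") as f:
        json.dump(partitions, f)
    return partitions

first = load_partitions()   # slow path: scans, then writes the cache file
second = load_partitions()  # fast path: reads the cache file back
assert first == second
```

The catch is that Spark would still need to trust that the cached listing is
current; a stale cache would silently miss newly added partitions.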


Re: hiveContext: storing lookup of partitions

2015-12-15 Thread Jeff Zhang
>>> Currently it takes around 1.5 hours for me just to cache in the
partition information and after that I can see that the job gets queued in
the SPARK UI.
I guess you mean the stage of getting the split info. I suspect it might be
a cluster issue (or the metadata store); usually it won't take such a long
time to compute the splits.
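
A back-of-the-envelope check suggests the numbers in this thread are at
least consistent with per-partition listing overhead rather than a cluster
fault. The 0.3 s per-directory listing latency below is purely an assumed
figure for illustration, not a measurement from this thread:

```python
# Rough estimate: time to list every partition directory sequentially.
# The 0.3 s per-LIST figure is an assumption, not a measured value.
partitions = 365 * 5 * 10          # 18,250 partition directories
seconds_per_list = 0.3             # assumed listing latency per directory
total_hours = partitions * seconds_per_list / 3600
print(round(total_hours, 2))
```

Sequential listing at that latency lands in the same ballpark as the 1.5-2
hour delay reported, which points at the enumeration of partition locations
rather than the query itself.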

>



-- 
Best Regards

Jeff Zhang