Oh, you are using S3. As I recall, S3 has performance issues when
listing a large number of files.
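For a rough sense of scale: the 365 * 5 * 10 partition layout mentioned below is about 18k S3 prefixes, and if each one needs a sequential LIST call, a couple of hours is plausible. A back-of-the-envelope sketch (the 0.4 s per-call latency here is an assumption, not a measured value):

```python
# Back-of-the-envelope estimate of sequential S3 partition listing.
# The per-LIST latency below is an assumed figure for illustration.

days, years, buckets = 365, 5, 10      # partition dimensions from the thread
partitions = days * years * buckets    # one S3 prefix per partition
list_latency_s = 0.4                   # assumed time per sequential LIST call

total_s = partitions * list_latency_s
print(f"{partitions} partitions -> ~{total_s / 3600:.1f} h of sequential listing")
```

With those assumed numbers this lands right around the 2 hours reported, which is why parallelizing or avoiding the per-prefix listing matters so much on S3.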



On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> The HIVE table has a very large number of partitions, around 365 * 5 * 10,
> and when I ask the hive metastore to start running queries on it (the ones
> with .count() or .show()), it takes around 2 hours before the job starts in
> SPARK.
>
> On the pyspark screen I can see that it is parsing the S3 locations for
> these 2 hours.
>
> Regards,
> Gourav
>
> On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> >>> Currently it takes around 1.5 hours for me just to cache in the
>> partition information and after that I can see that the job gets queued in
>> the SPARK UI.
>> I guess you mean the stage of getting the split info. I suspect it might
>> be a cluster issue (or the metadata store); usually splitting won't take
>> such a long time.
>>
>> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a HIVE table with a few thousand partitions (based on date and
>>> time). A query takes a long time to run the first time, and subsequently
>>> it is fast.
>>>
>>> Is there a way to store the cache of partition lookups so that every
>>> time I start a new SPARK instance (I cannot keep my personal server
>>> running continuously), I can immediately restore the temp table in
>>> hiveContext without asking it to go and cache the partition lookups again?
>>>
>>> Currently it takes around 1.5 hours for me just to cache the partition
>>> information, and after that I can see that the job gets queued in the
>>> SPARK UI.
>>>
>>> Regards,
>>> Gourav
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang
