Hi Jeff,

Sadly that does not resolve the issue. I am sure that the in-memory mapping
of partitions to physical file locations can be saved and recovered in SPARK.
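
A rough sketch of the kind of workaround I mean (pyspark, Spark 1.x-era API;
the table name and S3 path are just placeholders): after the slow first scan,
write the resolved table out once as a consolidated parquet copy, and have
later SPARK instances read that copy directly instead of going back through
the metastore partition lookups.

    from pyspark.sql import HiveContext

    hiveContext = HiveContext(sc)

    # First (slow) session: resolve the partitioned HIVE table once and
    # write a consolidated copy out to parquet on S3.
    df = hiveContext.table("my_partitioned_table")
    df.write.parquet("s3://my-bucket/consolidated/my_partitioned_table")

    # Later sessions: read the consolidated copy directly, skipping the
    # partition lookups against the metastore, and register it for SQL.
    cached = hiveContext.read.parquet("s3://my-bucket/consolidated/my_partitioned_table")
    cached.registerTempTable("my_partitioned_table")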


Regards,
Gourav Sengupta

On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Oh, you are using S3. As I remember, S3 has performance issues when
> processing a large number of files.
>
>
>
> On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> The HIVE table has a very large number of partitions, around 365 * 5 * 10,
>> and when I ask the hive metastore to start running queries on it (the ones
>> with .count() or .show()), it takes around 2 hours before the job starts in
>> SPARK.
>>
>> On the pyspark screen I can see that it is parsing the S3 locations for
>> these 2 hours.
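>>
>> Roughly what I am running (pyspark; the table name is just for
>> illustration):
>>
>>     df = hiveContext.table("my_partitioned_table")
>>     df.count()   # the ~2 hours of parsing S3 partition locations happen
>>                  # before this job even appears in the SPARK UI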
>>
>> Regards,
>> Gourav
>>
>> On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> >>> Currently it takes around 1.5 hours for me just to cache the
>>> partition information, and after that I can see that the job gets queued
>>> in the SPARK UI.
>>> I guess you mean the stage of getting the split info. I suspect it might
>>> be an issue with your cluster (or metadata store); usually it won't take
>>> such a long time to compute the splits.
>>>
>>> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a HIVE table with a few thousand partitions (based on date and
>>>> time). It takes a long time to run the first time, and then subsequently
>>>> it is fast.
>>>>
>>>> Is there a way to store the cache of partition lookups so that every
>>>> time I start a new SPARK instance (I cannot keep my personal server
>>>> running continuously), I can immediately restore the temptable in
>>>> hiveContext without asking it to go and cache the partition lookups again?
>>>>
>>>> Currently it takes around 1.5 hours for me just to cache the partition
>>>> information, and after that I can see that the job gets queued in the
>>>> SPARK UI.
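>>>>
>>>> Roughly what I do today at the start of each new SPARK instance
>>>> (pyspark; the database, table, and temptable names are just
>>>> placeholders):
>>>>
>>>>     from pyspark.sql import HiveContext
>>>>
>>>>     hiveContext = HiveContext(sc)
>>>>     hiveContext.sql("USE my_db")
>>>>     hiveContext.table("my_partitioned_table").registerTempTable("my_temp")
>>>>     # the partition lookups (roughly 1.5 hours against S3) happen before
>>>>     # the first query on my_temp shows up as a job in the SPARK UI
>>>>     hiveContext.sql("SELECT COUNT(*) FROM my_temp").show()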
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
