Hello everyone, 

Thanks for sharing these valuable inputs. I am working on a similar kind of 
task; it would be really helpful if you could share the command for increasing 
the heap size of the hive-cli/launching process. 

Thanks,
Saurabh

Sent from my iPhone, please excuse typos.

> On 18-Jul-2014, at 8:23 pm, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> 
> Unleash ze file crusha!
> 
> https://github.com/edwardcapriolo/filecrush
> 
> 
>> On Fri, Jul 18, 2014 at 10:51 AM, diogo <di...@uken.com> wrote:
>> Sweet, great answers, thanks.
>> 
>> Indeed, I have a small number of partitions, but lots of small files, ~20MB 
>> each. I'll make sure to combine them. Also, increasing the heap size of the 
>> cli process already helped speed it up.
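[Combining the small files can also be driven from Hive itself. A sketch only, 
assuming Hive-side merge settings of that era; property names vary by version, 
and the 256 MB threshold below is an illustrative choice, not a recommendation:]

```shell
# Combine many small input files into fewer splits at read time:
bin/hive -hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

# Or have Hive merge small *output* files at the end of each job,
# set inside a CLI session (268435456 bytes = 256 MB, illustrative):
#   hive> SET hive.merge.mapfiles=true;
#   hive> SET hive.merge.mapredfiles=true;
#   hive> SET hive.merge.smallfiles.avgsize=268435456;
```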
>> 
>> Thanks, again.
>> 
>> 
>>> On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo <edlinuxg...@gmail.com> 
>>> wrote:
>>> The planning phase needs to do work for every Hive partition and every 
>>> Hadoop file. If you have a lot of 'small' files or many partitions, this 
>>> can take a long time. 
>>> Also, the planning phase that happens on the JobTracker is single threaded.
>>> Also, the new YARN stuff requires back and forth to allocate containers. 
>>> 
>>> Sometimes raising the heap for the hive-cli/launching process helps, 
>>> because the default heap of 1 GB may not be enough space to deal with all 
>>> of the partition information; extra memory headroom makes this go faster.
>>> Sometimes setting the min split size higher launches fewer map tasks, 
>>> which speeds everything up.
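[The two knobs described above can be sketched like this. This is a sketch 
under assumptions: the 2 GB heap and 256 MB split size are illustrative values, 
and the exact split-size property name varies by Hadoop version:]

```shell
# Grow the heap of the JVM that the hive CLI launches in (default ~1 GB).
# HADOOP_HEAPSIZE is in MB; HADOOP_CLIENT_OPTS passes raw JVM flags instead.
export HADOOP_HEAPSIZE=2048
export HADOOP_CLIENT_OPTS="-Xmx2g"

# Raise the minimum split size so fewer map tasks get launched
# (value in bytes; 268435456 = 256 MB is an illustrative choice):
bin/hive -hiveconf mapred.min.split.size=268435456
```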
>>> 
>>> So the answer: try to tune everything. Start hive like this:
>>> 
>>> bin/hive -hiveconf hive.root.logger=DEBUG,console
>>> 
>>> And record where the longest gaps with no output are; that is what you 
>>> should try to tune first.
>>> 
>>> 
>>> 
>>> 
>>>> On Fri, Jul 18, 2014 at 9:36 AM, diogo <di...@uken.com> wrote:
>>>> This is probably a simple question, but I'm noticing that for queries that 
>>>> run on 1+ TB of data, it can take Hive up to 30 minutes to actually start 
>>>> the first map-reduce stage. What is it doing? I imagine it's gathering 
>>>> information about the data somehow, since this 'startup' time is clearly a 
>>>> function of the amount of data I'm trying to process.
>>>> 
>>>> Cheers,
> 