Unleash ze file crusha!

https://github.com/edwardcapriolo/filecrush
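
filecrush concatenates small HDFS files into bigger ones; if pulling in another tool is not an option, Hive itself can compact a partition by rewriting it. A minimal sketch, not from the original mail, using the standard `hive.merge.*` settings against a hypothetical partitioned table `events` (table, column, and partition names are made up for illustration):

```shell
# Sketch: compact a partition's many ~20MB files by rewriting it in place.
# The hive.merge.* properties are standard Hive settings; sizes are examples.
hive -e "
  SET hive.merge.mapfiles=true;
  SET hive.merge.mapredfiles=true;
  SET hive.merge.size.per.task=268435456;     -- aim for ~256MB output files
  SET hive.merge.smallfiles.avgsize=67108864; -- merge when avg file < 64MB
  INSERT OVERWRITE TABLE events PARTITION (dt='2014-07-18')
  SELECT col1, col2 FROM events WHERE dt='2014-07-18';
  -- list the non-partition columns explicitly; dt comes from the PARTITION clause
"
```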


On Fri, Jul 18, 2014 at 10:51 AM, diogo <di...@uken.com> wrote:

> Sweet, great answers, thanks.
>
> Indeed, I have a small number of partitions, but lots of small files,
> ~20MB each. I'll make sure to combine them. Also, increasing the heap size
> of the cli process already helped speed it up.
>
> Thanks, again.
>
>
> On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> The planning phase needs to do work for every Hive partition and every
>> Hadoop file. If you have a lot of 'small' files or many partitions, this
>> can take a long time.
>> Also, the planning phase that happens on the job tracker is single
>> threaded.
>> Also, the new YARN stuff requires back-and-forth to allocate containers.
>>
>> Sometimes raising the heap for the hive-cli/launching process helps,
>> because the default heap of 1 GB may not be a lot of space for all of
>> the partition information; more memory headroom makes this go faster.
>> Sometimes setting the min split size higher launches fewer map tasks,
>> which speeds everything up.
>>
>> So the answer... try to tune everything. Start Hive like this:
>>
>> bin/hive -hiveconf hive.root.logger=DEBUG,console
>>
>> and note where the longest gaps with no output are; that is what you
>> should try to tune first.
>>
>>
>>
>>
>> On Fri, Jul 18, 2014 at 9:36 AM, diogo <di...@uken.com> wrote:
>>
>>> This is probably a simple question, but I'm noticing that for queries
>>> that run on 1+TB of data, it can take Hive up to 30 minutes to actually
>>> start the first map-reduce stage. What is it doing? I imagine it's
>>> gathering information about the data somehow; this 'startup' time is
>>> clearly a function of the amount of data I'm trying to process.
>>>
>>> Cheers,
>>>
>>
>>
>
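
Pulling the thread's suggestions together, a hedged sketch of a tuned launch (the heap and split-size values are arbitrary examples, `query.sql` is a placeholder, and the property name is the Hadoop-1-era `mapred.min.split.size` in use when this thread was written):

```shell
# Sketch combining the thread's suggestions; all values are illustrative.
export HADOOP_HEAPSIZE=4096   # raise the hive-cli/launcher heap (MB) above the 1GB default

bin/hive \
  -hiveconf hive.root.logger=DEBUG,console \
  -hiveconf mapred.min.split.size=268435456 \
  -f query.sql                # placeholder for the actual query
```

With DEBUG logging on the console, the long silent stretches show which phase (partition enumeration, split computation, container allocation) is eating the time.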
