Hi,

Can you look up which file each mapper is reading? You can find it in
the status column of a running task in the task UI. Also, which split
property do you mean? Can you post your job's console output?
Also, the recommended approach is to use a splittable format such as
Avro, sequence files, or indexed LZO. That way you don't have to worry
about file sizes and can simply make the files big.
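
To make the split question more concrete, here is a rough sketch of the
split arithmetic as Hadoop's plain FileInputFormat does it: the split size
is max(minSize, min(maxSize, blockSize)), and a file is cut into splits of
that size while the remainder is more than 1.1 times the split size. The
function names are made up for illustration, and this is only a simplified
model. Hive may be using a combining input format on your cluster, so it
will not necessarily reproduce the three mappers you are seeing.

```python
# Simplified model of FileInputFormat split planning (an illustration,
# not Hive's actual code path). Function names are hypothetical.

SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size, max_size):
    """Split size = max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_length, split_size, splittable=True):
    """Number of splits for one file; zero-length files are ignored here."""
    if not splittable:
        # A non-splittable file (e.g. one plain gzip stream) is one split.
        return 1 if file_length > 0 else 0
    remaining = file_length
    splits = 0
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    block_size = 268435456             # block size quoted in the thread
    files = [270044140, 161956948]     # the two file sizes from the listing
    # Assumed max split sizes to try: 256 MB and 128 MB.
    for max_size in (268435456, 134217728):
        size = compute_split_size(block_size, 1, max_size)
        total = sum(count_splits(f, size) for f in files)
        print(f"max split size {max_size}: split size {size}, {total} splits")
```

In this model the first file is only about 0.6% larger than one block, so
the 1.1 slop keeps it in a single split when the split size equals the
block size.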

Best Regards

On Wed, Sep 26, 2012 at 4:12 AM, Connell, Chuck
<chuck.conn...@nuance.com> wrote:
> But remember that you are running on parallel machines. Depending on the
> hardware configuration, more map tasks is BETTER.
>
>
> ________________________________
> From: John Omernik [j...@omernik.com]
> Sent: Tuesday, September 25, 2012 7:11 PM
> To: user@hive.apache.org
> Subject: Re: Hive File Sizes, Merging, and Splits
>
> Isn't there an overhead associated with each map task?  Based on that, my
> hypothesis is that if I pay attention to my data, merge up small files after
> load, and ensure split sizes are close to file sizes, I can keep the number
> of map tasks to an absolute minimum.
>
>
> On Tue, Sep 25, 2012 at 2:35 PM, Connell, Chuck <chuck.conn...@nuance.com>
> wrote:
>>
>> Why do you think the current generated code is inefficient?
>>
>> From: John Omernik [mailto:j...@omernik.com]
>> Sent: Tuesday, September 25, 2012 2:57 PM
>> To: user@hive.apache.org
>> Subject: Hive File Sizes, Merging, and Splits
>>
>>
>>
>> I am really struggling trying to make heads or tails out of how to
>> optimize the data in my tables for best query times.  I have a partition
>> that is compressed (Gzip) RCFile data in two files:
>>
>>
>>
>> total 421877
>>
>> 263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 000000_0
>>
>> 158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 000001_0
>>
>> No matter what I set my split settings to prior to the job, I always get
>> three mappers.  My block size is 268435456 but the setting doesn't seem to
>> change anything. I can set split size huge or small with no apparent effect
>> on the data.
>>
>> I know there are many esoteric items here, but is there any good
>> documentation on setting these things to make my queries on this data more
>> efficient? I am not sure why it needs three map tasks on this data; it
>> should really just grab two mappers. Not to mention, I thought gzip wasn't
>> splittable anyhow.  So, from that standpoint, how does it even send data to
>> three mappers?  If you know of some secret cache of documentation for Hive,
>> I'd love to read it.
>>
>>
>>
>> Thanks
>>
>>
>
>
