Perhaps combining a MultithreadedMapper with CombineFileInputFormat
may help: CombineFileInputFormat reduces the total number of map
tasks, while MultithreadedMapper gives you more threads per map task.
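
Something along these lines, as an untested sketch (MyMapper stands in
for your real mapper class; the thread count is just a starting point
to tune per host):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    // ... inside the job driver:
    Job job = Job.getInstance(new Configuration(), "filter-join");
    // Pack many blocks into fewer, larger splits -> fewer map tasks.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Run the real mapper's map() in several threads inside each task JVM.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, MyMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);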

On Mon, Aug 24, 2015 at 2:16 PM twinkle sachdeva <twinkle.sachd...@gmail.com>
wrote:

> Hi,
>
> We have been using the JVM reuse feature for the same reason: to share
> the same structure across multiple map tasks. A multithreaded map task does
> that partially, since the one copy is shared among its threads.
>
>
> Depending on the available hardware, one can get comparable performance.
>
> Thanks,
>
>
> On Mon, Aug 24, 2015 at 1:37 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> The MultithreadedMapper won't solve your problem: all it does is run
>> parallel map() calls within a single map task JVM, instead of the
>> serial loop of a regular mapper. Your data structure won't be shared
>> across the different map task JVMs on the host, only within one map
>> task's own threads running the map() function over its input records.
>>
>> Wouldn't doing a reduce-side join be much faster for the larger files?
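>>
>> Roughly, as an untested sketch (assumes both files are tab-separated
>> with the join key in the first column; the "fileB" name check is a
>> placeholder convention):
>>
>>     import java.io.IOException;
>>     import java.util.ArrayList;
>>     import java.util.List;
>>     import org.apache.hadoop.io.LongWritable;
>>     import org.apache.hadoop.io.Text;
>>     import org.apache.hadoop.mapreduce.Mapper;
>>     import org.apache.hadoop.mapreduce.Reducer;
>>     import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>>
>>     // Map side: tag every record with the file it came from.
>>     public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
>>       private String tag;
>>
>>       @Override
>>       protected void setup(Context ctx) {
>>         String name = ((FileSplit) ctx.getInputSplit()).getPath().getName();
>>         tag = name.startsWith("fileB") ? "B" : "A";  // placeholder convention
>>       }
>>
>>       @Override
>>       protected void map(LongWritable off, Text line, Context ctx)
>>           throws IOException, InterruptedException {
>>         String[] parts = line.toString().split("\t", 2);
>>         ctx.write(new Text(parts[0]), new Text(tag + "\t" + line));
>>       }
>>     }
>>
>>     // Reduce side: emit A records only for keys that also appear in B.
>>     public class FilterJoinReducer extends Reducer<Text, Text, Text, Text> {
>>       @Override
>>       protected void reduce(Text key, Iterable<Text> vals, Context ctx)
>>           throws IOException, InterruptedException {
>>         List<String> aRecords = new ArrayList<String>();
>>         boolean inB = false;
>>         for (Text v : vals) {
>>           String s = v.toString();
>>           if (s.startsWith("B\t")) inB = true;
>>           else aRecords.add(s.substring(2));  // strip the "A\t" tag
>>         }
>>         if (inB) {
>>           for (String rec : aRecords) ctx.write(key, new Text(rec));
>>         }
>>       }
>>     }
>>
>> Buffering the A records per key is fine as long as keys aren't heavily
>> skewed; a secondary sort that orders the B tag first would avoid the
>> buffering entirely.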
>>
>> On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pedror...@gmail.com>
>> wrote:
>>
>>> I am developing a job whose input path has 30 billion records (File A).
>>> I need to filter these records using another file that can have 30K to
>>> 180M records (File B), so for each record in File A, I will do a lookup
>>> in File B.
>>> I am using the distributed cache to share File B. The problem is that
>>> when File B is too large (for example, 180M records), I spend too much
>>> CPU time loading it into a hashmap, and I repeat that work in every
>>> map task.
>>>
>>> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
>>> MultithreadedMapper, making the hashmap thread-safe, and sharing this
>>> read-only structure across the mapper threads.
>>>
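>>> What I have in mind is roughly this (untested sketch; "fileB" stands
>>> for the distributed-cache symlink name):
>>>
>>>     import java.io.BufferedReader;
>>>     import java.io.FileReader;
>>>     import java.io.IOException;
>>>     import java.util.Collections;
>>>     import java.util.HashMap;
>>>     import java.util.Map;
>>>     import org.apache.hadoop.io.LongWritable;
>>>     import org.apache.hadoop.io.Text;
>>>     import org.apache.hadoop.mapreduce.Mapper;
>>>
>>>     public class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
>>>       // Built once per JVM, then read concurrently by all threads.
>>>       private static volatile Map<String, String> lookup;
>>>
>>>       @Override
>>>       protected void setup(Context ctx) throws IOException {
>>>         if (lookup == null) {
>>>           synchronized (FilterMapper.class) {
>>>             if (lookup == null) {  // double-checked init, once per JVM
>>>               Map<String, String> m = new HashMap<String, String>();
>>>               BufferedReader r = new BufferedReader(new FileReader("fileB"));
>>>               try {
>>>                 String line;
>>>                 while ((line = r.readLine()) != null) {
>>>                   String[] parts = line.split("\t", 2);
>>>                   m.put(parts[0], parts.length > 1 ? parts[1] : "");
>>>                 }
>>>               } finally {
>>>                 r.close();
>>>               }
>>>               lookup = Collections.unmodifiableMap(m);  // read-only from here
>>>             }
>>>           }
>>>         }
>>>       }
>>>
>>>       @Override
>>>       protected void map(LongWritable off, Text line, Context ctx)
>>>           throws IOException, InterruptedException {
>>>         String key = line.toString().split("\t", 2)[0];
>>>         if (lookup.containsKey(key)) ctx.write(new Text(key), line);  // keep matches
>>>       }
>>>     }
>>>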
>>> Is this a good approach?
