The MultithreadedMapper won't solve your problem, as all it does is run
parallel map threads within the same map task JVM that a regular mapper
would use. Your data structure won't be shared across the different map
task JVMs on the host, only within the map task's own threads running the
map() function over input records.
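If you still want to try it, a rough sketch of what that per-JVM sharing
looks like is below. The class name, the "fileB" cache symlink, and the
tab-separated record layout are all assumptions on my part, so adapt them
to your data. Since MultithreadedMapper calls setup() once per thread, the
double-checked locking ensures the map is built only once per task JVM:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Loaded at most once per map task JVM; every mapper thread reads the
  // same instance. This does NOT span other map tasks on the same host.
  private static volatile Map<String, String> lookup;

  @Override
  protected void setup(Context context) throws IOException {
    if (lookup == null) {
      synchronized (LookupMapper.class) {
        if (lookup == null) {
          Map<String, String> m = new HashMap<>();
          // "fileB" is assumed to be a distributed-cache symlink holding
          // tab-separated key/value records.
          try (BufferedReader r = new BufferedReader(new FileReader("fileB"))) {
            String line;
            while ((line = r.readLine()) != null) {
              String[] parts = line.split("\t", 2);
              m.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
          }
          lookup = m; // volatile write publishes the map to all threads
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumes the join key is the first tab-separated field.
    String joinKey = value.toString().split("\t", 2)[0];
    if (lookup.containsKey(joinKey)) { // keep only records present in File B
      context.write(new Text(joinKey), value);
    }
  }
}

and in the driver:

job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, LookupMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 8);

But again, this only amortizes the HashMap build across the threads of one
task, not across the tasks themselves.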

That said, wouldn't a reduce-side join be much faster for the larger files?
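Something along these lines (class names and the tab-delimited record
layout are again just assumptions; the mapper tags each record with its
source file, and the reducer keeps File A records only for keys that also
appear in File B):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

  // Tag each record with its source so the reducer can tell A from B.
  public static class TaggingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    @Override
    protected void setup(Context context) {
      // Matching on "fileB" in the file name is an assumption; use
      // whatever distinguishes your two inputs.
      String name = ((FileSplit) context.getInputSplit()).getPath().getName();
      tag = name.startsWith("fileB") ? "B|" : "A|";
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumes the join key is the first tab-separated field.
      String joinKey = value.toString().split("\t", 2)[0];
      context.write(new Text(joinKey), new Text(tag + value));
    }
  }

  // Emit File A records only for keys that also appeared in File B.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> aRecords = new ArrayList<>();
      boolean inFileB = false;
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("B|")) {
          inFileB = true;
        } else {
          aRecords.add(s.substring(2)); // strip the "A|" tag
        }
      }
      if (inFileB) {
        for (String rec : aRecords) {
          context.write(key, new Text(rec));
        }
      }
    }
  }
}

If buffering the A records per key is a concern, a secondary sort that
forces the B-tagged record to arrive first for each key removes the
buffering, but the simple version shows the idea. No per-task HashMap is
built at all, so the 180M-record case stops being a CPU problem in the
mappers.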

On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pedror...@gmail.com> wrote:

> I am developing a job that has 30B records in the input path. (File A)
> I need to filter these records using another file that can have 30K to
> 180M records. (File B)
> So for each record in File A, I will make a lookup in File B.
> I am using the distributed cache to share File B. The problem is that if
> File B is too large (for example 180M records), I spend too much time
> (CPU processing) loading it into a HashMap. I do this allocation in each
> map task.
>
> In Hadoop 2.x, JVM reuse was discontinued. So I am thinking of using
> MultithreadedMapper, making the HashMap thread-safe, and sharing this
> read-only structure across the mappers.
>
> Is this a good approach?
