Perhaps combining a MultithreadedMapper with a CombineFileInputFormat may help (it reduces the total # of maps, but you get more threads per map task).
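A minimal driver sketch of that combination, assuming the Hadoop 2.x `mapreduce` API (the Hadoop client jars must be on the classpath, and `FilterMapper` is a hypothetical mapper class standing in for your own filtering logic) — this is a config fragment, not a tested program:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter-join");
        job.setJarByClass(JoinDriver.class);

        // Pack many small files/blocks into fewer, larger splits:
        // fewer map task JVMs, so File B is loaded fewer times.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);

        // Run several map threads inside each task JVM; they can all
        // read one shared, read-only copy of the File B lookup structure.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, FilterMapper.class); // hypothetical
        MultithreadedMapper.setNumberOfThreads(job, 8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The split size and thread count are illustrative; tune them against your cluster's container sizes and the cost of loading File B.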
On Mon, Aug 24, 2015 at 2:16 PM twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> We have been using the JVM reuse feature for the same reason of sharing
> the same structure across multiple map tasks. A multithreaded map task does
> that partially, as the same copy is used within its multiple threads.
>
> Depending upon the hardware availability, one can get the same performance.
>
> Thanks,
>
> On Mon, Aug 24, 2015 at 1:37 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> The MultithreadedMapper won't solve your problem, as all it does is run
>> parallel maps within the same map task JVM as a non-MT one. Your data
>> structure won't be shared across the different map task JVMs on the host,
>> but just within the map task's own multiple threads running the map()
>> function over input records.
>>
>> Wouldn't doing a reduce-side join for the larger files be much faster?
>>
>> On Sun, Aug 23, 2015 at 5:08 AM Pedro Magalhaes <pedror...@gmail.com> wrote:
>>
>>> I am developing a job that has 30B records in the input path (File A).
>>> I need to filter these records using another file that can have 30K to
>>> 180M records (File B).
>>> So for each record in File A, I will make a lookup in File B.
>>> I am using the distributed cache to share File B. The problem is that if
>>> File B is too large (for example, 180M records), I spend too much
>>> time (CPU processing) loading it into a HashMap. I do this allocation in
>>> each map task.
>>>
>>> In Hadoop 2.x, JVM reuse was discontinued, so I am thinking of using
>>> MultithreadedMapper, making the HashMap thread-safe, and sharing this
>>> read-only structure across the mappers.
>>>
>>> Is this a good approach?
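The per-JVM sharing the thread discusses can be sketched in plain Java: load the lookup table once per JVM into an immutable map, which all mapper threads inside one MultithreadedMapper task then read concurrently. The class below (`Lookup` is a hypothetical name; the real job would read the distributed-cache copy of File B instead of the hard-coded records) uses the initialization-on-demand holder idiom, so the JVM itself guarantees the map is built exactly once, with no explicit locking on the read path:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one read-only copy of the File B lookup table
// per JVM, shared by all mapper threads in a MultithreadedMapper task.
public class Lookup {
    // Initialization-on-demand holder: Holder.MAP is built exactly once,
    // the first time any thread calls get(), and safely published by the
    // JVM's class-initialization guarantees.
    private static final class Holder {
        static final Map<String, String> MAP = load();
    }

    private static Map<String, String> load() {
        // In the real job this would parse the distributed-cache copy of
        // File B; a couple of fake records stand in here for illustration.
        Map<String, String> m = new HashMap<>();
        m.put("k1", "v1");
        m.put("k2", "v2");
        // Wrap as unmodifiable: readers cannot mutate it, so concurrent
        // reads need no synchronization.
        return Collections.unmodifiableMap(m);
    }

    public static String get(String key) {
        return Holder.MAP.get(key);
    }
}
```

Because the map is never mutated after construction, it does not need to be "thread-safe" in the locking sense; immutability plus safe publication is enough, and each map thread calls `Lookup.get(...)` from its `map()` function.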