Hi,

On Thu, Oct 27, 2011 at 3:22 AM, Arko Provo Mukherjee
<arkoprovomukher...@gmail.com> wrote:
> Hi,
>
> I have a situation where I have to read a large file into every mapper.
>
> Since it's a large HDFS file that is needed to work on each input to the
> mapper, it is taking a lot of time to read the data into memory from
> HDFS.
>
> As a result, the system is killing all my Mappers with the following message:
>
> 11/10/26 22:54:52 INFO mapred.JobClient: Task Id :
> attempt_201106271322_12504_m_000000_0, Status : FAILED
> Task attempt_201106271322_12504_m_000000_0 failed to report status for 601
> seconds. Killing!
>
> The cluster is not entirely owned by me, so I cannot change
> mapred.task.timeout to give the job enough time to read the entire file.
> Any suggestions?
>
> Also, is there a way for a Mapper instance to read the file once for all
> the inputs that it receives? Currently, since the file-reading code is in
> the map method, I guess it is reading the entire file for each and every
> input, leading to a lot of overhead.
The file should be read in the configure() method (old API) or the setup() method (new API). Those run once per mapper instance, instead of once per input record like map(), so the file is loaded a single time and then reused for every record the mapper processes.

Brock
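For what it's worth, a minimal sketch of the new-API approach might look like the following. The property name "side.data.path" and the way map() uses the cached lines are just placeholders, since the original post doesn't say where the file lives or what is done with it; adapt both to your job.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Contents of the large HDFS file, loaded once per mapper instance in setup().
        private final List<String> sideData = new ArrayList<String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // "side.data.path" is a made-up property name; set it on the job at submit time.
            Path sideFile = new Path(conf.get("side.data.path"));
            FileSystem fs = sideFile.getFileSystem(conf);

            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(sideFile)));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    sideData.add(line);
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // map() now only consults the in-memory copy instead of re-reading
            // the file from HDFS for every input record. The matching logic here
            // is illustrative only.
            for (String cached : sideData) {
                if (value.toString().contains(cached)) {
                    context.write(new Text(cached), value);
                }
            }
        }
    }

With the old API the same idea applies: override configure(JobConf) on your MapReduceBase mapper and do the read there.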