Thanks! I'll try overriding the run method first.

On Wed, Oct 12, 2011 at 3:18 PM, Harsh J <ha...@cloudera.com> wrote:
> Yaron,
>
> That would certainly seem to be the easy way out, with the only
> negative side being that you'd have to cache your values in memory.
>
> If you plug deeper down into the RecordReader levels (which provide
> the specific nextKV(…) methods), you can perhaps keep just a list of
> offsets of all successful line matches and re-read the whole split in
> the second run. This would cost you slightly higher I/O as you seek
> through once again, but the benefit would be lower memory consumption
> -- if that can be a concern here.
>
> [Or go the longer way, and use the Reducer phase!]
>
> On Wed, Oct 12, 2011 at 5:14 PM, Yaron Gonen <yaron.go...@gmail.com>
> wrote:
> > Thanks for the fast reply!
> > I've dug into the code a little bit, and it seems to me that I can
> > achieve my goal by overriding the Mapper.run method: just iterate over
> > the whole split by using context.nextKeyValue() and then call map only
> > with the values I need.
> > Since I'm a novice Hadooper, am I thinking about it the wrong way?
> >
> > thanks again,
> > yaron
> >
> > On Wed, Oct 12, 2011 at 12:44 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Hello Yaron,
> >>
> >> Yes, this is possible to do.
> >>
> >> You need to plug in your own RecordReader implementation into the job,
> >> to control the emits and the action done before feeding key-value pair
> >> data into map(…).
> >>
> >> On Wed, Oct 12, 2011 at 2:42 PM, Yaron Gonen <yaron.go...@gmail.com>
> >> wrote:
> >> > Hi,
> >> > The map method in the Mapper gets as a parameter a single line from
> >> > the split. Is there a way for Mappers to get the whole split as input?
> >> > I'd like to scan the whole split before I decide which key-value
> >> > pairs to emit to the reducer.
> >> > Thanks
> >> > yaron
> >>
> >> --
> >> Harsh J
>
> --
> Harsh J
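
For anyone following the thread, here is a rough sketch of the run-override idea discussed above: read the whole split first, caching the values in memory, then call map only for the values that are wanted. It is plain Java with no Hadoop dependency so that it can be run on its own. SplitContext, ListContext, WholeSplitRunner and the "keep only the longest lines" rule are all hypothetical stand-ins chosen for illustration, not Hadoop classes. In a real job the same two-pass loop would presumably go in an override of Mapper.run(Context), driven by context.nextKeyValue() and context.getCurrentValue().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the two context calls the thread mentions. */
interface SplitContext {
    boolean nextKeyValue();    // advance to the next record; false at end of split
    String getCurrentValue();  // the line the reader is currently positioned on
}

/** In-memory "split" used only to exercise the sketch. */
class ListContext implements SplitContext {
    private final Iterator<String> it;
    private String current;

    ListContext(List<String> lines) {
        this.it = lines.iterator();
    }

    public boolean nextKeyValue() {
        if (!it.hasNext()) {
            return false;
        }
        current = it.next();
        return true;
    }

    public String getCurrentValue() {
        return current;
    }
}

/** Sketch of a run() that scans the whole split before calling map(). */
class WholeSplitRunner {
    final List<String> emitted = new ArrayList<>();

    void run(SplitContext context) {
        // Pass 1: cache every value in memory and gather whatever
        // whole-split information the decision needs (here: max length).
        List<String> values = new ArrayList<>();
        int maxLen = 0;
        while (context.nextKeyValue()) {
            String value = context.getCurrentValue();
            values.add(value);
            maxLen = Math.max(maxLen, value.length());
        }
        // Pass 2: call map() only for the values that pass the rule.
        // The rule (keep the longest lines) is an arbitrary example.
        for (String value : values) {
            if (value.length() == maxLen) {
                map(value);
            }
        }
    }

    // Stands in for Mapper.map(); here it just records what would be emitted.
    void map(String value) {
        emitted.add(value);
    }
}

class WholeSplitDemo {
    public static void main(String[] args) {
        WholeSplitRunner runner = new WholeSplitRunner();
        runner.run(new ListContext(Arrays.asList("a", "ccc", "bb", "ddd")));
        System.out.println(runner.emitted); // prints [ccc, ddd]
    }
}
```

This is the memory-hungry variant Harsh describes, since every value in the split is held in the list. One thing to watch in a real Mapper: Hadoop's record readers generally reuse the same key and value objects between records, so the values would need to be copied before being cached.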