Thanks Lewis and Markus.

@Lewis: I don't have a dedicated cluster (I am currently neither a student nor employed anywhere), so I would be running in pseudo-distributed mode on my laptop. I don't think that would be a good setup for gathering stats. Does the ASF have any cluster that could be used?
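For reference, the single-job idea from the quoted thread below boils down to roughly the following. This is only a plain-Java simulation with no Hadoop dependency, not the attached Injector2.java: the class name, method names, status strings and the "existing entry wins" merge policy are all assumptions made for illustration. In an actual job, Hadoop's MultipleInputs.addInputPath would bind the crawldb path and the seed path each to its own mapper, and the framework would do the shuffle that the TreeMap stands in for here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a single-job inject; every name here is hypothetical.
class SingleJobInjectSketch {

    // Tag for records that come from the seed list rather than the crawldb.
    static final String INJECTED = "injected";

    // Stand-in for the seed-file mapper: emit (url, INJECTED) for each non-blank line.
    static void mapSeeds(List<String> seedLines, Map<String, List<String>> shuffle) {
        for (String line : seedLines) {
            String url = line.trim();
            if (!url.isEmpty()) {
                shuffle.computeIfAbsent(url, k -> new ArrayList<>()).add(INJECTED);
            }
        }
    }

    // Stand-in for the crawldb mapper: pass existing (url, status) pairs through.
    static void mapCrawlDb(Map<String, String> crawlDb, Map<String, List<String>> shuffle) {
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            shuffle.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
    }

    // Stand-in for the reducer: an existing entry wins (assumed policy);
    // a URL seen only in the seed list gets an assumed initial status.
    static Map<String, String> reduce(Map<String, List<String>> shuffle) {
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            String status = "unfetched";
            for (String v : e.getValue()) {
                if (!INJECTED.equals(v)) {
                    status = v;
                    break;
                }
            }
            merged.put(e.getKey(), status);
        }
        return merged;
    }

    // Both inputs feed one shuffle and one reduce pass, i.e. a single "job".
    static Map<String, String> inject(Map<String, String> crawlDb, List<String> seedLines) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        mapCrawlDb(crawlDb, shuffle);
        mapSeeds(seedLines, shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://a.example/", "fetched");
        List<String> seeds = Arrays.asList("http://a.example/", "http://b.example/");
        System.out.println(inject(crawlDb, seeds));
    }
}
```

The point of the sketch is only that both inputs are keyed by URL and merged in one reduce pass, instead of one job to convert the seeds and a second job to merge them into the crawldb.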

Thanks,
Tejas

On Mon, Jan 6, 2014 at 6:54 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi - Yes, MultipleInputs works very well, I did that too when coding the
> HostDB. The MultipleInputs class was not available when the injector was
> originally written; it was introduced around 0.19 or 0.20. I see no reason
> not to replace this, so +1 for a new ticket. If unit tests pass, we're good
> to go.
>
> -----Original message-----
> From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> Sent: Monday 6th January 2014 15:40
> To: dev@nutch.apache.org
> Subject: Re: Inject operation: can't it be done in a single map-reduce job ?
>
> Hi Tejas,
>
> On Sat, Jan 4, 2014 at 8:01 AM, <dev-digest-h...@nutch.apache.org> wrote:
>
> I realized that by using MultipleInputs, we can read CrawlDatum objects
> from crawldb and urls from seeds file simultaneously and perform inject in
> a single map-reduce job. PFA Injector2.java which is an implementation of
> this approach. I did some basic testing on it and so far I have not
> encountered any problems.
>
> Dynamite Tejas. I would kindly ask that you open an issue and apply your
> patch against trunk :)
>
> I am not sure why Injector was not written this way, which is more
> efficient than the one currently in trunk (maybe MultipleInputs was added
> to Hadoop later).
>
> As far as I have discovered, joins have been available in Hadoop's mapred
> package and subsequently in the mapreduce package, so it may not be a case
> of them not being available... however, this does nothing to explain why
> the Injector was not written in this way.
>
> Wondering if I am wrong somewhere in my understanding. Any comments about
> this?
>
> I am curious to discover how much more efficient using the MultipleInputs
> class is over the sequential MR jobs as currently implemented. Do you
> have any comparison on the size of the dataset being used?
>
> There is a script [0] I keep on my github which we can test this against
> (1M URLs). This would provide a reasonable input dataset which we could use
> to base some efficiency tests on.
>
> Great observations Tejas.
>
> Lewis
>
> [0] https://github.com/lewismc/nipt