[ http://issues.apache.org/jira/browse/NUTCH-209?page=comments#action_12365798 ]
Doug Cutting commented on NUTCH-209:
------------------------------------

Andrzej, sorry, I didn't see your remark before I committed this!

A DFSClassLoader would have problems with plugins, since our plugin
mechanism requires that we list a directory to find all defined plugins,
and the ClassLoader API doesn't let one list directories. That could be
fixed, but it's not trivial.

Another way to address this concern is to permit one to specify different
levels of DFS replication for different files. So, while the default might
be 3, a job jar file might be replicated much more, so that individual
nodes are not hit too hard by requests. This is a feature that I believe
Google implements, and one that folks at Yahoo! (who're now contributing to
Hadoop) would like to add to Hadoop.

We could also try to make the job jar smaller, e.g., by only including
enabled plugins.

> include nutch jar in mapred jobs
> --------------------------------
>
>          Key: NUTCH-209
>          URL: http://issues.apache.org/jira/browse/NUTCH-209
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Doug Cutting
>     Priority: Minor
>      Fix For: 0.8-dev
>
> I just added a simple way in Hadoop to specify the job jar file. When
> constructing a JobConf one can specify a class whose containing jar is set to
> be the job's jar. To take advantage of this in Nutch, we could add a util
> class:
>
> public class NutchJob extends JobConf {
>   public NutchJob(Configuration conf) {
>     super(conf, NutchJob.class);
>   }
> }
>
> Then change all of the places where we construct a JobConf to instead
> construct a NutchJob.
> Finally, we should add an ant target called 'job' that constructs a job jar,
> containing all of the classes and the plugins, and make this the default
> target. This way all Nutch code can be distributed with each job as it is
> submitted, and daemons would only need to be restarted when Hadoop code is
> updated.
> Does this sound reasonable?
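
For readers wondering what "a class whose containing jar is set to be the
job's jar" could involve, here is a minimal sketch of one way to locate the
jar a class was loaded from, using only the standard ClassLoader API. It is
an illustration under assumptions, not Hadoop's actual code: the names
JarLocator, jarPathFromUrl and findContainingJar are hypothetical, and
percent-escapes in the URL are deliberately left undecoded.

```java
import java.net.URL;

/** Hypothetical helper for illustration; not part of Hadoop or Nutch. */
public class JarLocator {

  /**
   * Extracts the jar file path from a URL string of the form
   * "jar:file:/path/to/some.jar!/pkg/Name.class".
   * Returns null when the URL does not point inside a jar.
   * Assumption: percent-escapes (e.g. %20) are not decoded here.
   */
  public static String jarPathFromUrl(String url) {
    if (url == null || !url.startsWith("jar:")) {
      return null;
    }
    String path = url.substring("jar:".length());
    if (path.startsWith("file:")) {
      path = path.substring("file:".length());
    }
    // Everything after '!' names the entry inside the jar; drop it.
    int bang = path.indexOf('!');
    if (bang >= 0) {
      path = path.substring(0, bang);
    }
    return path;
  }

  /**
   * Asks the class's own loader where its .class resource lives and,
   * if that location is inside a jar, returns the jar's path.
   * Returns null for classes loaded from a plain directory.
   */
  public static String findContainingJar(Class<?> clazz) {
    String resource = clazz.getName().replace('.', '/') + ".class";
    ClassLoader loader = clazz.getClassLoader();
    // Bootstrap classes report a null loader; fall back to the system loader.
    URL url = (loader != null)
        ? loader.getResource(resource)
        : ClassLoader.getSystemResource(resource);
    return (url == null) ? null : jarPathFromUrl(url.toString());
  }

  public static void main(String[] args) {
    // Prints the jar path when run from a jar, or "null" when run from
    // a directory of compiled classes.
    System.out.println(findContainingJar(JarLocator.class));
  }
}
```

Note that this only ever looks up a single named resource, which the
ClassLoader API supports; it is the directory listing needed for plugin
discovery that the API does not offer.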
