[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887000#action_12887000 ]
Pham Tuan Minh commented on NUTCH-843:
--------------------------------------

Hi,

I found that, after building the runtime, nutch-2.0-dev.job and the local\lib
directory contain different versions of the same libraries:

ant-1.7.1.jar
ant-1.6.5.jar

servlet-api-2.5-20081211.jar
servlet-api-2.5-6.1.14.jar

Thanks,

> Separate the build and runtime environments
> -------------------------------------------
>
>                 Key: NUTCH-843
>                 URL: https://issues.apache.org/jira/browse/NUTCH-843
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 2.0
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>             Fix For: 2.0
>
>         Attachments: NUTCH-843.patch, NUTCH-843.patch
>
>
> Currently there is no clean separation of source, build and runtime
> artifacts. On one hand, this makes it easier to get started in local mode,
> but on the other hand it makes the distributed (or pseudo-distributed) setup
> much more challenging and tricky. Also, some resources (config files and
> classes) are included several times on the classpath and loaded under
> different classloaders, and in the end it's not obvious which copy takes
> precedence, or why.
> Here's an example of a harmful unintended behavior caused by this mess:
> Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on
> their classpath. This means that a task running on this cluster will have two
> copies of resources from these locations - one from the classpath inherited
> from the tasktracker, and the other one from the just-unpacked nutch.job
> file. If these two versions differ, only the first one will be loaded, which
> in this case is the one taken from the (unpacked) conf/ and build/ - the
> other one, from within the nutch.job file, will be ignored.
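The precedence problem described above can be sketched in a few lines of Python. This is only a simplified model of "first match wins" lookup on an ordered classpath, not how the JVM or Hadoop actually load resources; the helper names `resolve` and `shadowed`, the class file name, and the ordering of the entries are assumptions made for illustration.

```python
# Simplified model of "first match wins" resource lookup on an ordered
# classpath. Illustration only; real lookups are done by JVM classloaders.

def resolve(classpath, resource):
    """Return the first classpath entry that provides `resource`, or None."""
    for entry, resources in classpath:
        if resource in resources:
            return entry
    return None


def shadowed(classpath):
    """Map each resource provided more than once to (winner, ignored entries)."""
    providers = {}
    for entry, resources in classpath:
        for resource in resources:
            providers.setdefault(resource, []).append(entry)
    return {
        resource: (entries[0], entries[1:])
        for resource, entries in providers.items()
        if len(entries) > 1
    }


# Hypothetical task classpath: entries inherited from the tasktracker come
# first, the unpacked job file comes last. The resource names are examples.
task_classpath = [
    ("conf/", {"nutch-default.xml"}),
    ("build/", {"example/SomeClass.class"}),
    ("nutch.job", {"nutch-default.xml", "example/SomeClass.class"}),
]

print(resolve(task_classpath, "nutch-default.xml"))
print(shadowed(task_classpath))
```

Listing every resource that has more than one provider, as `shadowed` does, makes it visible which copy would silently be ignored when the two versions differ.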
> It's even worse when you add more nodes to the cluster - the nutch.job will
> be shipped to the new nodes as a part of each task setup, but now the remote
> tasktracker child processes will use resources from nutch.job - so some tasks
> will use different versions of resources than other tasks. This usually leads
> to a host of issues that are very difficult to debug.
> This issue therefore proposes to separate these environments into the
> following areas:
> * source area - i.e. our current sources. Note that bin/ scripts will belong
> to this category too, so there will be no top-level bin/. nutch-default.xml
> belongs to this category too. Other customizable files can be moved to
> src/conf too, or they could stay in top-level conf/ as today, with a README
> that explains that changes made there take effect only after you rebuild the
> job jar.
> * build area - contains build artifacts, among them the nutch.job jar.
> * runtime (or deploy) area - this area contains all artifacts needed to run
> Nutch jobs. For a distributed setup that uses an existing Hadoop cluster
> (installed from a plain vanilla Hadoop release) this will be a {{/deploy}}
> directory, where we put the following:
> {code}
> bin/nutch
> nutch.job
> {code}
> That's it - nothing else should be needed, because all other resources are
> already included in the job jar. These resources can be copied directly to
> the master Hadoop node.
> For a local setup (using LocalJobTracker) this will be a {{/runtime}}
> directory, where we put the following:
> {code}
> bin/nutch
> lib/hadoop-libs
> plugins/
> nutch.job
> {code}
> Due to limitations in the PluginClassLoader the local runtime requires that
> the plugins/ directory be unpacked from the job jar. And we need the hadoop
> libs to run in the local mode.
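Since the local runtime area bundles a lib/ directory, the duplicate libraries reported in the comment at the top of this message could be flagged with a small check like the one below. This is a minimal sketch, not part of the Nutch build: the function name `find_duplicate_jars` and the name-version regular expression are assumptions, and the pattern only handles jar names of the form `name-version.jar` where the version starts with a digit.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <artifact>-<version>.jar, version starts with a digit.
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(jar_names):
    """Return {artifact: sorted versions} for artifacts present in >1 version."""
    versions = defaultdict(set)
    for jar in jar_names:
        match = JAR_PATTERN.match(jar)
        if match:  # names that do not fit the assumed scheme are skipped
            versions[match.group("name")].add(match.group("version"))
    return {
        name: sorted(found)
        for name, found in versions.items()
        if len(found) > 1
    }


# The four jar names come from the comment; example-lib-1.0.jar is hypothetical.
jars = [
    "ant-1.7.1.jar",
    "ant-1.6.5.jar",
    "servlet-api-2.5-20081211.jar",
    "servlet-api-2.5-6.1.14.jar",
    "example-lib-1.0.jar",
]

for name, found in find_duplicate_jars(jars).items():
    print(name, found)
```

In practice the list of names could be collected from the lib/ directory and from the contents of the job file before comparing; how those are enumerated is left out here.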
> We may later on refine this local setup to something like this:
> {code}
> bin/nutch
> conf/
> lib/hadoop-libs
> lib/nutch-libs
> plugins/
> nutch.jar
> {code}
> so that it's easier to modify the config without rebuilding the job jar
> (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.