Separate the build and runtime environments
-------------------------------------------

                 Key: NUTCH-843
                 URL: https://issues.apache.org/jira/browse/NUTCH-843
             Project: Nutch
          Issue Type: Improvement
          Components: build
    Affects Versions: 2.0
            Reporter: Andrzej Bialecki 
            Assignee: Andrzej Bialecki 


Currently there is no clean separation of source, build and runtime artifacts. 
On the one hand, this makes it easier to get started in local mode, but on the 
other hand it makes the distributed (or pseudo-distributed) setup much more 
challenging. Also, some resources (config files and classes) are included 
several times on the classpath and loaded under different classloaders, so in 
the end it's not obvious which copy takes precedence, or why.
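
One way to see the duplication is to ask a classloader for every copy of a 
resource it can find. The following is only a minimal sketch: the class name 
{{DuplicateResourceFinder}} and its helper are hypothetical and not part of 
Nutch, and it inspects only the classloader that loaded the class itself, so 
it says nothing about other classloaders in the same JVM.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of Nutch.
class DuplicateResourceFinder {

    // Returns every copy of the named resource visible to the loader,
    // in the order the loader reports them.
    static List<URL> copiesOf(ClassLoader loader, String name) throws IOException {
        return Collections.list(loader.getResources(name));
    }

    public static void main(String[] args) throws IOException {
        // Default to nutch-default.xml; any resource name can be passed instead.
        String name = args.length > 0 ? args[0] : "nutch-default.xml";
        ClassLoader loader = DuplicateResourceFinder.class.getClassLoader();
        List<URL> copies = copiesOf(loader, name);
        System.out.println(copies.size() + " copies of " + name);
        for (URL copy : copies) {
            System.out.println("  " + copy);
        }
    }
}
```

Run with the same classpath as the process under suspicion; more than one 
line of output for a config file means that file is present more than once.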

Here's an example of harmful unintended behavior caused by this mess: Hadoop 
daemons (jobtracker and tasktracker) get conf/ and build/ on their classpath. 
This means that a task running on this cluster will see two copies of the 
resources from these locations - one on the classpath inherited from the 
tasktracker, and the other from the just-unpacked nutch.job file. If these two 
versions differ, only the first one is loaded, which in this case is the one 
taken from the (unpacked) conf/ and build/ - the other one, from within the 
nutch.job file, is ignored.
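
The "first copy wins" effect can be reproduced outside Hadoop. The sketch 
below is an illustration only: it uses a plain {{URLClassLoader}} over two 
temporary directories that stand in for conf/ and the unpacked nutch.job, and 
it models simple classpath ordering rather than the actual way a tasktracker 
builds a task's classpath. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Nutch or Hadoop.
class ClasspathShadowingDemo {

    // Creates a temp directory holding one stand-in resource file.
    static Path dirWith(String prefix, String name, String content) throws IOException {
        Path dir = Files.createTempDirectory(prefix);
        Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    // Resolves `name` against the entries in the given order and returns
    // the first line of whichever copy the loader picks.
    static String winningCopy(String name, Path... entries) throws IOException {
        URL[] urls = new URL[entries.length];
        for (int i = 0; i < entries.length; i++) {
            urls[i] = entries[i].toUri().toURL();
        }
        // Parent is null so that only these entries are searched.
        try (URLClassLoader loader = new URLClassLoader(urls, null);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     loader.getResourceAsStream(name), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String name = "nutch-site.xml";
        Path inherited = dirWith("conf", name, "copy from conf/");
        Path unpackedJob = dirWith("job", name, "copy from nutch.job");
        // Same two copies, different order: the earlier entry shadows the later one.
        System.out.println(winningCopy(name, inherited, unpackedJob));
        System.out.println(winningCopy(name, unpackedJob, inherited));
    }
}
```

The first call returns the copy from the conf/ stand-in and the second the 
copy from the nutch.job stand-in; the shadowed copy is never read and no 
warning is given.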

It's even worse when you add more nodes to the cluster - the nutch.job will be 
shipped to the new nodes as a part of each task setup, but now the remote 
tasktracker child processes will use resources from nutch.job - so some tasks 
will use different versions of resources than other tasks. This usually leads 
to a host of issues that are very difficult to debug.

This issue therefore proposes separating these environments into the 
following areas:

* source area - i.e. our current sources. The bin/ scripts belong to this 
category as well, so there will be no top-level bin/. nutch-default.xml also 
belongs here. Other customizable files can be moved to src/conf as well, or 
they could stay in top-level conf/ as today, with a README explaining that 
changes made there take effect only after you rebuild the job jar.

* build area - contains build artifacts, among them the nutch.job jar.

* runtime (or deploy) area - this area contains all artifacts needed to run 
Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
(installed from a plain vanilla Hadoop release) this will be a {{/deploy}} 
directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are 
already included in the job jar. These resources can be copied directly to the 
master Hadoop node.

For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader, the local runtime requires that 
the plugins/ directory be unpacked from the job jar. We also need the Hadoop 
libs to run in local mode. We may later refine this local setup to something 
like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the config without rebuilding the job jar (which 
actually would not be used in this case).
