While going through the nutch sources to create an updated
nutch-default.xml, I got some ideas.
Currently the mapred/ndfs engine is seen as just one part of nutch, so
it makes sense to have the mapred/ndfs properties set in the same file
as the rest of the nutch config properties. But as Doug mentioned in an
earlier posting, the mapred/ndfs part will eventually become a project
of its own, and nutch will then be just one application running on the
mapred engine. Since Stefan and some others are currently working on
the NutchConf system, I think this is a good moment to start separating
the mapred/ndfs part from the actual nutch system, to make it easier to
create a separate mapred/ndfs project later.
There are a lot of properties in nutch-default.xml that are set once to
make the cluster work, e.g. fs.default.name, mapred.local.dir or
io.sort.mb. Those properties are independent of the application you run
on the mapred engine, and the configuration file(s) for them have to
live in the local filesystem of the cluster nodes. The rest of the
properties in nutch-default.xml are specific to the application you
run. These application properties should therefore go into different
config files, and those config files should be stored in ndfs, so that
every tasktracker has access to them and a nutch gui could easily read
and modify them, too.
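To illustrate the split, a cluster-level file on each node's local
filesystem could hold the engine properties, while the application
properties stay in the nutch config files in ndfs. This is just a
sketch using nutch's existing <configuration>/<property> file format;
the file names and values below are made-up examples, not an agreed
convention:

```xml
<!-- cluster-level config, stored on each node's local filesystem
     (file name and values are illustrative only) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.org:9000</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/mapred/local</value>
  </property>
</configuration>

<!-- application-level config, stored in ndfs
     (property values are illustrative only) -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
</configuration>
```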
For the nutch application I propose three levels of configuration:
first, an application level that sets default values for all
properties; second, a project level that can override the defaults; and
third, a domain level that can in turn override the project and
application levels. With the extra project level it will be easier to
use one cluster for different projects with different configurations
simultaneously, e.g. one production project and one testing project.
The domain level configuration files are part of a project and override
properties for certain domains, like the number of fetchers, url
filters, url normalizers, refetch interval and so on. The ndfs layout
may then look like this:
/applications
  /nutch
    /conf     <-- default configuration for nutch projects
    /classes  <-- maybe put application specific classes in the
                  ndfs so new applications can be deployed easily
  /app2
    /conf
    /classes
/project1
  /conf     <-- project level configuration files
  /domains  <-- domain level configuration files, one file per domain
  /crawldb
  /index
  /indexes
  /logs     <-- job logging for this project is stored here,
                one logfile per job
  /segments
/project2
  ...
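The lookup rule for the three levels could be as simple as "most
specific level wins". Here is a minimal sketch in Java; the class and
method names are hypothetical and not part of the existing NutchConf
code, and it uses plain java.util.Properties just to show the override
order:

```java
import java.util.Properties;

/**
 * Sketch of the proposed three-level configuration lookup:
 * domain overrides project, project overrides application defaults.
 * Names here are illustrative, not existing NutchConf API.
 */
public class LayeredConf {
    private final Properties application;
    private final Properties project;
    private final Properties domain;

    public LayeredConf(Properties application,
                       Properties project,
                       Properties domain) {
        this.application = application;
        this.project = project;
        this.domain = domain;
    }

    /** Look up a property, most specific level first. */
    public String get(String name) {
        String value = domain.getProperty(name);
        if (value == null) value = project.getProperty(name);
        if (value == null) value = application.getProperty(name);
        return value;
    }

    public static void main(String[] args) {
        Properties app = new Properties();
        app.setProperty("fetcher.threads", "10");
        app.setProperty("db.fetch.interval", "30");

        Properties proj = new Properties();
        proj.setProperty("fetcher.threads", "40"); // project override

        Properties dom = new Properties();
        dom.setProperty("db.fetch.interval", "7"); // domain override

        LayeredConf conf = new LayeredConf(app, proj, dom);
        System.out.println(conf.get("fetcher.threads"));   // 40
        System.out.println(conf.get("db.fetch.interval")); // 7
    }
}
```

A real implementation would of course load these Properties from the
/conf and /domains files in ndfs instead of building them in code.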
Just as the mapred/ndfs configuration should be separated from the
application/nutch configuration, separating cluster level logging from
application level logging would be useful, too. Namenode, datanode and
jobtracker logging output is on the cluster level, and it makes sense
to store it in the nodes' local filesystems. Most of the tasktrackers'
output is application level logging. So my proposal is to start a new
logfile for every job on a tasktracker and store those files in the
project's log directory. Having one logfile per job simplifies
debugging, and putting those files in ndfs makes them easily accessible
by a nutch gui.
So, does this make sense to you?
best regards,
Dominik