While going through the nutch sources to create an updated nutch-default.xml, I got some ideas.

Currently the mapred/ndfs engine is seen as just one part of nutch, so it makes sense to have the mapred/ndfs properties set in the same file as the rest of the nutch config properties. But as Doug mentioned in an earlier posting, the mapred/ndfs part will eventually become a project of its own, and nutch will then be just one application for the mapred engine.

Since Stefan and some others are currently working on the NutchConf system, I think this is a good moment to start separating the mapred/ndfs part from the actual nutch system, to make it easier to create a separate mapred/ndfs project later.

There are a lot of properties in nutch-default.xml that are set once to make the cluster work, e.g. fs.default.name, mapred.local.dir or io.sort.mb. Those properties are independent of the application you run on the mapred engine. The configuration file(s) for these properties have to be in the local filesystem of the cluster nodes. The rest of the properties in nutch-default.xml are specific to the application you run. So these properties should be in different config files and those config files should be in the ndfs so every tasktracker has access to those files and a nutch gui could easily read and modify those files, too.
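To illustrate the split, here is a minimal Python sketch that partitions a flat property map into a cluster level part and an application level part. The prefix list, the function name split_properties and all property values are assumptions made up for this example; which properties really belong to the cluster level would have to be decided by going through nutch-default.xml one by one.

```python
# Hypothetical rule: properties with these prefixes are treated as cluster level.
# This is an assumption for illustration, not a rule taken from nutch.
CLUSTER_PREFIXES = ("fs.", "mapred.", "io.")


def split_properties(props):
    """Split a flat {name: value} map into (cluster, application) maps."""
    cluster, application = {}, {}
    for name, value in props.items():
        # str.startswith accepts a tuple of prefixes
        target = cluster if name.startswith(CLUSTER_PREFIXES) else application
        target[name] = value
    return cluster, application


# Example values are invented; only the property names are real.
example = {
    "fs.default.name": "local",
    "mapred.local.dir": "/tmp/mapred/local",
    "io.sort.mb": "100",
    "http.agent.name": "mybot",
}
cluster_props, app_props = split_properties(example)
```

The cluster part would stay in the local filesystem of each node, while the application part would move into the ndfs.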

For the nutch application I propose creating three levels of configuration: first, the application level, which sets default values for all properties; second, a project level that can override the defaults; and third, a domain level that can in turn override the project and application levels. With the extra project level it will be easier to use one cluster for different projects with different configurations simultaneously, e.g. one production project and one testing project. The domain level configuration files are part of a project and override properties for certain domains, like number of fetchers, url filters, url normalizers, refetch interval and so on. The ndfs layout might then look like this:

/applications
 /nutch
   /conf             <-- default configuration for nutch projects
   /classes          <-- maybe put application specific classes in the ndfs so new applications can be deployed easily
 /app2
   /conf
   /classes
/project1
 /conf               <-- project level configuration files
   /domains       <-- domain level configuration files, one file per domain
 /crawldb
 /index
 /indexes
 /logs               <-- job logging for this project is stored here, one logfile per job
 /segments
/project2
 ....
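
The override order of the three levels can be sketched in a few lines of Python. This is only an illustration of the lookup order (domain first, then project, then application), not a proposal for the actual NutchConf API. The property names and values below are hypothetical, and effective_conf is a made-up helper name.

```python
from collections import ChainMap

# Hypothetical property names and values, for illustration only.
application_conf = {"fetcher.threads": "10", "refetch.interval.days": "30"}
project_conf = {"fetcher.threads": "50"}          # overrides the application default
domain_conf = {"refetch.interval.days": "7"}      # overrides for one domain


def effective_conf(application, project, domain):
    """Return a view where domain overrides project, which overrides application."""
    # ChainMap searches its mappings left to right and returns the first hit.
    return ChainMap(domain, project, application)


conf = effective_conf(application_conf, project_conf, domain_conf)
print(conf["fetcher.threads"])        # 50, taken from the project level
print(conf["refetch.interval.days"])  # 7, taken from the domain level
```

A property that is set on none of the overriding levels simply falls through to the application default.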

Just like separating mapred/ndfs configurations from application/nutch configurations, separating cluster level logging from application level logging would be useful, too. Namenode, datanode and jobtracker logging output is on the cluster level, and it makes sense to store it in the nodes' local filesystem. Most of the tasktrackers' output is application level logging. So my proposal is to start a new logfile for every job on a tasktracker and store those files in the project's log directory. Having one logfile per job simplifies debugging, and putting those files in the ndfs makes them easily accessible to a nutch gui.
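
As a small sketch of the per-job logging idea, the following Python helper builds the ndfs path of a job's logfile inside the project's log directory from the layout above. The function name job_log_path, the job id format and the ".log" suffix are assumptions for this example; nothing here is an existing nutch convention.

```python
import posixpath


def job_log_path(project, job_id):
    """Return the (hypothetical) ndfs path of the logfile for one job."""
    # Layout assumed from the proposal: /<project>/logs/<one file per job>
    return posixpath.join("/", project, "logs", job_id + ".log")


# Example with made-up project and job names.
path = job_log_path("project1", "job_0001")
print(path)  # /project1/logs/job_0001.log
```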

So, does this make sense to you?

best regards,
Dominik
