Separating mapred/ndfs and nutch search engine

Dominik Friedrich Sun, 15 Jan 2006 09:43:27 -0800

While going through the nutch sources to create an updated nutch-default.xml I got some ideas.

Currently the mapred/ndfs engine is seen as just one part of nutch, so it makes sense to have the mapred/ndfs properties set in the same file as the rest of the nutch config properties. But as Doug mentioned in an earlier posting, the mapred/ndfs part will eventually become a project of its own, and then nutch will be just one application for the mapred engine.

Since Stefan and some others are currently working on the NutchConf system, I think this is a good moment to start separating the mapred/ndfs part from the actual nutch system, to make it easier to create a separate mapred/ndfs project later.

There are a lot of properties in nutch-default.xml that are set once to make the cluster work, e.g. fs.default.name, mapred.local.dir or io.sort.mb. Those properties are independent of the application you run on the mapred engine. The configuration file(s) for these properties have to be in the local filesystem of the cluster nodes. The rest of the properties in nutch-default.xml are specific to the application you run. So the application-specific properties should be in separate config files, and those config files should be in the ndfs so every tasktracker has access to them and a nutch gui could easily read and modify them, too.
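To make the split a bit more concrete, here is a minimal sketch of how one flat property set could be divided into a cluster-level group and an application-level group. Classifying by name prefix is only an assumption for illustration (in practice some mapred.* properties are probably job specific), and the two application-level names and all values below are placeholders, not taken from the real nutch-default.xml.

```python
# Hypothetical prefixes that mark a property as cluster level.
# This rule is an assumption for illustration, not how nutch decides it.
CLUSTER_PREFIXES = ("fs.", "mapred.", "io.")


def split_properties(props):
    """Split a flat property dict into (cluster, application) dicts."""
    cluster, application = {}, {}
    for name, value in props.items():
        # str.startswith accepts a tuple of prefixes.
        target = cluster if name.startswith(CLUSTER_PREFIXES) else application
        target[name] = value
    return cluster, application


# Placeholder values; only the first three names come from the text above.
props = {
    "fs.default.name": "local",
    "mapred.local.dir": "/tmp/mapred/local",
    "io.sort.mb": "100",
    "example.fetcher.threads": "10",
    "example.refetch.interval": "30",
}
cluster, application = split_properties(props)
```

The cluster group would then live in a file on each node's local filesystem, and the application group in a file stored in the ndfs.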

For the nutch application I propose creating three levels of configuration: first, the application level, which sets default values for all properties; second, a project level that can override the defaults; and third, a domain level that can in turn override the project and application levels. With the extra project level it will be easier to use one cluster for different projects with different configurations simultaneously, e.g. one production project and one testing project. The domain level configuration files are part of a project and override properties for certain domains, like number of fetchers, url filters, url normalizers, refetch interval and so on. The ndfs layout may then look like this:


/applications
 /nutch
   /conf             <-- default configuration for nutch projects
   /classes          <-- maybe put application specific classes in the ndfs so new applications can be deployed easily
 /app2
   /conf
   /classes
/project1
 /conf               <-- project level configuration files
   /domains          <-- domain level configuration files, one file per domain
 /crawldb
 /index
 /indexes
 /logs               <-- job logging for this project is stored here, one logfile per job
 /segments
/project2
 ....
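
The three-level override described above could be sketched roughly as follows. This is only an illustration of the lookup order (domain first, then project, then application defaults), not a proposal for the actual NutchConf code; the property names and values are hypothetical.

```python
from collections import ChainMap


def effective_config(application, project, domain=None):
    """Return a view in which domain overrides project overrides application."""
    # ChainMap searches its mappings left to right, so the most
    # specific level goes first.
    return ChainMap(domain or {}, project, application)


# Hypothetical property names and values, for illustration only.
app_defaults = {"example.fetcher.threads": "10", "example.refetch.interval": "30"}
project_conf = {"example.fetcher.threads": "20"}
domain_conf = {"example.refetch.interval": "7"}

conf = effective_config(app_defaults, project_conf, domain_conf)
threads = conf["example.fetcher.threads"]    # taken from the project level
interval = conf["example.refetch.interval"]  # taken from the domain level
```

A project without a domain file for a given host would simply fall back to the project and application values.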

Just as mapred/ndfs configuration should be separated from application/nutch configuration, separating cluster level logging from application level logging would be useful, too. Namenode, datanode and jobtracker logging output is on the cluster level and it makes sense to store it in the node's local filesystem. Most of the tasktracker's output is application level logging. So my proposal is to start a new logfile for every job on a tasktracker and store those files in a project's log directory. Having one logfile per job simplifies debugging, and putting those files in the ndfs makes them easily accessible by a nutch gui.
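
As a rough sketch of the one-logfile-per-job idea: the snippet below opens a separate log file for each job under a project's logs directory. It writes to a local temporary directory as a stand-in for the ndfs, and the function name, job id format and file naming are all assumptions made for the example.

```python
import logging
import os
import tempfile


def open_job_log(project_root, job_id):
    """Create <project_root>/logs/<job_id>.log and return (logger, handler, path)."""
    log_dir = os.path.join(project_root, "logs")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, job_id + ".log")

    logger = logging.getLogger("job." + job_id)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep job output out of the cluster level log
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, handler, path


# A local temp directory stands in for the project directory in the ndfs.
project_root = os.path.join(tempfile.mkdtemp(), "project1")
logger, handler, path = open_job_log(project_root, "job_0001")
logger.info("fetch started")

# Close the file when the job ends so the next job starts a fresh logfile.
handler.close()
logger.removeHandler(handler)
```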


So, does this make sense to you?

best regards,
Dominik
