I should have pointed out that the Nutch index build and contrib/index target different applications. The latter is for applications that simply want to build a Lucene index from a set of documents, e.g. with no link analysis.
As to writing Lucene indexes, both work the same way: they write the final results to the local file system and then copy them to HDFS. In contrib/index, the intermediate results are kept in memory and are not written to HDFS.

Hope this clarifies things.

Cheers,
Ning

On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <[email protected]> wrote:
>
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document. However, when you index
> in the map phase, don't you just have to reindex later?
>
> The main point to the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works. The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
>
> Ian
>
> Ning Li <[email protected]> writes:
>
>> Or you can check out the index contrib. The difference of the two is that:
>> - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into smaller number of
>> shards if necessary. The last time I checked, the merge process does
>> not use map/reduce.
>> - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>>
>> Cheers,
>> Ning
>>
>>
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <[email protected]> wrote:
>>> you can see the nutch code.
>>>
>>> 2009/3/13 Mark Kerzner <[email protected]>
>>>
>>>> Hi,
>>>>
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>
>
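P.S. In case it helps to picture the contrib/index flow described in the thread (small indexes built in the map phase, merged into the desired number of shards in the reduce phase), here is a rough toy sketch in plain Python. It does not use Hadoop or Lucene at all: the names `shard_of`, `map_phase`, `reduce_phase` and `search` are made up for illustration, the "index" is just a dict from term to document ids, and the CRC-based shard assignment is an assumption, not necessarily what contrib/index does.

```python
import zlib
from collections import defaultdict


def shard_of(doc_id, num_shards):
    # Hypothetical distribution policy: a stable hash of the document id.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def map_phase(split, num_shards):
    # One map task: build small in-memory inverted indexes, one per target
    # shard, from the documents in this input split. Nothing touches disk.
    small = defaultdict(lambda: defaultdict(set))
    for doc_id, text in split:
        shard = shard_of(doc_id, num_shards)
        for term in text.lower().split():
            small[shard][term].add(doc_id)
    return small


def reduce_phase(map_outputs, num_shards):
    # Reduce side: merge every small index into the shard it was built for.
    shards = [defaultdict(set) for _ in range(num_shards)]
    for small in map_outputs:
        for shard, index in small.items():
            for term, doc_ids in index.items():
                shards[shard][term] |= doc_ids
    return shards


def search(shards, term):
    # Query all shards and combine the hits.
    hits = set()
    for shard in shards:
        hits |= shard.get(term.lower(), set())
    return sorted(hits)


if __name__ == "__main__":
    splits = [
        [("d1", "hadoop writes to hdfs"), ("d2", "lucene builds an index")],
        [("d3", "merge the index shards"), ("d4", "hdfs stores the shards")],
    ]
    shards = reduce_phase([map_phase(s, 2) for s in splits], 2)
    print(search(shards, "index"))
```

In the real thing, each reducer would write its finished shard to the local file system first and only then copy it to HDFS, as noted above; the sketch leaves that step out.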
