Hi all,

I am looking into reusing some existing code for distributed indexing to test a Mahout tool I am working on: https://issues.apache.org/jira/browse/MAHOUT-944
What I want is to index the Apache Public Mail Archives dataset (200G) via MapReduce on Hadoop. I have been going through the Nutch and contrib/index code, and from my understanding I have to:

* Create an InputFormat / RecordReader / InputSplit class for splitting the e-mails across mappers
* Create a Mapper which emits the e-mails as key/value pairs
* Create a Reducer which indexes the e-mails on the local filesystem (or straight to HDFS?)
* Copy these indexes from the local filesystem to HDFS. In the same Reducer?

I am unsure about the final steps: how to get to the end result, a bunch of index shards on HDFS. It seems that each Reducer needs to know which directory on HDFS it eventually writes to, but I don't see how to get each Reducer to copy its shard there. How do I set this up?

Cheers,
Frank
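
P.S. To make the last two steps more concrete, here is a rough sketch of the kind of Reducer I have in mind (untested, names are just placeholders): it builds a Lucene index in a task-local directory and copies the finished shard into the job's output directory on HDFS in cleanup(). I'm assuming Lucene 3.x, the new mapreduce API, that the Mapper emits message id as key and mail body as value, and that the driver has set the output path via FileOutputFormat.setOutputPath().

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Rough sketch: each reduce task builds one Lucene index shard on local disk,
// then copies it to HDFS once the shard is complete.
public class EmailIndexingReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

  private File localIndexDir;   // task-local working directory for this shard
  private IndexWriter writer;

  @Override
  protected void setup(Context context) throws IOException {
    // One shard per reduce task; the task id keeps shard names unique.
    int shardId = context.getTaskAttemptID().getTaskID().getId();
    localIndexDir = new File("index-shard-" + shardId);
    localIndexDir.mkdirs();
    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
    writer = new IndexWriter(FSDirectory.open(localIndexDir), config);
  }

  @Override
  protected void reduce(Text messageId, Iterable<Text> bodies, Context context)
      throws IOException, InterruptedException {
    // Assumes the Mapper emitted (message id, mail body) pairs.
    for (Text body : bodies) {
      Document doc = new Document();
      doc.add(new Field("message_id", messageId.toString(),
          Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("body", body.toString(),
          Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    writer.close();
    // Copy the finished local shard into the job's output directory on HDFS.
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path shardDir = new Path(FileOutputFormat.getOutputPath(context),
        "shard-" + context.getTaskAttemptID().getTaskID().getId());
    fs.copyFromLocalFile(new Path(localIndexDir.getAbsolutePath()), shardDir);
  }
}

Doing the copy in cleanup() rather than per reduce() call would mean the shard only lands on HDFS once the index is fully written and closed. Is that roughly the right shape, or is there a better way to tell each Reducer where its shard should end up?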