I understand why you would index in the reduce phase, because the anchor
text gets shuffled to be next to the document.  However, when you index
in the map phase, don't you just have to reindex later?

The main point for the OP is that HDFS is a poor file system for writing
Lucene indexes because of how Lucene works.  The simple approach is to
have each reducer write its index outside of HDFS, and then merge the
reducers' indexes manually.
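
To make the shape of that approach concrete, here is a rough sketch in
Python.  It is only an illustration under stated assumptions: a toy JSON
inverted index stands in for a real Lucene index, a temporary local
directory stands in for each reducer's non-HDFS working space, and the
names build_local_index, merge_indexes, and demo are made up for this
example.  With Lucene itself, the merge step would typically go through
IndexWriter.addIndexes rather than anything like the code below.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def build_local_index(docs, out_dir):
    """Hypothetical stand-in for one reducer: build a toy inverted index
    (term -> list of doc ids) and write it to a local directory, not HDFS."""
    index = defaultdict(list)
    for doc_id, text in docs:
        # Record each distinct term of the document once.
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "index.json"
    path.write_text(json.dumps(index))
    return path


def merge_indexes(paths):
    """Hypothetical stand-in for the manual merge that runs after all
    reducers have finished: union the postings of every partial index."""
    merged = defaultdict(set)
    for path in paths:
        for term, doc_ids in json.loads(path.read_text()).items():
            merged[term].update(doc_ids)
    return {term: sorted(ids) for term, ids in merged.items()}


def demo():
    # Two partitions, as if the shuffle had sent them to two reducers.
    partitions = [
        [("d1", "hadoop writes files"), ("d2", "lucene builds indexes")],
        [("d3", "hadoop runs lucene")],
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = [
            build_local_index(docs, Path(tmp) / f"reducer-{i}")
            for i, docs in enumerate(partitions)
        ]
        return merge_indexes(paths)


if __name__ == "__main__":
    print(demo())
```

The point of the split is that each partial index is written by exactly
one process to storage it owns, so no two nodes ever write to the same
index files; only the merge step sees all of them.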

Ian

Ning Li <ning.li...@gmail.com> writes:

> Or you can check out the index contrib. The difference between the two is:
>   - In Nutch's indexing map/reduce job, indexes are built in the
> reduce phase. Afterwards, they are merged into a smaller number of
> shards if necessary. The last time I checked, the merge process does
> not use map/reduce.
>   - In contrib/index, small indexes are built in the map phase. They
> are merged into the desired number of shards in the reduce phase. In
> addition, they can be merged into existing shards.
>
> Cheers,
> Ning
>
>
> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <imcap...@126.com> wrote:
>> You can look at the Nutch code.
>>
>> 2009/3/13 Mark Kerzner <markkerz...@gmail.com>
>>
>>> Hi,
>>>
>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>
>>> Thank you,
>>> Mark
>>>
>>
