I should have pointed out that Nutch index build and contrib/index
targets different applications. The latter is for applications who
simply want to build Lucene index from a set of documents - e.g. no
link analysis.

As to writing Lucene indexes, both work the same way - write the final
results to local file system and then copy to HDFS. In contrib/index,
the intermediate results are in memory and not written to HDFS.

Hope it clarifies things.

Cheers,
Ning


On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ian.sobor...@nist.gov> wrote:
>
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document.  However, when you index
> in the map phase, don't you just have to reindex later?
>
> The main point to the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works.  The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
>
> Ian
>
> Ning Li <ning.li...@gmail.com> writes:
>
>> Or you can check out the index contrib. The difference of the two is that:
>>   - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into smaller number of
>> shards if necessary. The last time I checked, the merge process does
>> not use map/reduce.
>>   - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>>
>> Cheers,
>> Ning
>>
>>
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <imcap...@126.com> wrote:
>>> you can see the nutch code.
>>>
>>> 2009/3/13 Mark Kerzner <markkerz...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>
>
>

Reply via email to