Hello,

When I use a custom input format, as in the Nutch project, do I have to
keep my index in DFS or in the regular file system?

By the way, are there any alternatives to Nutch?


Best Regards


-C.B.

On Fri, Jun 27, 2008 at 10:08 AM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

> Cam Bazz wrote:
>
>> hello,
>>
>> I have a Lucene index storing documents that hold src and dst words.
>> Word pairs may repeat (it is a multigraph).
>>
>> I want to use Hadoop to count how many of the same word pairs there are.
>> I have looked at the aggregate word count example, and I understand that
>> if I make a text file such as
>>
>> src1>dst2
>> src2>dst2
>> src1>dst2
>>
>> ..
>>
>> and use something similar to the aggregate word count example, I will get
>> the desired result.
>>
>> Now a question: how can I hook up my Lucene index to Hadoop? Is there a
>> better way than dumping the index to a text file with >'s, copying this
>> to DFS, and getting the results back?
>>
>>
> Yes, you can implement an InputFormat to read from the Lucene index. You
> can use the implementation in the Nutch project; see the classes
> DeleteDuplicates$InputFormat and DeleteDuplicates$DDRecordReader.
>
>> how can I make incremental runs? (once the index is processed and I have
>> the results, how can I add more data so it does not start from the
>> beginning?)
>>
>>
> As far as I know, there is no easy way to do this. Why do you keep your
> data as a Lucene index?
>
>> Best regards,
>>
>> -C.B.
>>
>>
>>
>
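(A note for the archive: the counting step discussed above can be sketched in plain Java, outside Hadoop, to make clear what the aggregate word count job would compute over the dumped "src>dst" lines. The class name and the sample edges below are made up for illustration; a real job would of course distribute this across mappers and reducers instead of using one in-memory map.)

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairCount {
    // Count how many times each "src>dst" pair occurs -- the same result
    // the aggregate word count job would produce over the dumped text file.
    static Map<String, Integer> countPairs(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line.trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample edges matching the example in the thread.
        List<String> edges = Arrays.asList("src1>dst2", "src2>dst2", "src1>dst2");
        for (Map.Entry<String, Integer> e : countPairs(edges).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```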
