Cam Bazz wrote:
hello,

I have a Lucene index storing documents that hold src and dst words. Word
pairs may repeat (it is a multigraph).

I want to use Hadoop to count how many of the same word pairs there are. I
have looked at the aggregate word count example, and I understand that if I
make a txt file such as

src1>dst2
src2>dst2
src1>dst2

..

and use something similar to the aggregate word count example, I will get
the desired result.
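The counting the aggregate word count example performs can be sketched locally in plain Python (a simulation of the map and reduce phases, not actual Hadoop code; the sample lines match the text file above):

```python
from collections import Counter

# Lines in the same "src>dst" form as the text file described above.
lines = [
    "src1>dst2",
    "src2>dst2",
    "src1>dst2",
]

# Map phase: each line becomes a (pair, 1) record.
mapped = [(line, 1) for line in lines]

# Reduce phase: sum the counts per pair, which is what the aggregate
# word count example's sum aggregator does for each key.
counts = Counter()
for pair, one in mapped:
    counts[pair] += one

print(dict(counts))  # {'src1>dst2': 2, 'src2>dst2': 1}
```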

Now, questions: how can I hook up my Lucene index to Hadoop? Is there a
better way than dumping the index to a text file with >'s, copying this to
DFS, and getting the results back?
Yes, you can implement an InputFormat that reads from the Lucene index. You can use the implementation in the Nutch project as a reference: the classes DeleteDuplicates$InputFormat and DeleteDuplicates$DDRecordReader.
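To illustrate what such an InputFormat/RecordReader boils down to, here is a Python sketch (the real Nutch classes are Java; the in-memory "index" below is a made-up stand-in for a Lucene IndexReader over one split's range of doc ids):

```python
# Fake in-memory index: a real implementation would open a Lucene
# IndexReader per split and read the stored src/dst fields by doc id.
fake_index = [
    {"src": "src1", "dst": "dst2"},
    {"src": "src2", "dst": "dst2"},
    {"src": "src1", "dst": "dst2"},
]

def record_reader(index, start, end):
    """Yield (key, value) records for one split: doc id -> 'src>dst'."""
    for doc_id in range(start, end):
        doc = index[doc_id]
        yield doc_id, f"{doc['src']}>{doc['dst']}"

# One split covering the first two documents, as Hadoop would hand
# to a single map task.
for doc_id, pair in record_reader(fake_index, 0, 2):
    print(doc_id, pair)
```

The mapper then receives these 'src>dst' values directly, so no intermediate text dump on DFS is needed.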
How can I make incremental runs? (Once the index is processed and I have
the results, how can I add more data so it does not start from the
beginning?)
As far as I know, there is no easy way to do this. Why do you keep your data as a Lucene index?
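One workaround (not a built-in Hadoop feature, just a common pattern) is to run the counting job only over documents added since the last run, then merge the new counts into the previous totals; because the counts are plain sums, merging is a per-pair addition:

```python
from collections import Counter

# Hypothetical totals from the previous run, and counts from a job
# run over only the newly added documents.
previous = Counter({"src1>dst2": 2, "src2>dst2": 1})
new_run = Counter({"src1>dst2": 1, "src3>dst1": 4})

# Counter addition sums per key, so totals stay correct as long as
# each document is counted in exactly one run.
merged = previous + new_run
print(dict(merged))  # {'src1>dst2': 3, 'src2>dst2': 1, 'src3>dst1': 4}
```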
Best regards,

-C.B.
