Hello,

When I use a custom input format, as in the nutch project, do I have to keep my index in DFS or in the regular file system?
By the way, are there any alternatives to nutch?

Best Regards,
-C.B.

On Fri, Jun 27, 2008 at 10:08 AM, Enis Soztutar <[EMAIL PROTECTED]> wrote:
> Cam Bazz wrote:
>> Hello,
>>
>> I have a Lucene index storing documents which hold src and dst words.
>> Word pairs may repeat (it is a multigraph).
>>
>> I want to use Hadoop to count how many of the same word pairs there are.
>> I have looked at the aggregate word count example, and I understand that
>> if I make a txt file such as
>>
>> src1>dst2
>> src2>dst2
>> src1>dst2
>>
>> ..
>>
>> and use something similar to the aggregate word count example, I will
>> get the desired result.
>>
>> Now the questions: how can I hook up my Lucene index to Hadoop? Is there
>> a better way than dumping the index to a text file with >'s, copying
>> this to DFS, and getting the results back?
>>
> Yes, you can implement an InputFormat to read from the Lucene index. You
> can use the implementation in the nutch project, the classes
> DeleteDuplicates$InputFormat and DeleteDuplicates$DDRecordReader.
>
>> How can I make incremental runs? (Once the index is processed and I have
>> the results, how can I add more data to it so it does not start from the
>> beginning?)
>>
> As far as I know, there is no easy way to do this. Why do you keep your
> data as a Lucene index?
>
>> Best regards,
>>
>> -C.B.
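P.S. In case it helps anyone following the thread: the pair counting described above boils down to a grouped count per "src>dst" key, which is exactly what the aggregate word count job computes in its reduce step. Here is a minimal stand-alone sketch in plain Java of that tally (no Hadoop API; the `src>dst` line format is from the example above, and the `countPairs` helper name is just my own illustration):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PairCount {

    // Count occurrences of each "src>dst" pair; this is the same tally
    // the aggregate word count job would emit per key on the reduce side.
    static Map<String, Integer> countPairs(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String pair = line.trim();
            if (pair.isEmpty()) {
                continue; // skip blank lines in the dump file
            }
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("src1>dst2", "src2>dst2", "src1>dst2");
        countPairs(lines).forEach((pair, n) -> System.out.println(pair + "\t" + n));
        // prints:
        // src1>dst2    2
        // src2>dst2    1
    }
}
```

In the Hadoop version the map step would emit each pair with a count of 1 and the framework would group the keys; the `merge` call above stands in for that shuffle-and-sum.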