Cam Bazz wrote:
Hello,

When I use a custom input format, as in the nutch project, do I have to
keep my index in DFS or on a regular file system?

You have to ensure that your indexes are accessible by the map/reduce tasks, i.e. by using hdfs, s3, nfs, kfs, etc.
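For example, the scheme on the input path is what tells the tasks which file system to read from. A tiny sketch (the host name and paths below are made up), using the old mapred API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class IndexLocation {
  public static void main(String[] args) {
    JobConf conf = new JobConf(IndexLocation.class);
    // The URI scheme decides which file system the map/reduce tasks
    // read the index from.
    FileInputFormat.setInputPaths(conf,
        new Path("hdfs://namenode:9000/user/cam/indexes"), // index copied into dfs
        new Path("file:///mnt/shared/indexes"));           // or a directory every node mounts over nfs
    // s3:// and kfs:// paths work the same way, given the matching
    // file system settings in hadoop-site.xml.
  }
}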
By the way, are there any alternatives to nutch?
Yes, of course. There are all sorts of open source crawlers / indexers.

Best Regards


-C.B.

On Fri, Jun 27, 2008 at 10:08 AM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

Cam Bazz wrote:

hello,

I have a lucene index storing documents which hold src and dst words.
Word pairs may repeat (it is a multigraph).

I want to use hadoop to count how many of the same word pairs there are. I
have looked at the aggregate word count example, and I understand that if I
make a txt file such as

src1>dst2
src2>dst2
src1>dst2

..

and use something similar to the aggregate word count example, I will get
the result desired.
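Roughly, I am thinking of something in the style of the standard word count
job, keyed on the whole "src>dst" line. A rough sketch (class names are made
up, it uses the old mapred API, and it assumes the pairs have already been
dumped to a text file, one pair per line):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PairCount {

  // Emits each "src>dst" line as the key with a count of 1.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable key, Text line,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      out.collect(new Text(line.toString().trim()), ONE);
    }
  }

  // Sums the counts of each distinct pair.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PairCount.class);
    conf.setJobName("paircount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // the dumped pairs on dfs
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}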

Now the questions: how can I hook up my lucene index to hadoop? Is there a
better way than dumping the index to a text file with >'s, copying this to
dfs, and getting the results back?


Yes, you can implement an InputFormat to read from the lucene index. You
can use the implementation in the nutch project, the classes
DeleteDuplicates$InputFormat and DeleteDuplicates$DDRecordReader, as a starting point.
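As a rough illustration (this is not the nutch code itself; the class name and
the "src"/"dst" field names are made up, it assumes a 2.x lucene, and it
assumes each index directory is readable from every task, e.g. over nfs), such
an InputFormat could look like:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class LuceneIndexInputFormat extends FileInputFormat<IntWritable, Text> {

  // One split per index directory: a lucene index cannot be cut in the
  // middle, so each input path goes to a single map task as a whole.
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    Path[] indexDirs = FileInputFormat.getInputPaths(job);
    InputSplit[] splits = new InputSplit[indexDirs.length];
    for (int i = 0; i < indexDirs.length; i++) {
      splits[i] = new FileSplit(indexDirs[i], 0, Long.MAX_VALUE, (String[]) null);
    }
    return splits;
  }

  @Override
  public RecordReader<IntWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    Path indexDir = ((FileSplit) split).getPath();
    // Assumes the directory is visible on the local file system (or nfs) of
    // every task tracker; for an index kept in dfs you would open it through
    // a lucene Directory backed by hadoop, as nutch's FsDirectory does.
    final IndexReader reader = IndexReader.open(indexDir.toUri().getPath());
    final int maxDoc = reader.maxDoc();

    return new RecordReader<IntWritable, Text>() {
      private int doc = 0;

      public boolean next(IntWritable key, Text value) throws IOException {
        while (doc < maxDoc && reader.isDeleted(doc)) doc++;  // skip deleted docs
        if (doc >= maxDoc) return false;
        Document d = reader.document(doc);
        key.set(doc);
        // "src" and "dst" are assumed to be stored fields of the documents.
        value.set(d.get("src") + ">" + d.get("dst"));
        doc++;
        return true;
      }

      public IntWritable createKey() { return new IntWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return doc; }
      public float getProgress() { return maxDoc == 0 ? 1.0f : (float) doc / maxDoc; }
      public void close() throws IOException { reader.close(); }
    };
  }
}

With something like this the pairs can be fed straight into the counting job,
so the intermediate text dump is not needed.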

How can I make incremental runs? (Once the index has been processed and I
have the results, how can I add more data to it so the job does not start
from the beginning?)


As far as I know, there is no easy way to do this. Why do you keep your data
as a lucene index?

Best regards,

-C.B.



