[snip]
>> Dogacan, you mentioned that you would like to work on Katta integration.
>> Could you shed some light on how this fits with the abstract indexing &
>> searching layer that we now have, and how distributed Solr fits into this
>> picture?

> I haven't yet given much thought to Katta integration. But basically,
> I am thinking of indexing newly-crawled documents as Lucene shards and
> uploading them to Katta for searching. This should be very possible
> with the new indexing system. So far, though, I haven't studied Katta
> closely, so I may be missing obvious stuff.

I've got some experience in this area, so let me know what questions,
if any, you've got.

But the basic approach is very simple - just create N indexes (one
per reducer), move them to HDFS, S3, or some other location where the
Katta master & slaves can all access the shards, and then use the
Katta "addIndex" command or supporting Java code to deploy the index.
> About distributed Solr: I would very much like to do this, and again
> I think it should be possible to do within Nutch. However, distributed
> Solr is ultimately uninteresting to me because (AFAIK) it doesn't have
> the reliability and high availability that Hadoop & HBase have, i.e.
> if a machine dies you lose that part of the index.
>
> Are there any projects going on that are live indexing systems like
> Solr, yet are backed by Hadoop HDFS like Katta?

Note that Katta doesn't use HDFS as a backing store - the shards are
copied to the local disks of the slaves for performance reasons.

There has been work on making Katta better support near-real-time
updating, versus the currently very batch-oriented approach. See the
Katta list for more details.
-- Ken
--
Ken Krugler
+1 530-210-6378