[snip]
>> Dogacan, you mentioned that you would like to work on Katta integration.
>> Could you shed some light on how this fits with the abstract indexing &
>> searching layer that we now have, and how distributed Solr fits into this
>> picture?

> I haven't yet given much thought to Katta integration. But basically,
> I am thinking of indexing newly-crawled documents as Lucene shards and
> uploading them to Katta for searching. This should be very possible
> with the new indexing system. So far, though, I haven't studied Katta
> closely, so I may be missing obvious stuff.

I've got some experience in this area, so let me know what questions,
if any, you've got.

But the basic approach is very simple - just create N indexes (one
per reducer), move them to HDFS, S3, or some other location where the
Katta master & slaves can all access the shards, and then use the
Katta "addIndex" command or supporting Java code to deploy the index.
> About distributed Solr: I would very much like to do this, and again
> I think it should be possible to do within Nutch. However, distributed
> Solr is ultimately uninteresting to me because (AFAIK) it doesn't have
> the reliability and high availability that Hadoop & HBase have, i.e.
> if a machine dies you lose that part of the index.
>
> Are there any projects going on that are live indexing systems like
> Solr, yet are backed by Hadoop HDFS like Katta?

Note that Katta doesn't use HDFS as a backing store - the shards are
copied to the local disks of the slaves for performance reasons.

There has been work on making Katta better support near-real-time
updating, versus the currently very batch-oriented approach. See the
Katta list for more details.
-- Ken
--
Ken Krugler
+1 530-210-6378