Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Peter W.
Howdy,
Your work is outstanding and will hopefully be adopted soon. The HDFS distributed Lucene index solves many of the dependencies introduced by achieving this another way using RMI, HTTP (serialized objects with servlets) or Tomcat balancing with MySQL databases, schemas and connection

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Ning Li
We welcome your input. Discussions are mainly on [EMAIL PROTECTED] now (a thread with the same title).
On 2/7/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> This is actually something we were planning on building into Nutch.
>
> Dennis

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Dennis Kubes
This is actually something we were planning on building into Nutch.
Dennis
Ning Li wrote:
> On 2/6/08, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> Our best work-around is to simply take a shard out of service during delivery of an updated index. This is obviously not a good solution.
> How many shar

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ted Dunning
We have quite a few serving the load, but if we are trying to update relatively often (say every 30 minutes), then having a server out of action for several minutes really hurts. The outage is that long because you have to
A) turn off traffic
B) wait for traffic to actually stop
C) move the mult

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
On 2/6/08, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Our best work-around is to simply take a shard out of service during delivery
> of an updated index. This is obviously not a good solution.
How many shard servers are serving each shard? If it's more than one, you can have the rest of the shard
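[Editor's sketch] The message above is cut off in the archive, so what follows is an assumption about where it was heading: when more than one server holds a shard, the servers can be updated one at a time while the others keep answering queries. The `ShardServer` class and `rolling_update` function are hypothetical names for illustration only; they are not part of Lucene, Hadoop, or the proposal under discussion.

```python
from dataclasses import dataclass


@dataclass
class ShardServer:
    """Hypothetical stand-in for one server holding a copy of a shard."""
    name: str
    index_version: int = 1
    in_service: bool = True


def rolling_update(replicas, new_version):
    """Update replicas one at a time.

    Returns the smallest number of replicas that were in service
    at any point during the update.
    """
    if len(replicas) < 2:
        # With a single server the shard must go offline during delivery.
        raise ValueError("need at least two replicas to keep the shard available")

    min_serving = len(replicas)
    for server in replicas:
        server.in_service = False              # stop traffic to this replica only
        serving = sum(r.in_service for r in replicas)
        min_serving = min(min_serving, serving)
        server.index_version = new_version     # deliver the updated index
        server.in_service = True               # put the replica back in rotation
    return min_serving


replicas = [ShardServer("a"), ShardServer("b"), ShardServer("c")]
lowest = rolling_update(replicas, new_version=2)
print(lowest, [r.index_version for r in replicas])
```

With three replicas, at most one is out of rotation at a time, so the shard stays queryable throughout. The cost is that replicas briefly serve different index versions.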

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ted Dunning
Very nice summary. One of the issues that we have had with multiple search servers is that on Linux, there can be substantial contention for disk I/O. This means that as a new index is being written, access to the current index can be stalled for very long periods of time (sometimes >10s). This

Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
There have been several proposals for a Lucene-based distributed index architecture.
1) Doug Cutting's "Index Server Project Proposal" at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
2) Solr's "Distributed Search" at http://wiki.apache.org/solr/DistributedSearch
3) Mark Bu