Re: multiple local indexes

2010-09-28 Thread Brent Palmer
Thanks for your comments, Jonathan.  Here is a brief overview of the
eGranary Platform that outlines why we need a way to bring multiple
indexes into one searchable collection.


http://www.widernet.org/egranary/info/multipleIndexes

Thanks,
Brent


On 9/28/2010 5:40 PM, Jonathan Rochkind wrote:

Honestly, I think just putting everything in the same index is your best bet.  Are you sure the 
"particular needs of your project" can't be served by one combined index?  You can certainly still 
query on just a portion of the index when needed using fq -- you can even create a request handler (or 
multiple request handlers) with "invariants" or "appends" so that all queries 
through that request handler have a fixed fq.



RE: multiple local indexes

2010-09-28 Thread Jonathan Rochkind
Honestly, I think just putting everything in the same index is your best bet.  
Are you sure the "particular needs of your project" can't be served by one 
combined index?  You can certainly still query on just a portion of the index 
when needed using fq -- you can even create a request handler (or multiple 
request handlers) with "invariants" or "appends" so that all queries 
through that request handler have a fixed fq. 
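
To illustrate the fq suggestion, here is a minimal sketch of building a query URL that searches only one portion of a combined index. The host, the field name "collection", and the value "medical" are assumptions made up for the example; they are not from the eGranary schema.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FilterQueryUrl {
    // Builds a select URL whose main query q is restricted by the filter query fq.
    static String build(String baseUrl, String q, String fq) {
        return baseUrl + "/select?q=" + enc(q) + "&fq=" + enc(fq);
    }

    // URL-encodes a single parameter value.
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical: documents carry a "collection" field naming their source.
        System.out.println(build("http://localhost:8983/solr",
                                 "malaria treatment", "collection:medical"));
    }
}
```

A request handler with a fixed fq would have the same effect without the client having to pass the parameter on every request.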

From: Brent Palmer [br...@widernet.org]
Sent: Tuesday, September 28, 2010 6:04 PM
To: solr-user@lucene.apache.org
Subject: multiple local indexes

In our application, we need to be able to search across multiple local
indexes.  We need this not so much for performance reasons, but because
of the particular needs of our project.  But the indexes, while sharing
the same schema, can be very different in terms of size and distribution
of documents.  By that I mean that some indexes may have a lot more
documents about some topic while others will have more documents about
other topics.  We want to be able to add documents to the individual
indexes as well.  I can provide more detail about our project if
necessary.  Thus, the Distributed Search feature with shards in
different cores seems to be an obvious solution except for the
limitation of distributed idf.

First, I want to make sure my understanding of the distributed idf
limitation is correct:  If your documents are spread across your shards
evenly, then the distribution of terms across the individual shards can
be assumed to be even enough not to matter.  If, as in our case, the
shards are not very uniform, then this limitation is magnified.  Even
though simplistic, do I have the basic idea?

We have hacked together something that allows us to read from multiple
indexes, but it isn't really a long-term solution.  It's just sort of
shoe-horned in there.   Here are some notes from the programmer who
worked on this:
   Two custom files: EgranaryIndexReaderFactory.java and
   EgranaryIndexReader.java

   EgranaryIndexReader.java
   No real work is done here. This class extends lucene.index.MultiReader
   and overrides the directory() and getVersion() methods inherited from
   IndexReader. These methods don't make sense for a MultiReader as they
   only return a single value. However, Solr expects Readers to have
   these methods. directory() was overridden to return a call to
   directory() on the first reader in the subreader list. The same was
   done for getVersion(). This hack makes any use of these methods by
   Solr somewhat pointless.

   EgranaryIndexReaderFactory.java
   Overrides the newReader(Directory indexDir, boolean readOnly) method.
   The expected behavior of this method is to construct a Reader from
   the index at indexDir. However, this method ignores indexDir and reads
   a list of indexDirs from the solrconfig.xml file. These indices are
   used to create a list of lucene.index.IndexReader instances. This list
   is then used to create the EgranaryIndexReader.

So the second question is: Does anybody have other ideas about how we
might solve this problem?  Is distributed search still our best bet?

Thanks for your thoughts!
Brent


multiple local indexes

2010-09-28 Thread Brent Palmer
In our application, we need to be able to search across multiple local 
indexes.  We need this not so much for performance reasons, but because 
of the particular needs of our project.  But the indexes, while sharing 
the same schema, can be very different in terms of size and distribution 
of documents.  By that I mean that some indexes may have a lot more 
documents about some topic while others will have more documents about 
other topics.  We want to be able to add documents to the individual 
indexes as well.  I can provide more detail about our project if 
necessary.  Thus, the Distributed Search feature with shards in 
different cores seems to be an obvious solution except for the 
limitation of distributed idf.


First, I want to make sure my understanding of the distributed idf 
limitation is correct:  If your documents are spread across your shards 
evenly, then the distribution of terms across the individual shards can 
be assumed to be even enough not to matter.  If, as in our case, the 
shards are not very uniform, then this limitation is magnified.  Even 
though simplistic, do I have the basic idea?
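
To make the concern concrete, here is a rough sketch that compares per-shard idf with the idf a single combined index would give. It assumes the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)); the shard sizes and document frequencies are made-up numbers, not measurements from our indexes.

```java
public class ShardIdfDemo {
    // Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // idf the same term would get if all shards were one combined index.
    static double globalIdf(long[] docFreqs, long[] numDocs) {
        long df = 0;
        long n = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            df += docFreqs[i];
            n += numDocs[i];
        }
        return idf(df, n);
    }

    public static void main(String[] args) {
        // Hypothetical shards: a large general index where the term is rare,
        // and a small topical index where the term is common.
        long[] numDocs = {1_000_000L, 10_000L};
        long[] docFreqs = {100L, 5_000L};
        for (int i = 0; i < numDocs.length; i++) {
            System.out.printf("shard %d idf = %.3f%n", i, idf(docFreqs[i], numDocs[i]));
        }
        System.out.printf("combined idf = %.3f%n", globalIdf(docFreqs, numDocs));
    }
}
```

With numbers like these, the same term is weighted far more heavily on the first shard than on the second, so scores merged from the two are not directly comparable. With evenly distributed documents the per-shard values would sit close to the combined one.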


We have hacked together something that allows us to read from multiple 
indexes, but it isn't really a long-term solution.  It's just sort of 
shoe-horned in there.   Here are some notes from the programmer who 
worked on this:
  Two custom files: EgranaryIndexReaderFactory.java and 
  EgranaryIndexReader.java

  EgranaryIndexReader.java
  No real work is done here. This class extends lucene.index.MultiReader 
  and overrides the directory() and getVersion() methods inherited from 
  IndexReader. These methods don't make sense for a MultiReader as they 
  only return a single value. However, Solr expects Readers to have 
  these methods. directory() was overridden to return a call to 
  directory() on the first reader in the subreader list. The same was 
  done for getVersion(). This hack makes any use of these methods by 
  Solr somewhat pointless.

  EgranaryIndexReaderFactory.java
  Overrides the newReader(Directory indexDir, boolean readOnly) method.
  The expected behavior of this method is to construct a Reader from 
  the index at indexDir. However, this method ignores indexDir and reads 
  a list of indexDirs from the solrconfig.xml file. These indices are 
  used to create a list of lucene.index.IndexReader instances. This list 
  is then used to create the EgranaryIndexReader.
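
The weakness of delegating to the first sub-reader can be shown with a small stand-alone sketch. The types below are plain-Java stand-ins invented for illustration; they are not the Lucene or Solr classes and not our actual code, and the paths are hypothetical.

```java
import java.util.List;

public class FirstReaderDelegationDemo {
    // Stand-in for the two reader methods discussed in the notes (NOT the Lucene API).
    interface VersionedReader {
        String directory();
        long getVersion();
    }

    // Hypothetical reader over a single index.
    static final class SingleReader implements VersionedReader {
        private final String dir;
        private long version;

        SingleReader(String dir, long version) {
            this.dir = dir;
            this.version = version;
        }

        public String directory() { return dir; }
        public long getVersion() { return version; }

        // Simulates an update to this index.
        void commit() { version++; }
    }

    // Mirrors the described workaround: answer with the first sub-reader's values.
    static final class CombinedReader implements VersionedReader {
        private final List<? extends VersionedReader> subReaders;

        CombinedReader(List<? extends VersionedReader> subReaders) {
            this.subReaders = subReaders;
        }

        public String directory() { return subReaders.get(0).directory(); }
        public long getVersion() { return subReaders.get(0).getVersion(); }
    }

    public static void main(String[] args) {
        SingleReader a = new SingleReader("/indexes/a", 1);
        SingleReader b = new SingleReader("/indexes/b", 1);
        CombinedReader combined = new CombinedReader(List.of(a, b));

        long before = combined.getVersion();
        b.commit(); // the second index changes...
        // ...but the combined reader still reports the first index's version.
        System.out.println("version changed: " + (before != combined.getVersion()));
    }
}
```

Any caller that relies on the version or directory to detect changes would miss updates to every index except the first, which is why the notes call these methods somewhat pointless.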


So the second question is: Does anybody have other ideas about how we 
might solve this problem?  Is distributed search still our best bet?


Thanks for your thoughts!
Brent