Re: multiple local indexes
Thanks for your comments, Jonathan. Here is a brief overview of the eGranary Platform that outlines why we need a way to bring multiple indexes into one searchable collection:

http://www.widernet.org/egranary/info/multipleIndexes

Thanks,
Brent

On 9/28/2010 5:40 PM, Jonathan Rochkind wrote:
RE: multiple local indexes
Honestly, I think just putting everything in the same index is your best bet. Are you sure the "particular needs of your project" can't be served by one combined index?

You can certainly still query on just a portion of the index when needed using fq. You can even create a request handler (or multiple request handlers) with "invariants" or "appends" to force all queries through that request handler to have a fixed fq.

From: Brent Palmer [br...@widernet.org]
Sent: Tuesday, September 28, 2010 6:04 PM
To: solr-user@lucene.apache.org
Subject: multiple local indexes
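To make the fq suggestion in the reply above concrete, here is a minimal sketch of a client building a select URL that restricts a query to one portion of a combined index. The host, the `collection` field, and its value are assumptions for illustration only; nothing in the thread says how the sub-collections would be tagged in the schema.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FilterQueryUrl {
    /**
     * Builds a Solr select URL with a main query (q) and a filter query (fq).
     * The filter narrows the result set without contributing to the score.
     */
    static String selectUrl(String solrBase, String q, String fq) {
        try {
            return solrBase + "/select?q=" + URLEncoder.encode(q, "UTF-8")
                    + "&fq=" + URLEncoder.encode(fq, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical: every document carries a "collection" field naming its source index.
        System.out.println(selectUrl("http://localhost:8983/solr", "clean water", "collection:health"));
    }
}
```

If the fq were instead fixed in a request handler's "invariants" section, as the reply suggests, the client would not need to send it at all and could not override it.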
multiple local indexes
In our application, we need to be able to search across multiple local indexes. We need this not so much for performance reasons as because of the particular needs of our project. The indexes share the same schema but can be very different in size and in the distribution of documents. By that I mean that some indexes may have a lot more documents about one topic while others have more documents about other topics. We also want to be able to add documents to the individual indexes. I can provide more detail about our project if necessary.

Thus, the Distributed Search feature with shards in different cores seems to be an obvious solution, except for the limitation of distributed idf. First, I want to make sure my understanding of the distributed idf limitation is correct: if your documents are spread across your shards evenly, then the distribution of terms across the individual shards can be assumed to be even enough not to matter. If, as in our case, the shards are not very uniform, then this limitation is magnified. Even though simplistic, do I have the basic idea?

We have hacked together something that allows us to read from multiple indexes, but it isn't really a long-term solution. It's just sort of shoe-horned in there. Here are some notes from the programmer who worked on this:

Two custom files: EgranaryIndexReaderFactory.java and EgranaryIndexReader.java

EgranaryIndexReader.java
No real work is done here. This class extends lucene.index.MultiReader and overrides the directory() and getVersion() methods inherited from IndexReader. These methods don't make sense for a MultiReader, as they only return a single value. However, Solr expects Readers to have these methods. directory() was overridden to return the result of calling directory() on the first reader in the subreader list. The same was done for getVersion(). This hack makes any use of these methods by Solr somewhat pointless.
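The delegation described for EgranaryIndexReader can be modelled in a few lines. This is only a sketch of the pattern, not the actual class: `Reader`, `SingleReader`, and `CombinedReader` are hypothetical stand-ins written for this example and are not Lucene types. It shows why answering from the first sub-reader makes the version uninformative: a change in any other index goes unnoticed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FirstReaderDelegation {
    /** Stand-in for the two IndexReader methods in question; NOT the Lucene class. */
    interface Reader {
        String directory();
        long getVersion();
    }

    /** Hypothetical reader over a single index directory. */
    static final class SingleReader implements Reader {
        private final String dir;
        private final long version;

        SingleReader(String dir, long version) {
            this.dir = dir;
            this.version = version;
        }

        @Override public String directory() { return dir; }
        @Override public long getVersion() { return version; }
    }

    /** Combines several readers but answers both calls from the first one only. */
    static final class CombinedReader implements Reader {
        private final List<Reader> subReaders;

        CombinedReader(List<Reader> subReaders) {
            if (subReaders.isEmpty()) {
                throw new IllegalArgumentException("need at least one sub-reader");
            }
            this.subReaders = new ArrayList<>(subReaders);
        }

        @Override public String directory() { return subReaders.get(0).directory(); }
        @Override public long getVersion() { return subReaders.get(0).getVersion(); }
    }

    public static void main(String[] args) {
        Reader a = new SingleReader("/indexes/a", 7L);
        Reader before = new CombinedReader(Arrays.asList(a, new SingleReader("/indexes/b", 3L)));
        // Index b receives new documents, so its own version moves on.
        Reader after = new CombinedReader(Arrays.asList(a, new SingleReader("/indexes/b", 4L)));
        // The combined reader still reports the version of index a in both cases.
        System.out.println("version unchanged: " + (before.getVersion() == after.getVersion()));
    }
}
```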
EgranaryIndexReaderFactory.java
Overrides the newReader(Directory indexDir, boolean readOnly) method. The expected behavior of this method is to construct a Reader from the index at indexDir. However, this method ignores indexDir and reads a list of indexDirs from the solrconfig.xml file. These indexes are used to create a list of lucene.index.IndexReader instances. This list is then used to create the EgranaryIndexReader.

So the second question is: does anybody have other ideas about how we might solve this problem? Is distributed search still our best bet?

Thanks for your thoughts!
Brent
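As a worked illustration of the first question, the sketch below compares per-shard idf with the idf a single combined index would give. It assumes the classic Lucene-style formula 1 + ln(numDocs / (docFreq + 1)) and assumes each shard scores with its own local counts, which is what the distributed idf limitation implies. The document counts are invented for the example.

```java
class IdfSkewSketch {
    /** Classic Lucene-style inverse document frequency: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical counts for one term in two very uneven shards.
        long docsA = 1_000, dfA = 500;     // small shard where the term is common
        long docsB = 100_000, dfB = 50;    // large shard where the term is rare

        double localA = idf(dfA, docsA);
        double localB = idf(dfB, docsB);
        double combined = idf(dfA + dfB, docsA + docsB);

        System.out.printf("shard A local idf:  %.3f%n", localA);
        System.out.printf("shard B local idf:  %.3f%n", localB);
        System.out.printf("combined index idf: %.3f%n", combined);
    }
}
```

With counts like these the same term is weighted much more heavily in shard B than in shard A, so scores from the two shards are not directly comparable when the results are merged. With evenly distributed shards the three values would be close and the effect would mostly wash out.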