RE: No subsearcher in Lucene 3.3?

Joe MA Tue, 30 Aug 2011 09:30:19 -0700

Thanks for the replies.  Here is why I need the subreader (or subsearcher in 
earlier Lucene versions):


I have multiple collections of documents, say broken out by years (it's more 
complex than this, but this illustrates the use case):

Collection1 >>>  D:/some folder/2009/*.pdf     (lots of PDF files)
Collection2 >>>  D:/another folder/2010/*.pdf  (lots of different PDF files)

And so forth.  So in the example above, I would have two indices, one for each
year.  When I index, I store the *relative* path of each document as a field,
for example 'link:2009/file1.pdf' or 'link:2010/file1.pdf'.  I do not store the
full path to the files in the index.  This has a huge advantage because we can
move the documents to another file system, server, or path without rebuilding
the index.  I store the base path for each collection in a database, external
to the collection.  In the example above, Collection1 would have a base path
of "D:/some folder/".  Therefore, to actually access a document referenced in
a collection, you concatenate the base_path retrieved from the database with
the "link" field retrieved from the collection.  I would think this is a very
common approach.
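
To make the path-building step concrete, here is a minimal sketch in plain Java. The class name CollectionPaths, its methods, and the in-memory map standing in for the external database are all hypothetical and are not taken from the actual system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; every name here is illustrative only.
final class CollectionPaths {
    // Stand-in for the external database: collection id -> base path.
    private final Map<String, String> basePaths = new HashMap<String, String>();

    void register(String collectionId, String basePath) {
        basePaths.put(collectionId, basePath);
    }

    // Joins the collection's base path with the relative "link" value
    // that was stored in the index for a document.
    String resolve(String collectionId, String link) {
        String base = basePaths.get(collectionId);
        if (base == null) {
            throw new IllegalArgumentException("Unknown collection: " + collectionId);
        }
        return base.endsWith("/") ? base + link : base + "/" + link;
    }
}
```

Moving a collection then only means updating its registered base path; the stored "link" values stay untouched.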

When searching a single collection, no problem.  But if I want to search the 
two collections at the same time, I need to know which collection the hit came 
from so I can retrieve the base_path from the database.  These base_paths can 
be different.  As mentioned, this was trivial in Lucene 1.x and 2.x as I just 
grabbed the subsearcher from the result, which would for example return a 1 or 
2 indicating which of the two collections the result came from.  Then I can 
build the path to the file.  In other words, subsearcher gave me the foreign 
key I needed to map to additional external information associated with each 
index during a multisearch.  That is now gone in Lucene 3.3.

I guess a real simple solution is just to store a new field with each document 
uniquely identifying which collection.  So in the example above, I could create 
a new field "foreign_key_index"  for each document which would be "Collection1" 
or "Collection2" respectively.  This would surely work, but it would break 
backwards compatibility of my system and would require me to rebuild every 
collection.  It also seems like a lot of work for something so simple.

If there is another way to do this, please advise.  Thanks in advance and much 
appreciated.

- JMA



-----Original Message-----
From: Uwe Schindler [mailto:[email protected]] 
Sent: Monday, August 29, 2011 8:05 PM
To: [email protected]
Subject: RE: No subsearcher in Lucene 3.3?

Why do you need to know the subreader? If you want to get the document's stored 
fields, use the MultiReader.

If you really want to know the subreader, use this:
http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/util/ReaderUtil.html#subReader(int, org.apache.lucene.index.IndexReader)

But this is "somewhat slow", so don't use it in inner loops.
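
The idea behind such a lookup can be sketched without any Lucene classes. A reader composed of several sub-indexes numbers its documents consecutively, so sub-index i starts at the sum of the maxDoc values of the sub-indexes before it, and a hit's global document ID maps back to a sub-index by a binary search over those start offsets. The class SubIndexLookup below is a hypothetical standalone illustration, not Lucene code; it assumes you pass the maxDoc of each sub-index in the same order the readers were given to the MultiReader.

```java
import java.util.Arrays;

// Hypothetical standalone illustration; not part of Lucene.
final class SubIndexLookup {
    private final int[] docStarts; // docStarts[i] = first global doc ID of sub-index i
    private final int totalDocs;

    // maxDocs[i] is the maxDoc of sub-index i, in MultiReader order (assumption).
    SubIndexLookup(int[] maxDocs) {
        docStarts = new int[maxDocs.length];
        int start = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            docStarts[i] = start;
            start += maxDocs[i];
        }
        totalDocs = start;
    }

    // Returns the 0-based position of the sub-index that contains globalDoc.
    int subIndex(int globalDoc) {
        if (globalDoc < 0 || globalDoc >= totalDocs) {
            throw new IllegalArgumentException("doc out of range: " + globalDoc);
        }
        int pos = Arrays.binarySearch(docStarts, globalDoc);
        if (pos < 0) {
            // Not an exact start offset: take the entry just before the insertion point.
            return -pos - 2;
        }
        // Exact match: skip empty sub-indexes that share the same start offset.
        while (pos + 1 < docStarts.length && docStarts[pos + 1] == globalDoc) {
            pos++;
        }
        return pos;
    }
}
```

The returned position can serve as the foreign key into external per-collection data. Because the start offsets are computed once, the per-hit cost is a single binary search. The same lookup could presumably be applied once per segment to the docBase that a Collector receives in setNextReader, rather than once per hit.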

Devon suggested:
> If I'm understanding your question correctly, in the Collector, you are told 
> which IndexReader you are working with when the setNextReader method is 
> called. Hopefully that helps.

This does not work as expected, because the Collector gets the lowest-level 
readers, which are in fact sub-sub-readers (each single IndexReader itself 
consists of several "SegmentReaders", unless you have optimized sub-indexes).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: Joseph MarkAnthony [mailto:[email protected]]
> Sent: Monday, August 29, 2011 8:54 PM
> To: [email protected]
> Subject: No subsearcher in Lucene 3.3?
> 
> Greetings,
>     In the past (Lucene version 2.x) I successfully used
> MultiSearcher.subsearcher() to identify the searchable within a 
> MultiSearcher to which a hit belonged.
> 
> In moving to Lucene 3.3, MultiSearcher is now deprecated, and I am 
> trying to create a standard IndexSearcher over a MultiReader.  I 
> haven't gotten this to work yet but it appears to be the correct 
> approach.  However, I cannot find any corresponding "subsearcher" 
> method that could identify which subreader is the one that finds the hit.
> 
> For example, it used to be straightforward:
> 
> Create a MultiSearcher over several Searchables, and call 
> MultiSearcher.subsearcher to get the searchable that holds each search hit.
> 
> Now, I am creating an IndexSearcher over a MultiReader, which is created over
> an array of IndexReaders.   So when I get a hit, what's the best way to
> determine which of the several subReaders the hit came from?
> 
> Thanks in advance,
> JMA
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
