Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
Am 12.04.2013 20:08, schrieb SUJIT PAL:
 Hi Carsten,
 
 Why not use your idea of the BooleanQuery but wrap it in a Filter instead? 
 Since you are not doing any scoring (only filtering), the max boolean clauses 
 limit should not apply to a filter.

Hi Sujit,
thanks for your suggestion! I wasn't aware that the max clause limit
does not apply to a BooleanQuery wrapped in a filter. I suppose the
ideal way would be to use a BooleanFilter rather than a QueryWrapperFilter,
right?

However, I am also not sure how to apply a filter in my use case because
I perform a SpanQuery. Although SpanQuery#getSpans() does take a Bits
object as an argument (acceptDocs), I haven't been able to figure out
how to generate this Bits object correctly from a Filter object.

Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
Am 15.04.2013 11:27, schrieb Uwe Schindler:

Hi again,

 You are somehow misusing acceptDocs and DocIdSet here, so you have
 to take care, semantics are different:
 - For acceptDocs null means all documents allowed - no deleted
 documents
 - For DocIdSet null means no documents matched

 Okay, as described above, I would now pass either the result of
 getLiveDocs() or Bits.MatchAllDocuments() as the acceptDocs argument to
 getDocIdSet():

 Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
 AtomicReaderContext atomic = ...
 ChainedFilter filter = ...
 
 You just pass getLiveDocs(), no null check needed. Using your code would 
 bring a slowdown for indexes without deletions.

This makes sense to me, but now I get zero matches in all searches using
the filter. I am pondering this remark in the documentation of
Filter.getDocIdSet(AtomicReaderContext context, Bits acceptDocs):
acceptDocs - Bits that represent the allowable docs to match (typically
deleted docs but possibly filtering other documents)

I understand that getLiveDocs() returns the bit set that represents
NON-deleted documents, which seems to match the first part of the
description (allowable docs). However, why does it say in brackets
'typically deleted docs'? I had ignored this so far, but as I get zero
results now, this might be relevant.

I am also thinking about how to possibly make use of a
BitsFilteredDocIdSet in the following kind:

ChainedFilter filter = ...
AtomicReaderContext atomic = ...

Bits alldocs = atomic.reader().getLiveDocs();
DocIdSet docids = filter.getDocIdSet(atomic, alldocs);
BitsFilteredDocIdSet filtered = new BitsFilteredDocIdSet(docids, alldocs);
Spans luceneSpans = sq.getSpans(atomic, filtered.bits(), termContexts);

However, the documentation of the constructor public
BitsFilteredDocIdSet(DocIdSet innerSet, Bits acceptDocs) does not make
it clear to me whether I am applying the arguments correctly. I
especially fail to understand the acceptDocs argument again:
acceptDocs - Allowed docs, all docids not in this set will not be
returned by this DocIdSet

Would this be the correct way to apply a filter on a SpanQuery?
Thanks!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




RE: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Uwe Schindler
Hi,

 Hi again,
 
  You are somehow misusing acceptDocs and DocIdSet here, so you
 have
  to take care, semantics are different:
  - For acceptDocs null means all documents allowed - no deleted
  documents
  - For DocIdSet null means no documents matched
 
  Okay, as described above, I would now pass either the result of
  getLiveDocs() or Bits.MatchAllDocuments() as the acceptDocs argument
  to
  getDocIdSet():
 
  Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
  AtomicReaderContext atomic = ...
  ChainedFilter filter = ...
 
  You just pass getLiveDocs(), no null check needed. Using your code would
 bring a slowdown for indexes without deletions.
 
 This makes sense to me, but now I get zero matches in all searches using the
 filter. I am pondering this remark in the documentation of
 Filter.getDocIdSet(AtomicReaderContext context, Bits acceptDocs):
 acceptDocs - Bits that represent the allowable docs to match (typically
 deleted docs but possibly filtering other documents)

This just means that you can pass the liveDocs obtained from AtomicReader (live == 
the inverse of deleted docs), but you can also pass any other Bits implementation 
that may remove more documents from the results. This is what you are doing with spans.

Passing NULL means all documents are allowed; if this were not the case, 
whole Lucene queries and filters would not work at all. So if you get 0 docs, 
you must have missed something else; if not, your filter may be behaving 
wrongly. Look at e.g. FilteredQuery, IndexSearcher or any other query in 
Lucene that handles acceptDocs - those pass getLiveDocs() down. If the 
liveDocs are null, that means all documents are allowed. The javadocs on 
Scorer/Filter/... should be clearer about this. Can you open an issue about 
the Javadocs?
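For illustration, the pass-down described here looks roughly like the sketch below (Lucene 4.x API; the class name FilterConsumer is made up, and this is not the actual IndexSearcher code). Note the asymmetry: a null acceptDocs means "no deletions, everything allowed", while a null DocIdSet means "nothing matched".

```java
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;

// Sketch: consume a Filter the way FilteredQuery/IndexSearcher do.
public final class FilterConsumer {
  public static void collectMatches(Filter filter, AtomicReaderContext context)
      throws IOException {
    Bits liveDocs = context.reader().getLiveDocs(); // may be null: no deletions
    DocIdSet set = filter.getDocIdSet(context, liveDocs);
    if (set == null) {
      return; // filter matched nothing in this segment
    }
    DocIdSetIterator it = set.iterator();
    if (it == null) {
      return; // an empty DocIdSet may also return a null iterator
    }
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS;
         doc = it.nextDoc()) {
      // process the matching segment-local doc id
    }
  }
}
```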

 I understand that getLiveDocs() returns the bit set that represents
 NON-deleted documents, which seems to match the first part of the
 description (allowable docs). However, why does it say in brackets 'typically
 deleted docs'? I had ignored this so far, but as I get zero results now, this
 might be relevant.

See above.

 I am also thinking about how to possibly make use of a BitsFilteredDocIdSet
 in the following kind:
 
 ChainedFilter filter = ...
 AtomicReaderContext atomic = ...
 
 Bits alldocs = atomic.reader().getLiveDocs(); DocIdSet docids =
 filter.getDocIdSet(atomic, alldocs); BitsFilteredDocIdSet filtered = new
 BitsFilteredDocIdSet(docids, alldocs); Spans luceneSpans =
 sq.getSpans(atomic, filtered.bits(), termContexts);
 
 However, the documentation of the constructor public
 BitsFilteredDocIdSet(DocIdSet innerSet, Bits acceptDocs) does not make it
 clear to me whether I am applying the arguments correctly. I especially
 fail to understand the acceptDocs argument again:
 acceptDocs - Allowed docs, all docids not in this set will not be returned by
 this DocIdSet

You should use BitsFilteredDocIdSet.wrap(), the ctor does not do null checks.

 Would this be the correct way to apply a filter on a SpanQuery?

new FilteredQuery(SpanQuery,Filter)?
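Both suggestions can be sketched as follows (Lucene 4.x API; variable names follow the thread, and the class and method names are made up for illustration):

```java
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.BitsFilteredDocIdSet;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.util.Bits;

public final class SpanFilterSketch {

  // Variant 1: the static factory instead of the ctor; wrap() is null-safe
  // for both the DocIdSet ("no matches") and the liveDocs ("no deletions").
  static DocIdSet filteredDocIdSet(Filter filter, AtomicReaderContext atomic)
      throws IOException {
    Bits liveDocs = atomic.reader().getLiveDocs();
    DocIdSet docIds = filter.getDocIdSet(atomic, liveDocs);
    return BitsFilteredDocIdSet.wrap(docIds, liveDocs);
  }

  // Variant 2: let FilteredQuery combine the SpanQuery with the Filter,
  // avoiding manual Bits handling entirely (scored search, not getSpans()).
  static TopDocs searchFiltered(IndexSearcher searcher, SpanQuery sq,
      Filter filter) throws IOException {
    Query q = new FilteredQuery(sq, filter);
    return searcher.search(q, 10);
  }
}
```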

 Thanks!
 Carsten
 
 --
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation Next Generation Corpus
 Analysis Platform
 





Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Carsten Schnober
Am 15.04.2013 13:43, schrieb Uwe Schindler:

Hi,

 Passing NULL means all documents are allowed; if this were not the case, 
 whole Lucene queries and filters would not work at all. So if you get 0 docs, 
 you must have missed something else; if not, your filter may be behaving 
 wrongly. Look at e.g. FilteredQuery, IndexSearcher or any other query in 
 Lucene that handles acceptDocs - those pass getLiveDocs() down. If the 
 liveDocs are null, that means all documents are allowed. The javadocs on 
 Scorer/Filter/... should be clearer about this. Can you open an issue about 
 the Javadocs?

I'll open an issue as soon as I have understood how this should be
corrected. :)
I think I've pinpointed my problem: I use a TermsFilter, get a DocIdSet
with TermsFilter.getDocIdSet(atomic, atomic.reader().getLiveDocs()), and
eventually retrieve a Bits object from that with DocIdSet.bits().
However, the latter always returns null. Wrapping the TermsFilter into a
CachingWrapperFilter doesn't change that. I was using a
QueryWrapperFilter before which would give me a DocIdSet object from
which I could get a proper Bits object to pass to SpanQuery.getSpans().
Is there any way I could extract a Bits object from a TermsFilter?
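One common workaround (a sketch, not from this thread; the helper name toBits is made up): DocIdSet.bits() is optional random access, and TermsFilter's set does not provide it, but the iterator can be drained into a FixedBitSet, which implements Bits and can then be passed to SpanQuery.getSpans().

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

public final class DocIdSetToBits {
  public static Bits toBits(DocIdSet set, int maxDoc) throws IOException {
    FixedBitSet result = new FixedBitSet(maxDoc); // all bits initially clear
    if (set != null) {
      DocIdSetIterator it = set.iterator();       // may be null for empty sets
      if (it != null) {
        for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = it.nextDoc()) {
          result.set(doc);                        // mark matching doc ids
        }
      }
    }
    return result;
  }
}
```

The resulting Bits could then serve as the acceptDocs argument, e.g. `sq.getSpans(atomic, DocIdSetToBits.toBits(docids, atomic.reader().maxDoc()), termContexts)`.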


 Would this be the correct way to apply a filter on a SpanQuery?
 
 new FilteredQuery(SpanQuery,Filter)?

Okay, I formulated the question badly. I need to call
SpanQuery.getSpans() because I have to process the resulting Spans
object. Therefore, I actually meant: what is the general way to generate
a Bits object from a Filter that can be used as the 'acceptDocs' argument?

Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread SUJIT PAL
Hi Uwe,

Thanks for the info, I was under the impression that it didn't... I got this 
info (that filters don't have a limit because they are not scoring) from a 
document like the one below. Can't say this is the exact doc because it's been 
a while since I saw it, though.

http://searchhub.org/2009/06/08/bringing-the-highlighter-back-to-wildcard-queries-in-solr-14/


As a response to this performance pitfall on very large indices’s (and the 
infamous TooManyClauses exception), new queries were developed that relied on a 
new Query class called ConstantScoreQuery. ConstantScoreQuerys accept a filter 
of matching documents and then score with a constant value equal to the boost. 
Depending on the qualities of your index, this method can be faster than the 
Boolean expansion method, and more importantly, does not suffer from 
TooManyClauses exceptions. Rather than matching and scoring n BooleanQuery 
clauses (potentially thousands of clauses), a single filter is enumerated and 
then traveled for scoring. On the other hand, constructing and scoring with a 
BooleanQuery containing a few clauses is likely to be much faster than 
constructing and traveling a Filter.


-sujit

On Apr 15, 2013, at 1:04 AM, Uwe Schindler wrote:

 The limit also applies for filters. If you have a list of terms ORed 
 together, the fastest way is not to use a BooleanQuery at all, but instead a 
 TermsFilter (which has no limits).
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
 -Original Message-
 From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
 Sent: Monday, April 15, 2013 9:53 AM
 To: java-user@lucene.apache.org
 Subject: Re: Statically store sub-collections for search (faceted search?)
 
 Am 12.04.2013 20:08, schrieb SUJIT PAL:
 Hi Carsten,
 
 Why not use your idea of the BooleanQuery but wrap it in a Filter instead?
 Since you are not doing any scoring (only filtering), the max boolean clauses
 limit should not apply to a filter.
 
 Hi Sujit,
 thanks for your suggestion! I wasn't aware that the max clause limit does not
 apply to a BooleanQuery wrapped in a filter. I suppose the ideal way would
 be to use a BooleanFilter but not a QueryWrapperFilter, right?
 
 However, I am also not sure how to apply a filter in my use case because I
 perform a SpanQuery. Although SpanQuery#getSpans() does take a Bits
 object as an argument (acceptDocs), I haven't been able to figure out how to
 generate this Bits object correctly from a Filter object.
 
 Best,
 Carsten
 
 --
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation Next Generation Corpus
 Analysis Platform
 
 





RE: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread Uwe Schindler
Hi,

Original Message-
 From: Sujit Pal [mailto:sujitatgt...@gmail.com] On Behalf Of SUJIT PAL
 Sent: Monday, April 15, 2013 9:43 PM
 To: java-user@lucene.apache.org
 Subject: Re: Statically store sub-collections for search (faceted search?)
 
 Hi Uwe,
 
 Thanks for the info, I was under the impression that it didn't... I got this 
 info
 (that filters don't have a limit because they are not scoring) from a document
 like the one below. Can't say this is the exact doc because it's been a while
 since I saw that, though.
 
 http://searchhub.org/2009/06/08/bringing-the-highlighter-back-to-wildcard-
 queries-in-solr-14/
 
 
 As a response to this performance pitfall on very large indices’s (and the
 infamous TooManyClauses exception), new queries were developed that
 relied on a new Query class called ConstantScoreQuery.
 ConstantScoreQuerys accept a filter of matching documents and then score
 with a constant value equal to the boost. Depending on the qualities of your
 index, this method can be faster than the Boolean expansion method, and
 more importantly, does not suffer from TooManyClauses exceptions. Rather
 than matching and scoring n BooleanQuery clauses (potentially thousands of
 clauses), a single filter is enumerated and then traveled for scoring. On the
 other hand, constructing and scoring with a BooleanQuery containing a few
 clauses is likely to be much faster than constructing and traveling a Filter.
 

This is true, but you misunderstood it: this is about MultiTermQuery (the 
superclass of WildcardQuery, FuzzyQuery, and the range queries). Those queries 
are not native Lucene queries, so they rewrite to basic/native queries. In 
earlier Lucene versions, wildcards were always rewritten to a BooleanQuery with 
many TermQuerys (one for each term that matches the wildcard), leading to the 
problem with too many terms. This is still the case, but only within limits 
(this mode is only used if the wildcard expands to few terms). Those 
BooleanQueries are then used with ConstantScoreQuery(Query).
The above text talks about another mode (which is used for many terms today): 
*no* BooleanQuery is built at all; instead, all matching terms' documents are 
marked in a BitSet, and this BitSet is used with a Filter to construct a 
different query type: ConstantScoreQuery(Filter). The BooleanQuery max clause 
count does not apply, because no BooleanQuery is involved in the whole process. 
If you use ConstantScoreQuery(BooleanQuery), the limit still applies, but not 
for ConstantScoreQuery(internalWildcardFilter).
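The two rewrite modes described above can also be selected explicitly on any MultiTermQuery (Lucene 4.x; field and pattern below are made-up examples):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public final class RewriteModeSketch {
  public static Query wildcardWithoutClauseLimit() {
    WildcardQuery wq = new WildcardQuery(new Term("text", "lucen*"));

    // Filter-backed rewrite: no BooleanQuery is built, so
    // BooleanQuery.maxClauseCount never applies.
    wq.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);

    // BooleanQuery-backed alternative: fast for few terms, but
    // subject to the clause limit:
    // wq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

    return wq;
  }
}
```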

Uwe

 On Apr 15, 2013, at 1:04 AM, Uwe Schindler wrote:
 
  The limit also applies for filters. If you have a list of terms ORed 
  together,
 the fastest way is not to use a BooleanQuery at all, but instead a TermsFilter
 (which has no limits).
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
  Sent: Monday, April 15, 2013 9:53 AM
  To: java-user@lucene.apache.org
  Subject: Re: Statically store sub-collections for search (faceted
  search?)
 
  Am 12.04.2013 20:08, schrieb SUJIT PAL:
  Hi Carsten,
 
  Why not use your idea of the BooleanQuery but wrap it in a Filter
 instead?
  Since you are not doing any scoring (only filtering), the max boolean
  clauses limit should not apply to a filter.
 
  Hi Sujit,
  thanks for your suggestion! I wasn't aware that the max clause limit
   does not apply to a BooleanQuery wrapped in a filter. I suppose the
  ideal way would be to use a BooleanFilter but not a QueryWrapperFilter,
 right?
 
  However, I am also not sure how to apply a filter in my use case
  because I perform a SpanQuery. Although SpanQuery#getSpans() does
  take a Bits object as an argument (acceptDocs), I haven't been able
  to figure out how to generate this Bits object correctly from a Filter
 object.
 
  Best,
  Carsten
 
  --
  Institut für Deutsche Sprache | http://www.ids-mannheim.de
  Projekt KorAP | http://korap.ids-mannheim.de
  Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
  Korpusanalyseplattform der nächsten Generation Next Generation
 Corpus
  Analysis Platform
 
 
 
 



Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread SUJIT PAL
Hi Uwe,

I see, that makes sense; thanks very much for the info. Sorry about giving you 
the wrong info, Carsten.

-sujit

On Apr 15, 2013, at 1:06 PM, Uwe Schindler wrote:

 Hi,
 
 Original Message-
 From: Sujit Pal [mailto:sujitatgt...@gmail.com] On Behalf Of SUJIT PAL
 Sent: Monday, April 15, 2013 9:43 PM
 To: java-user@lucene.apache.org
 Subject: Re: Statically store sub-collections for search (faceted search?)
 
 Hi Uwe,
 
 Thanks for the info, I was under the impression that it didn't... I got this 
 info
 (that filters don't have a limit because they are not scoring) from a 
 document
 like the one below. Can't say this is the exact doc because it's been a while
 since I saw that, though.
 
 http://searchhub.org/2009/06/08/bringing-the-highlighter-back-to-wildcard-
 queries-in-solr-14/
 
 
 As a response to this performance pitfall on very large indices’s (and the
 infamous TooManyClauses exception), new queries were developed that
 relied on a new Query class called ConstantScoreQuery.
 ConstantScoreQuerys accept a filter of matching documents and then score
 with a constant value equal to the boost. Depending on the qualities of your
 index, this method can be faster than the Boolean expansion method, and
 more importantly, does not suffer from TooManyClauses exceptions. Rather
 than matching and scoring n BooleanQuery clauses (potentially thousands of
 clauses), a single filter is enumerated and then traveled for scoring. On the
 other hand, constructing and scoring with a BooleanQuery containing a few
 clauses is likely to be much faster than constructing and traveling a Filter.
 
 
 This is true, but you misunderstood it: this is about MultiTermQuery (the 
 superclass of WildcardQuery, FuzzyQuery, and the range queries). Those queries 
 are not native Lucene queries, so they rewrite to basic/native queries. In 
 earlier Lucene versions, wildcards were always rewritten to a BooleanQuery 
 with many TermQuerys (one for each term that matches the wildcard), leading 
 to the problem with too many terms. This is still the case, but only within 
 limits (this mode is only used if the wildcard expands to few terms). Those 
 BooleanQueries are then used with ConstantScoreQuery(Query).
 The above text talks about another mode (which is used for many terms today): 
 *no* BooleanQuery is built at all; instead, all matching terms' documents are 
 marked in a BitSet, and this BitSet is used with a Filter to construct a 
 different query type: ConstantScoreQuery(Filter). The BooleanQuery max clause 
 count does not apply, because no BooleanQuery is involved in the whole 
 process. If you use ConstantScoreQuery(BooleanQuery), the limit still 
 applies, but not for ConstantScoreQuery(internalWildcardFilter).
 
 Uwe
 
 On Apr 15, 2013, at 1:04 AM, Uwe Schindler wrote:
 
 The limit also applies for filters. If you have a list of terms ORed 
 together,
 the fastest way is not to use a BooleanQuery at all, but instead a 
 TermsFilter
 (which has no limits).
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
 -Original Message-
 From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
 Sent: Monday, April 15, 2013 9:53 AM
 To: java-user@lucene.apache.org
 Subject: Re: Statically store sub-collections for search (faceted
 search?)
 
 Am 12.04.2013 20:08, schrieb SUJIT PAL:
 Hi Carsten,
 
 Why not use your idea of the BooleanQuery but wrap it in a Filter
 instead?
 Since you are not doing any scoring (only filtering), the max boolean
 clauses limit should not apply to a filter.
 
 Hi Sujit,
 thanks for your suggestion! I wasn't aware that the max clause limit
  does not apply to a BooleanQuery wrapped in a filter. I suppose the
 ideal way would be to use a BooleanFilter but not a QueryWrapperFilter,
 right?
 
 However, I am also not sure how to apply a filter in my use case
 because I perform a SpanQuery. Although SpanQuery#getSpans() does
 take a Bits object as an argument (acceptDocs), I haven't been able
 to figure out how to generate this Bits object correctly from a Filter
 object.
 
 Best,
 Carsten
 
 --
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation Next Generation
 Corpus
 Analysis Platform
 
 
 
 

Re: Statically store sub-collections for search (faceted search?)

2013-04-13 Thread Shai Erera
Hi Carsten,

You're right that Lucene document numbers are ephemeral, but they are
consistent for a certain IndexReader instance. So perhaps you can use
SearcherLifetimeManager to obtain a 'version' of the reader that returned
the original results and store a bitset together with that version. Then
when the user further searches this subset of documents, you pull the
relevant reader from SLM given the 'version' information.

I think that you can write your own Pruner which prunes IR
instances/versions when their corresponding docs subset tables are no
longer needed...
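A sketch of that approach with the Lucene SearcherLifetimeManager API (class and method names below are made up for illustration; the 10-minute prune age is an arbitrary example):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherLifetimeManager;

// Sketch: pin the point-in-time reader that produced the original hits,
// store the token next to the per-user doc-id bitset, and re-acquire
// the same reader for follow-up searches on the subset.
public final class SubsetSessionSketch {
  private final SearcherLifetimeManager mgr = new SearcherLifetimeManager();

  public long remember(IndexSearcher searcher) throws IOException {
    return mgr.record(searcher); // persist this token with the bitset
  }

  public void searchSubset(long token) throws IOException {
    IndexSearcher same = mgr.acquire(token); // null if already pruned
    if (same == null) {
      return; // reader gone; the stored doc ids are no longer valid
    }
    try {
      // doc numbers recorded earlier are valid against exactly this reader
    } finally {
      mgr.release(same);
    }
  }

  public void cleanup() throws IOException {
    // drop readers older than 10 minutes (and their subsets with them)
    mgr.prune(new SearcherLifetimeManager.PruneByAge(600.0));
  }
}
```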

Shai


On Fri, Apr 12, 2013 at 9:08 PM, SUJIT PAL sujit@comcast.net wrote:

 Hi Carsten,

 Why not use your idea of the BooleanQuery but wrap it in a Filter instead?
 Since you are not doing any scoring (only filtering), the max boolean
 clauses limit should not apply to a filter.

 -sujit

 On Apr 12, 2013, at 7:34 AM, Carsten Schnober wrote:

  Dear list,
  I would like to create a sub-set of the documents in an index that is to
  be used for further searches. However, the criteria that lead to the
  creation of that sub-set are not predefined so I think that faceted
  search cannot be applied to my use case.
 
  For instance:
  A user searches for documents that contain token 'A' in a field 'text'.
  These results form a set of documents that is persistently stored (in a
  database). Each document in the index has a field 'id' that identifies
  it, so these external IDs are stored in the database.
 
  Later on, a user loads the document IDs from the database and wants to
  execute another search on this set of documents only. However,
  performing a search on the full index and subsequently filtering the
  results against that list of documents takes very long if there are many
  matches. This is obvious as I have to retrieve the external id from each
  matching document and check whether it is part of the desired sub-set.
  Constructing a BooleanQuery in the style id:Doc1 OR id:Doc2 ... is not
  suitable either because there could be thousands of documents exceeding
  any limit for Boolean clauses.
 
  Any suggestions how to solve this? I would have gone for the Lucene
  document numbers and store them as a bit set that I could use as a
  filter during later searches, but I read that the document numbers are
  ephemeral.
 
  One possible way out seems to be to create another index from the
  documents that have matched the initial search, but this seems quite an
  overkill, especially if there are plenty of them...
 
  Thanks for any hint!
  Carsten
 
  --
  Institut für Deutsche Sprache | http://www.ids-mannheim.de
  Projekt KorAP | http://korap.ids-mannheim.de
  Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
  Korpusanalyseplattform der nächsten Generation
  Next Generation Corpus Analysis Platform
 
 






Re: Statically store sub-collections for search (faceted search?)

2013-04-12 Thread SUJIT PAL
Hi Carsten,

Why not use your idea of the BooleanQuery but wrap it in a Filter instead? 
Since you are not doing any scoring (only filtering), the max boolean clauses 
limit should not apply to a filter.

-sujit

On Apr 12, 2013, at 7:34 AM, Carsten Schnober wrote:

 Dear list,
 I would like to create a sub-set of the documents in an index that is to
 be used for further searches. However, the criteria that lead to the
 creation of that sub-set are not predefined so I think that faceted
 search cannot be applied to my use case.
 
 For instance:
 A user searches for documents that contain token 'A' in a field 'text'.
 These results form a set of documents that is persistently stored (in a
 database). Each document in the index has a field 'id' that identifies
 it, so these external IDs are stored in the database.
 
 Later on, a user loads the document IDs from the database and wants to
 execute another search on this set of documents only. However,
 performing a search on the full index and subsequently filtering the
 results against that list of documents takes very long if there are many
 matches. This is obvious as I have to retrieve the external id from each
 matching document and check whether it is part of the desired sub-set.
 Constructing a BooleanQuery in the style id:Doc1 OR id:Doc2 ... is not
 suitable either because there could be thousands of documents exceeding
 any limit for Boolean clauses.
 
 Any suggestions how to solve this? I would have gone for the Lucene
 document numbers and store them as a bit set that I could use as a
 filter during later searches, but I read that the document numbers are
 ephemeral.
 
 One possible way out seems to be to create another index from the
 documents that have matched the initial search, but this seems quite an
 overkill, especially if there are plenty of them...
 
 Thanks for any hint!
 Carsten
 
 -- 
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation
 Next Generation Corpus Analysis Platform
 
 

