Re: multiple collections indexing

Vladimir Lukin Wed, 19 Mar 2003 04:06:27 -0800

Hello Morus,

Here is how a wildcard query works:


1. First, it runs over the lexicon and collects a list of terms that
   satisfy the specified pattern.
2. Then it makes a boolean query joining the collected terms with "or".
3. Then the constructed boolean query is used for searching.
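
The three steps above can be sketched in plain Java, without any Lucene
classes. This is only an illustration of the idea, not the library's actual
code: the class and method names are hypothetical, the lexicon is a plain list
of strings, and the "boolean query" is just an or-joined string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: expands a wildcard pattern against an in-memory lexicon.
final class WildcardExpansionSketch {

    // Translate a wildcard pattern ('?' = exactly one char, '*' = any run of chars)
    // into a regular expression; every other character is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Step 1: run over the whole lexicon and collect the terms that match.
    static List<String> expand(List<String> lexicon, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        List<String> matches = new ArrayList<>();
        for (String term : lexicon) {
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    // Step 2: join the collected terms with "OR" into one query string.
    static String toOrQuery(String field, List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }
}
```

Note that the cost of step 1 depends on the size of the lexicon, and the
resulting query has one clause per matching term, which is the same shape as a
hand-written or-joined list.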

So it seems to me that a wildcard search gives no performance benefit
over extending the query with an or-joined list of collection
names, because both have almost the same complexity.

Moreover, using a field for the collection name will cause no problem:
although a boolean query can't contain more than 32 required and
prohibited clauses (in the release I use), it has no limit on optional
clauses, so we can join up to Integer.MAX_VALUE queries with "or".
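
As a rough sketch of what such an extended query could look like, the helper
below builds a query string in which the collection names form one group of
optional clauses, and that group as a whole is a single required clause. The
field name "collection", the class and the method are assumptions made for
illustration, and collection names are assumed to need no escaping.

```java
import java.util.List;

// Hypothetical helper: restricts a main query to a set of collections.
final class CollectionQuerySketch {

    // Returns "+(main) +(collection:a collection:b ...)".
    // The inner clauses are optional (or-joined); only the two outer
    // groups are required, however many collections are listed.
    static String restrictToCollections(String mainQuery, List<String> collections) {
        if (collections.isEmpty()) {
            throw new IllegalArgumentException("at least one collection is needed");
        }
        StringBuilder optional = new StringBuilder();
        for (String name : collections) {
            if (optional.length() > 0) {
                optional.append(' ');
            }
            optional.append("collection:").append(name);
        }
        return "+(" + mainQuery + ") +(" + optional + ")";
    }
}
```

With 40 collections selected, the result still has only two required clauses,
so the limit on required and prohibited clauses is not reached.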

Using filters can give some performance benefit, provided that you
can somehow create a filter faster than by running a query on collection
names, and then use this filter with the main query. This can be
done, for example, by loading the list of document ids for each collection
into memory and then merging them. This will save some time, but it is
inefficient in memory use, and you'd also have to write some code.
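
A minimal sketch of that idea, in plain Java rather than against Lucene's
filter classes: the per-collection id lists are assumed to be already loaded
into a map, the selected lists are merged into one bit set, and the hits of
the main query are then checked against it. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory collection filter built from id lists.
final class CollectionFilterSketch {

    // Merge the document ids of all selected collections into one bit set.
    // Collections missing from the map are silently skipped.
    static BitSet buildFilter(Map<String, int[]> idsByCollection, List<String> selected) {
        BitSet allowed = new BitSet();
        for (String name : selected) {
            int[] ids = idsByCollection.get(name);
            if (ids == null) {
                continue;
            }
            for (int id : ids) {
                allowed.set(id);
            }
        }
        return allowed;
    }

    // Keep only those hits of the main query whose id is set in the filter.
    static List<Integer> apply(int[] mainQueryHits, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int id : mainQueryHits) {
            if (allowed.get(id)) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

The id lists for all collections have to stay in memory, which is where the
memory cost mentioned above comes from.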

> Hi,

> we are currently evaluating lucene.

> The data we'd like to index consists of ~ 80 collections of documents
> (a few hundred up to 200000 documents per collection, ~ 1.5 million documents
> total; medium document size is in the order of 1 kB).

> Searches must be possible on any combination of collections.
> A typical search includes ~ 40 collections.

> Now the question is how best to implement this in lucene.

> Currently I see basically three possibilities:
> - create a data field containing the collection name for each document
>   and extend the query by an or-combined list of queries on this name field.
> - create an index per collection and use a MultiSearcher to search all
>   interesting indexes.
> - (a third one I just discovered): create a data field containing a
>   marker for each collection
>   x100000000000000000... for the first collection
>   x010000000000000000... for the second
>   x001000000000000000... for the third
>   and so on.
>   The query might use a wildcard search on this field using x?0?00000...
>   specifying '?' for each collection that should be searched on, and '0'
>   for the others.
>   The marker would be very long though (the number of collections is
>   growing, so we have to keep space for new ones as well).

> So far we set up the first approach (one index; size ~ 750 M) and this
> seems to work in principle and with reasonable performance.
> I'm not too optimistic about the second approach. If I understand the docs
> correctly this would be a sequential search on each involved index and
> combining the results.

> So questions:
> - has anyone experience with such a setup?
> - are there other approaches to deal with it?
> - is my expectation that multiple indexes are worse reasonable, or should
>   we give it a try?
> - how is wildcard search done? Could this be an improvement?

> I understand that in the end, we have to check this ourselves, but I'd
> appreciate any hints and advice since I couldn't find much on this
> issue in the docs.

> greetings
>         Morus

> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]




-- 
Best regards,
 Vladimir                            mailto:[EMAIL PROTECTED]


