Hello Morus,

I'll tell you how a wildcard query works:
1. First, it runs over the lexicon and collects a list of terms that satisfy the specified pattern.
2. Then it builds a boolean query joining the collected terms with "or".
3. Then the constructed boolean query is used for the search.

So it seems to me that using a wildcard search gives no performance benefit compared with extending the query by an or-joined list of collection names, because both have almost the same complexity.

Moreover, using a field for the collection name will cause no problem: although a boolean query cannot contain more than 32 required and prohibited clauses (in the release I use), it has no limit on optional clauses, so we can join with "or" up to Integer.MAX_VALUE queries.

Using filters could offer some performance benefit: if you could somehow create a filter faster than running a query on collection names, you could then use this filter with the main query. This could be achieved, for example, by loading the list of document ids for each collection into memory and then merging the lists. That would save some time, but it is inefficient in terms of memory use, and you would also have to write some code.

> Hi,
> we are currently evaluating lucene.
> The data we'd like to index consists of ~ 80 collections of documents
> (a few hundred up to 200000 documents per collection, ~ 1.5 million documents
> total; average document size is on the order of 1 kB).
> Searches must be possible on any combination of collections.
> A typical search includes ~ 40 collections.
> Now the question is how best to implement this in lucene.
> Currently I see basically three possibilities:
> - create a data field containing the collection name for each document
>   and extend the query by an or-combined list of queries on this name field.
> - create an index per collection and use a MultiSearcher to search all
>   interesting indexes.
> - (a third one I just discovered): create a data field containing a
>   marker for each collection:
>   x100000000000000000...
>   for the first collection,
>   x010000000000000000... for the second,
>   x001000000000000000... for the third,
>   and so on.
> The query might use a wildcard search on this field using x?0?00000...,
> specifying '?' for each collection that should be searched and '0'
> for the others.
> The marker would be very long, though (the number of collections is
> growing, so we have to keep space for new ones as well).
> So far we have set up the first approach (one index; size ~ 750 MB), and this
> seems to work in principle and with reasonable performance.
> I'm not too optimistic about the second approach. If I understand the docs
> correctly, this would be a sequential search on each involved index,
> combining the results.
> So, questions:
> - has anyone experience with such a setup?
> - are there other approaches to deal with it?
> - is my expectation that multiple indexes are worse reasonable, or should
>   we give it a try?
> - how is wildcard search done? Could this be an improvement?
> I understand that in the end we have to check this ourselves, but I'd
> appreciate any hints and advice, since I couldn't find much on this
> issue in the docs.
> greetings
> Morus
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

--
Best regards,
Vladimir
mailto:[EMAIL PROTECTED]
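P.S. To make the mechanics concrete, here is a rough plain-Java sketch of the two ideas above: step 1 (scanning the lexicon for terms matching a wildcard pattern, where '?' is one character and '*' is any run of characters) and the filter idea (merging per-collection document-id sets). This is a simulation, not Lucene code; the class and method names are my own illustration, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardSketch {

    // Step 1: run over the lexicon and collect every term that
    // satisfies the wildcard pattern ('?' = one char, '*' = any run).
    public static List<String> expand(String pattern, List<String> lexicon) {
        StringBuilder re = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '?') re.append('.');
            else if (c == '*') re.append(".*");
            else re.append(Pattern.quote(String.valueOf(c)));
        }
        Pattern p = Pattern.compile(re.toString());
        List<String> hits = new ArrayList<>();
        for (String term : lexicon) {
            if (p.matcher(term).matches()) hits.add(term);
        }
        // Step 2 would then or-join one term query per hit.
        return hits;
    }

    // Filter idea: one set of document ids per collection, merged
    // with a union; the result is intersected with the main query.
    public static BitSet orJoin(List<BitSet> docIdSets) {
        BitSet result = new BitSet();
        for (BitSet docs : docIdSets) result.or(docs);
        return result;
    }
}
```

The union in orJoin is also what the or-joined boolean query effectively computes over the matching terms' posting lists, which is why I expect the two approaches to have almost the same complexity.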