This task reminds me more of a SQL count(*) query than a text search query.
Assuming that using a text search engine is a prerequisite, I can think of
two approaches: one based on Lucene scoring, as suggested in the question,
or a simpler approach (below).

For the scoring approach, I don't see an easy way to get the counts from the
score of the results. The TF (term frequency in candidate docs) is known and
used during document scoring, and it seems that the application could be
arranged so that the TF of each search result document would be the required
count, but the score itself does not expose that count directly.

Perhaps a more straightforward solution would do: add a Lucene document for
each star-movie pair. This would also allow an easy update when a new movie
arrives: just add a document for each "star" in that movie. A document can
have these fields:

  StarFirstName - stored, untokenized
  StarLastName  - stored, untokenized
  MovieName     - stored, tokenized
  MovieType     - stored, untokenized (this is the pre-computed type
                  mentioned below)
  MovieProps    - unstored, tokenized (the word "horror" can appear in this
                  field, avoiding a pre-computation step)

Now a single search can do all the work:

  +StarLastName:A* +MovieProps:horror

Sorting the results by StarLastName would group all results for the same
"star" together, which also makes it possible to count them for each star.

This would create more documents in the index, one per star-movie pair
(#stars * |#movies per star|), so there may be performance considerations,
depending on the volume of the data...

Regards,
Doron

"Russell M. Allen" <[EMAIL PROTECTED]> wrote on 27/07/2006 09:02:46:

> I am curious about the potential use of document scoring as a means to
> extract additional data from an index.  Specifically, I would like the
> score to be a count of how many times a particular field matched a set
> of terms.
>
> For example, I am indexing movie-stars (Each document is a movie-star).
> A movie-star has a number of fields, such as name, movies they have been
> in, etc.  I want to produce an 'index' of stars by name and show how
> many movies, which match a filter, that they have appeared in.
>
> In natural language my query might be:
>   "List all stars who have appeared in a 'horror' movie, where
> last name starts with A, and tell me how many horror movies they were
> in."
>
> My search will look something like this:
>   "+lastName:A* +movie:(1 7 21 58 92)"  //where movie is a
> previously computed list of 'horror' movie ids
>
> If my index contained the following documents:
>   doc1 = lastName:Anna  movie:{3 10}
>   doc2 = lastName:Aba   movie:{1 10 12}
>   doc3 = lastName:Addd  movie:{3 21 55 92}
>   doc4 = lastName:Baaa  movie:{7 56}
>
> I would like to get back:
>   doc2, score of 1  //score of 1 because only movie 1 matched
>   doc3, score of 2  //score of 2 because movies 21 and 92 matched
>
> Currently, we perform an initial query against our Star index to
> retrieve a list of stars.  Then we perform N queries against a separate
> movie index to count the number of movies that match our sub filter
> 'horror'.  This is obviously very inefficient, and as I've shown above,
> the information (count) is available during the primary query.
>
> Thoughts?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
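
P.S. To make the star-movie pair idea concrete, here is a minimal sketch in plain Python. It does not use Lucene at all: dicts stand in for index documents, and a list filter stands in for the query +StarLastName:A* +MovieProps:horror. The data is the four-star example from the quoted message. The names build_pair_docs and count_matches are hypothetical, and I use a MovieId field in place of MovieName only because the example gives ids rather than titles.

```python
from itertools import groupby

# Data taken from the quoted example: last name -> movie ids.
STAR_MOVIES = {
    "Anna": [3, 10],
    "Aba": [1, 10, 12],
    "Addd": [3, 21, 55, 92],
    "Baaa": [7, 56],
}
HORROR_MOVIE_IDS = {1, 7, 21, 58, 92}


def build_pair_docs(star_movies, horror_ids):
    """Create one 'document' (a dict) per star-movie pair."""
    docs = []
    for last_name, movie_ids in star_movies.items():
        for movie_id in movie_ids:
            # MovieProps holds searchable terms; only "horror" is modelled here.
            props = {"horror"} if movie_id in horror_ids else set()
            docs.append({
                "StarLastName": last_name,
                "MovieId": movie_id,  # assumption: stands in for MovieName
                "MovieProps": props,
            })
    return docs


def count_matches(docs, last_name_prefix, prop):
    """Emulate +StarLastName:<prefix>* +MovieProps:<prop>, then count per star."""
    hits = [
        d for d in docs
        if d["StarLastName"].startswith(last_name_prefix)
        and prop in d["MovieProps"]
    ]
    # Sorting by last name puts each star's hits next to each other,
    # so a single pass with groupby can count them.
    hits.sort(key=lambda d: d["StarLastName"])
    return [
        (name, sum(1 for _ in group))
        for name, group in groupby(hits, key=lambda d: d["StarLastName"])
    ]


docs = build_pair_docs(STAR_MOVIES, HORROR_MOVIE_IDS)
print(count_matches(docs, "A", "horror"))
```

For this data the sketch builds 11 pair documents, and the counts it reports for Aba and Addd are the same 1 and 2 that the original question asked for as scores. Grouping on last name alone would merge two stars who share a last name, so a real index would probably sort and group on both name fields.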