Ok, I will have a shot at the ascending docId order.

Chris
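(For reference, a minimal, untested sketch of the docId-ordered term-vector counting Paul describes below, written against the pre-3.0 Lucene API used in the thread (HitCollector, IndexReader.getTermFreqVector). The class name, the field argument, and the use of a BitSet to remember matches are illustrative, not from the discussion.)

import java.io.IOException;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/**
 * Counts, for every term in the given field, how many of the documents
 * matching the query contain that term. Term vectors are loaded one at a
 * time, in ascending docId order, after the original query has completed,
 * so the only large structure in RAM is the map of counters per term.
 */
public class TermVectorFacetCounter {

    public Map<String, Integer> count(IndexReader reader, Query userQuery, String field)
            throws IOException {
        // 1. Run the query once and remember the matching docIds.
        final BitSet matches = new BitSet(reader.maxDoc());
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.search(userQuery, new HitCollector() {
            public void collect(int doc, float score) {
                matches.set(doc);
            }
        });

        // 2. Walk the matches in ascending docId order, loading a single
        //    term vector at a time and incrementing the per-term counters.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            TermFreqVector tfv = reader.getTermFreqVector(doc, field);
            if (tfv == null) continue; // field was not indexed with term vectors
            String[] terms = tfv.getTerms();
            for (int i = 0; i < terms.length; i++) {
                Integer old = counts.get(terms[i]);
                counts.put(terms[i], old == null ? 1 : old + 1);
            }
        }
        return counts;
    }
}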
2009/10/13 Paul Elschot <paul.elsc...@xs4all.nl>
> On Monday 12 October 2009 23:29:07 Christoph Boosz wrote:
> > Hi Paul,
> >
> > Thanks for your suggestion. I will test it within the next few days. However, due to memory limitations, it will only work if the number of hits is small enough, am I right?
>
> One can load a single term vector at a time, so in this case the memory limitation is only in the possibly large map of doc counters per term. For best performance, try to load the term vectors in docId order, after the original query has completed.
>
> In any case it would be good to somehow limit the number of documents considered, for example by using the ones with the best query score.
>
> Limiting the number of terms would also be good, but that is less easy.
>
> Regards,
> Paul Elschot
>
> > Chris
> >
> > 2009/10/12 Paul Elschot <paul.elsc...@xs4all.nl>
> >
> > > Chris,
> > >
> > > You could also store term vectors for all docs at indexing time, and add the term vectors for the matching docs into a (large) map of terms in RAM.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > > On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > > > Hi Jake,
> > > >
> > > > Thanks for your helpful explanation. In fact, my initial solution was to traverse each document in the result once and count the contained terms. As you mentioned, this process took a lot of memory. Trying to confine the memory usage with the facet approach, I was surprised by the decline in performance. Now I know it's nothing abnormal, at least.
> > > >
> > > > Chris
> > > >
> > > > 2009/10/12 Jake Mannix <jake.man...@gmail.com>
> > > >
> > > > > Hey Chris,
> > > > >
> > > > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <christoph.bo...@googlemail.com> wrote:
> > > > >
> > > > > > Thanks for your reply. Yes, it's likely that many terms occur in few documents.
> > > > > >
> > > > > > If I understand you right, I should do the following:
> > > > > > -Write a HitCollector that simply increments a counter
> > > > > > -Get the filter for the user query once: new CachingWrapperFilter(new QueryWrapperFilter(userQuery));
> > > > > > -Create a TermQuery for each term
> > > > > > -Perform the search and read the counter of the HitCollector
> > > > > >
> > > > > > I did that, but it didn't get faster. Any ideas why?
> > > > >
> > > > > The killer is the "TermQuery for each term" part - this is huge. You need to invert this process: use your query as is, but while walking in the HitCollector, on each doc which matches your query, increment counters for each of the terms in that document (which means you need an in-memory forward lookup for your documents, like a multivalued FieldCache - and if you've got roughly the same number of terms as documents, this cache is likely to be as large as your entire index - a pretty hefty RAM cost).
> > > > > But a good thing to keep in mind is that doing this kind of faceting (massively multivalued, on a huge term set) requires a lot of computation, even if you have all the proper structures living in memory:
> > > > >
> > > > > For each document you look at (which matches your query), you need to look at all of the terms in that document and increment a counter for that term. So however much time it would normally take you to do the driving query, it can take as much as that multiplied by the average number of terms in a document in your index. If your documents are big, this could be a pretty huge latency penalty.
> > > > >
> > > > > -jake
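(And for the inverted direction Jake describes above - counting during collection against an in-memory forward lookup from doc to terms - a rough sketch, again against the pre-3.0 API. Building termsByDoc from term vectors up front is only one illustrative way to get such a lookup, and as Jake notes it can cost roughly as much RAM as the index itself; the class and field names are made up for the example.)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/**
 * Counters are incremented while the query is being collected, using a
 * doc -> terms forward lookup that is built once and held entirely in RAM.
 * Per-query cost is roughly the driving query time multiplied by the
 * average number of terms per matching document.
 */
public class ForwardLookupFacetCounter {

    private final String[][] termsByDoc; // forward lookup: docId -> terms in that doc

    /** Builds the forward lookup once; roughly as large as the index itself. */
    public ForwardLookupFacetCounter(IndexReader reader, String field) throws IOException {
        termsByDoc = new String[reader.maxDoc()][];
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (reader.isDeleted(doc)) continue;
            TermFreqVector tfv = reader.getTermFreqVector(doc, field);
            termsByDoc[doc] = (tfv == null) ? new String[0] : tfv.getTerms();
        }
    }

    /** Runs the query once and counts the terms of every matching document. */
    public Map<String, Integer> count(IndexReader reader, Query userQuery) throws IOException {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        new IndexSearcher(reader).search(userQuery, new HitCollector() {
            public void collect(int doc, float score) {
                String[] terms = termsByDoc[doc];
                if (terms == null) return;
                for (int i = 0; i < terms.length; i++) {
                    Integer old = counts.get(terms[i]);
                    counts.put(terms[i], old == null ? 1 : old + 1);
                }
            }
        });
        return counts;
    }
}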