On Monday 12 October 2009 23:29:07 Christoph Boosz wrote:
> Hi Paul,
> 
> Thanks for your suggestion. I will test it within the next few days.
> However, due to memory limitations, it will only work if the number of hits
> is small enough, am I right?

One can load a single term vector at a time, so in this case the only
memory limitation is the possibly large map of per-term document
counters. For best performance, try to load the term vectors in docId
order, after the original query has completed.
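
In code, roughly (a sketch against the Lucene 2.9-era API; it assumes
term vectors were stored for a "body" field, and the names are
illustrative):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

// Counts, per term, the number of matching docs that contain it.
static Map<String, Integer> countTerms(IndexReader reader, int[] docIds)
    throws IOException {
  Arrays.sort(docIds);  // load term vectors in docId order
  Map<String, Integer> counts = new HashMap<String, Integer>();
  for (int docId : docIds) {
    TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
    if (tfv == null) continue;  // no term vector stored for this doc
    for (String term : tfv.getTerms()) {
      Integer c = counts.get(term);
      counts.put(term, c == null ? 1 : c + 1);
    }
  }
  return counts;
}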

In any case it would be good to somehow limit the number of
documents considered, for example by using only the ones with the
best query scores.
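
For example, via the score-sorted top hits (a sketch; the cutoff of
1000 is arbitrary, and userQuery/searcher are illustrative names):

// Take only the N best-scoring hits as the docs whose term
// vectors get loaded.
TopDocs top = searcher.search(userQuery, 1000);
int[] docIds = new int[top.scoreDocs.length];
for (int i = 0; i < docIds.length; i++) {
  docIds[i] = top.scoreDocs[i].doc;
}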

Limiting the number of terms would also be good, but that is less easy.

Regards,
Paul Elschot

> 
> Chris
> 
> 2009/10/12 Paul Elschot <paul.elsc...@xs4all.nl>
> 
> > Chris,
> >
> > You could also store term vectors for all docs at indexing
> > time, and add the termvectors for the matching docs into a
> > (large) map of terms in RAM.
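> >
> > (At indexing time that means something like the following sketch;
> > the "body" field name is illustrative:
> >
> >   doc.add(new Field("body", text,
> >       Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
> >
> > with Field.TermVector.WITH_POSITIONS_OFFSETS instead if positions
> > and offsets are also needed.)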
> >
> > Regards,
> > Paul Elschot
> >
> >
> > On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > > Hi Jake,
> > >
> > > Thanks for your helpful explanation.
> > > In fact, my initial solution was to traverse each document in the result
> > > once and count the contained terms. As you mentioned, this process took a
> > > lot of memory.
> > > Trying to confine the memory usage with the facet approach, I was
> > > surprised by the decline in performance.
> > > Now I know it's nothing abnormal, at least.
> > >
> > > Chris
> > >
> > >
> > > 2009/10/12 Jake Mannix <jake.man...@gmail.com>
> > >
> > > > Hey Chris,
> > > >
> > > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > > christoph.bo...@googlemail.com> wrote:
> > > >
> > > > > Thanks for your reply.
> > > > > Yes, it's likely that many terms occur in few documents.
> > > > >
> > > > > If I understand you right, I should do the following:
> > > > > -Write a HitCollector that simply increments a counter
> > > > > -Get the filter for the user query once: new CachingWrapperFilter(new
> > > > > QueryWrapperFilter(userQuery));
> > > > > -Create a TermQuery for each term
> > > > > -Perform the search and read the counter of the HitCollector
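> > > > >
> > > > > Roughly (a sketch of the above against the Lucene 2.9-era API;
> > > > > the variable names are illustrative):
> > > > >
> > > > > // HitCollector that just counts hits; scores are ignored.
> > > > > static class CountingCollector extends HitCollector {
> > > > >   int count = 0;
> > > > >   public void collect(int doc, float score) { count++; }
> > > > > }
> > > > >
> > > > > Filter userFilter = new CachingWrapperFilter(
> > > > >     new QueryWrapperFilter(userQuery));
> > > > > Map<String, Integer> counts = new HashMap<String, Integer>();
> > > > > for (Term term : terms) {  // one extra search per term
> > > > >   CountingCollector c = new CountingCollector();
> > > > >   searcher.search(new TermQuery(term), userFilter, c);
> > > > >   counts.put(term.text(), c.count);
> > > > > }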
> > > > >
> > > > > I did that, but it didn't get faster. Any ideas why?
> > > > >
> > > >
> > > > The killer is the "TermQuery for each term" part - this is huge.
> > > > You need to invert this process: use your query as is, but while
> > > > walking in the HitCollector, on each doc which matches your query,
> > > > increment counters for each of the terms in that document (which
> > > > means you need an in-memory forward lookup for your documents, like
> > > > a multivalued FieldCache - and if you've got roughly the same number
> > > > of terms as documents, this cache is likely to be as large as your
> > > > entire index - a pretty hefty RAM cost).
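> > > >
> > > > A sketch of that inverted loop (Lucene 2.9-era API; the docTerms
> > > > array stands in for that hypothetical in-RAM forward lookup, which
> > > > Lucene does not give you out of the box):
> > > >
> > > > // Per matching doc, increments a counter for each term it contains.
> > > > static class FacetCollector extends HitCollector {
> > > >   final String[][] docTerms;  // docId -> its terms, preloaded in RAM
> > > >   final Map<String, Integer> counts = new HashMap<String, Integer>();
> > > >   FacetCollector(String[][] docTerms) { this.docTerms = docTerms; }
> > > >   public void collect(int doc, float score) {
> > > >     for (String term : docTerms[doc]) {
> > > >       Integer c = counts.get(term);
> > > >       counts.put(term, c == null ? 1 : c + 1);
> > > >     }
> > > >   }
> > > > }
> > > > // usage: searcher.search(userQuery, new FacetCollector(docTerms));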
> > > >
> > > > But a good thing to keep in mind is that doing this kind of faceting
> > > > (massively multivalued on a huge term-set) requires a lot of
> > > > computation, even if you have all the proper structures living in
> > > > memory:
> > > >
> > > > For each document you look at (which matches your query), you need to
> > > > look at all of the terms in that document, and increment a counter for
> > > > that term. So however much time it would normally take for you to do
> > > > the driving query, it can take as much as that multiplied by the
> > > > average number of terms in a document in your index. If your documents
> > > > are big, this could be a pretty huge latency penalty.
> > > >
> > > >  -jake
> > > >
> > >
> >
> >
> 
