Ah, thanks. I will make the change, make sure all tests pass, add some
tests of my own, and submit a patch. It would be great to get a feel for the
performance impact.

The change I'm making is to the Collector interface, from

   public void collect(int doc)

to

   public boolean collect(int doc)

with a default return value of "false". If a collector returns "true", the
caller may stop collecting, so all normal collectors would simply return
"false". This could introduce some performance overhead from passing a return
value back up the call stack and from the extra check in the calling code.
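To make the idea concrete, here is a rough sketch of the contract described above. It is not the actual patch and does not use Lucene's classes: EarlyExitCollector, CountingCollector, FirstHitCollector and ScoringLoop are all hypothetical names, and the int[] of matching docs stands in for whatever the scorer iterates over.

```java
// Hypothetical stand-in for the proposed contract; NOT Lucene's actual Collector.
interface EarlyExitCollector {
    // Returns true to signal that the caller may stop collecting.
    boolean collect(int doc);
}

// A "normal" collector: it wants every hit, so it always returns false.
final class CountingCollector implements EarlyExitCollector {
    int count = 0;

    @Override
    public boolean collect(int doc) {
        count++;
        return false;
    }
}

// A collector that only needs to know whether anything matches at all.
final class FirstHitCollector implements EarlyExitCollector {
    int firstDoc = -1;

    @Override
    public boolean collect(int doc) {
        firstDoc = doc;
        return true; // one hit is enough, the caller may stop
    }
}

// Hypothetical calling code: the extra check on the return value is the
// per-document overhead mentioned above.
final class ScoringLoop {
    // Feeds matching doc ids to the collector and returns how many were visited.
    static int collectAll(int[] matchingDocs, EarlyExitCollector collector) {
        int visited = 0;
        for (int doc : matchingDocs) {
            visited++;
            if (collector.collect(doc)) {
                break; // collector asked to stop
            }
        }
        return visited;
    }
}
```

With four matching docs, the counting collector visits all four, while the first-hit collector causes the loop to exit after the first one.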

However, for some other use cases, such as

   - checking to see if a term actually has occurrences (to filter a
   suggested-terms list quickly), including possible applied facets,
   - getting a paged result list for which order or sorting is not
   important, like having a lazy result block, or
   - facet overlap computation (Venn diagrams),

this would give a significant (order of magnitude) performance improvement
for large indexes. Those are the kinds of use cases that I think are becoming
more and more important.

Of course, this trick could also be implemented by a new class LazyCollector,
possibly wrapping older Collector objects, but I wanted to test out the
functionality first and then worry about the best way to integrate this with
the existing Lucene code base. The impact of changing the Collector interface
on existing integrations and Lucene user code is significant too.
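A minimal sketch of how such a wrapper might look, assuming the wrapped collector keeps the old void-returning method. Apart from the name LazyCollector, everything here is hypothetical (LegacyCollector, LimitingLazyCollector, ListCollector, and the "stop after N hits" policy), and none of it is Lucene's real API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an existing collector with the old signature.
interface LegacyCollector {
    void collect(int doc);
}

// Hypothetical new-style collector: returns true when the caller may stop.
interface LazyCollector {
    boolean collect(int doc);
}

// Wraps an old-style collector and asks the caller to stop after `limit` hits,
// e.g. to fill one unsorted result page.
final class LimitingLazyCollector implements LazyCollector {
    private final LegacyCollector delegate;
    private final int limit;
    private int collected = 0;

    LimitingLazyCollector(LegacyCollector delegate, int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.delegate = delegate;
        this.limit = limit;
    }

    @Override
    public boolean collect(int doc) {
        delegate.collect(doc); // the old collector is unchanged
        collected++;
        return collected >= limit; // true once the page is full
    }
}

// Simple old-style collector that just remembers the doc ids it was given.
final class ListCollector implements LegacyCollector {
    final List<Integer> docs = new ArrayList<>();

    @Override
    public void collect(int doc) {
        docs.add(doc);
    }
}
```

The appeal of this variant is that existing collectors and user code keep compiling as they are; only callers that know about LazyCollector would check the return value.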

I will keep you guys posted on progress, thanks,

Anne

On Wed, Sep 7, 2011 at 14:34, Michael McCandless
<luc...@mikemccandless.com> wrote:

> This sounds like a neat patch -- what changes are you exploring to
> Collector?
>
> These days when I need to test indexing or searching performance I
> usually use the Python scripts from here:
>
>    https://hg.codespot.com/a/apache-extras.org/luceneutil
>
> Unfortunately they are rather involved to set up (which we need to fix)...
>
> If you open a Jira issue and attach a patch I'd be happy to performance
> test it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Sep 7, 2011 at 3:35 AM, Anne Veling <a...@beyondtrees.com> wrote:
> > I've been following the Lucene/Solr community for a long time and finally
> > have found (or: taken) the time to start implementing some of my ideas how
> > to improve on it; this will be my first proposed patch.
> > I'm working on some changes to the Collector API to significantly improve
> > the performance of some use cases, but my change may have a negative effect
> > on other use cases (though I doubt it), including memory resources. Of
> > course such an effect would only be measurable for larger index sizes.
> > My question is: how can I best test this? Is there a common dataset/index
> > that is used to verify that patches do not degrade search performance? I can
> > do some testing on my own wikipedia index of course, but I guess that
> > aligning with the performance tools you guys are using, will be better
> > Thank you, keep up the good work,
> > Anne
> >
> > --
> > Anne Veling
> > BeyondTrees.com
> > +31 6 50 969 170
> > @anneveling
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


-- 
Anne Veling
BeyondTrees.com
+31 6 50 969 170
@anneveling
