Re: Proper use of TermsEnum.seek?

Toke Eskildsen Tue, 22 Feb 2011 02:56:27 -0800

On Mon, 2011-02-21 at 16:00 +0100, Simon Willnauer wrote:
> For all real codecs seek(BR, TermState) should be as fast as it gets.
> There are some codecs which simply forward to seek(BR) so if you have
> the TermState already you won't lose anything. This might also answer
> your other question, if you pass an empty BytesRef to a codec that did
> not override the seek(BR, TermState) method it will seek to the empty
> term and your code might not work anymore.


Thanks, that makes sense. 

It seems to me that I'll have to use the strategy pattern and make a
TermsEnum-implementation-aware wrapper (or rather codec-aware?), if I
want the "best" ordinal-seeker.
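To make the strategy idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: a sorted list of strings stands in for the term dictionary, and OrdinalSeeker, DirectOrdSeeker, ScanningSeeker and OrdinalSeekers are hypothetical names made up for illustration. A real wrapper would have to detect ordinal support on the actual TermsEnum or codec.

```java
import java.util.List;

// Hypothetical strategy interface: resolves an ordinal to its term.
interface OrdinalSeeker {
    String termAt(int ord);
}

// Strategy for term sources with native ordinal access (random access here).
class DirectOrdSeeker implements OrdinalSeeker {
    private final List<String> sortedTerms;

    DirectOrdSeeker(List<String> sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    public String termAt(int ord) {
        return sortedTerms.get(ord);
    }
}

// Fallback strategy for sources that can only be iterated in sorted order.
class ScanningSeeker implements OrdinalSeeker {
    private final Iterable<String> sortedTerms;

    ScanningSeeker(Iterable<String> sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    public String termAt(int ord) {
        int i = 0;
        for (String term : sortedTerms) {
            if (i++ == ord) {
                return term;
            }
        }
        throw new IndexOutOfBoundsException("ord " + ord);
    }
}

class OrdinalSeekers {
    // Picks a strategy from a capability flag that the wrapper would detect.
    static OrdinalSeeker forSource(List<String> sortedTerms, boolean supportsOrd) {
        return supportsOrd
                ? new DirectOrdSeeker(sortedTerms)
                : new ScanningSeeker(sortedTerms);
    }
}
```

The caller only sees OrdinalSeeker, so a smarter fallback can replace the linear scan later without touching the calling code.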

Toke:
> > I tried calling with an empty BytesRef term. This gave me an empty
> > result back for the call itself, but the correct terms for subsequent
> > calls to next. This works perfectly for my scenario. However, that was
> > just an experiment using the default variable gap codec, so I am unsure
> > if I can count on this behavior for any given codec?
> 
> what do you mean by an empty result for the call itself?

Sorry, I mixed things up. I mean I tried calling with an empty term and
getting the term with the term()-method, which returned an empty
BytesRef after the initial call. Anyway, since codecs are free to fall
back to BytesRef-seek, my options are reduced to
seek(BytesRef, TermState) with real values
or
seek(BytesRef) which I expect is normally O(log n) or better.

> can't you use a codec that supports ord for your facet / sort fields?

That was also Mike McCandless' suggestion in
https://issues.apache.org/jira/browse/LUCENE-2843

I think this might be counter-productive. If a non-ordinal-supporting
codec has significantly lower impact on memory, the extra bookkeeping
for a BytesRef/TermState-seek-cache might be small enough so that the
total overhead is still less than that of an ordinal-supporting codec.
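A rough sketch of what such a seek-cache could look like, in plain Java rather than against the Lucene API. The sorted string list stands in for the term dictionary, the cached strings stand in for BytesRef/TermState pairs, binarySearch models a term seek and the index offset models repeated calls to next(). CheckpointedSeeker and its methods are hypothetical names, and the terms are assumed to be unique and sorted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CheckpointedSeeker {
    private final List<String> sortedTerms; // stand-in for the term dictionary
    private final int interval;             // one checkpoint per `interval` terms
    private final List<String> checkpoints = new ArrayList<>(); // cached seek keys

    CheckpointedSeeker(List<String> sortedTerms, int interval) {
        this.sortedTerms = sortedTerms;
        this.interval = interval;
        // Remember the term at ordinal 0, interval, 2 * interval, ...
        for (int ord = 0; ord < sortedTerms.size(); ord += interval) {
            checkpoints.add(sortedTerms.get(ord));
        }
    }

    String termAt(int ord) {
        if (ord < 0 || ord >= sortedTerms.size()) {
            throw new IndexOutOfBoundsException("ord " + ord);
        }
        // Seek to the nearest checkpoint at or before ord (models a term seek).
        String key = checkpoints.get(ord / interval);
        int pos = Collections.binarySearch(sortedTerms, key);
        // Step forward over the remaining terms (models calls to next()).
        int steps = ord % interval;
        return sortedTerms.get(pos + steps);
    }

    int checkpointCount() {
        return checkpoints.size();
    }
}
```

The interval trades memory for speed: a larger interval means fewer cached entries but more next() steps per lookup, at most interval - 1.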

I did try a quick experiment with the variable gap vs. fixed gap codec,
where I kept every 32nd BytesRef+TermState for the variable gap. With a
50M term field, this increased the overhead from 600MB to 800MB (or
about 130 bytes for each BytesRef/TermState-pair, ignoring the
memory-impact-difference for variable vs. fixed). This clearly does not
support my theory. I'll have to make a proper test, but a strong
recommendation of using an ordinal-supporting codec might very well be
the best solution.
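The per-pair figure follows from dividing the extra memory by the number of cached pairs. The small Java check below redoes that arithmetic. It assumes MB means 10^6 bytes, which the numbers above do not specify. With 2^20 bytes per MB the result is about 134 bytes, so "about 130" holds either way. SeekCacheOverhead is just an illustrative name.

```java
class SeekCacheOverhead {
    // Number of cached pairs when every interval-th term is kept.
    static long cachedPairs(long termCount, int interval) {
        return termCount / interval;
    }

    // Average extra bytes attributable to each cached pair.
    static double bytesPerPair(long extraBytes, long pairs) {
        return (double) extraBytes / pairs;
    }

    public static void main(String[] args) {
        long pairs = cachedPairs(50_000_000L, 32);   // 1,562,500 pairs
        long extraBytes = (800L - 600L) * 1_000_000L; // assumption: MB = 10^6 bytes
        System.out.println(pairs + " pairs, "
                + bytesPerPair(extraBytes, pairs) + " bytes each");
        // prints "1562500 pairs, 128.0 bytes each"
    }
}
```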

Thanks for helping,
Toke Eskildsen

