On Sat, Feb 16, 2013 at 7:05 AM, Sebastiano Vigna <[email protected]> wrote:
> On 16 February 2013 11:45, Robert Muir <[email protected]> wrote:
>
>> But forcing that wouldn't be testing the 4.1 index format, it would be
>> something else (something not interesting).
>
>
> Do you mind if I have my own share of knowledge and have my idea about
> interesting benchmarks? :)
>
> You didn't answer, but the subtext *seems* to be that counts are no
> longer interleaved. Again, is that the case?
>
> Forcing a count is an essential test for the index efficiency, as you need
> counts to do scoring. Testing with a scorer is not a good idea because the
> scorer CPU usage is difficult to control across different implementations.
> So the only way of testing a non-interleaved index against an interleaved
> index (or comparing the speed of count access against a non-interleaved
> index) is to force a count reading without any other activity.

I think you are missing my point: this interleaving is part of the
whole design of this postings format. You can't just turn it off and
force it to always use FOR; for that you would need a new postings
format with a different design to match (it would need to encode the
term dictionary, skip data and other things differently, and track
offsets within FOR blocks).

For example, by only recording full FOR blocks of 128 that are not
shared across terms, there are fewer pointers (e.g. in the term
dictionary) and less upkeep and hassle. That's an important part of
the design of this format: the block encoding doesn't need to worry
about being efficient for all these low-frequency terms, so we just
continue to encode them pretty much as we did before. It makes the
index slightly larger, but simplifies the case that does matter, and
after all, these low-frequency terms aren't bottlenecking anyone
anyway.
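
As a rough illustration of the "full blocks only" rule (a sketch, not
Lucene's actual code): the block size of 128 comes from this thread,
while the function name split_postings and the use of plain Python
lists are assumptions made for the example.

```python
BLOCK_SIZE = 128  # block size discussed in this thread


def split_postings(doc_ids):
    """Split one term's postings into full blocks and a tail.

    Only complete blocks of BLOCK_SIZE entries are candidates for FOR;
    whatever is left over (fewer than BLOCK_SIZE entries) is the tail
    that falls back to variable-length ints. Blocks are never shared
    between terms, so this works on a single term's list.
    """
    cut = (len(doc_ids) // BLOCK_SIZE) * BLOCK_SIZE
    blocks = [doc_ids[i:i + BLOCK_SIZE] for i in range(0, cut, BLOCK_SIZE)]
    tail = doc_ids[cut:]
    return blocks, tail


if __name__ == "__main__":
    blocks, tail = split_postings(list(range(300)))
    print(len(blocks), len(tail))   # two full blocks, 44 postings in the tail
    blocks, tail = split_postings([3, 17, 42])
    print(len(blocks), len(tail))   # low-frequency term: no blocks, tail only
```

A term with fewer than 128 postings never produces a block at all, which
is why the low-frequency case stays on the old encoding path.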

>
> So essentially you code every block of 128 postings using FOR, but fall
> back to VByte for the tail ( <128). For low-frequency terms, this means just
> VByte. Right?
>

That's right. Also keep in mind: in the FOR case the blocks themselves
are interleaved, so you have a block of 128 doc deltas, then a block
of 128 freqs, then 128 doc deltas again, then 128 freqs. Finally, the
vint remainder is docs and freqs interleaved as vints.
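
To make that layout concrete, here is a small sketch of the order in
which the pieces would appear for one term. It is an illustration
only: the section labels, the function names layout and vint, and the
plain alternating doc/freq vints in the tail are assumptions for the
example, and the real format may pack the tail differently. The FOR
blocks are left as lists rather than bit-packed.

```python
BLOCK_SIZE = 128  # block size discussed in this thread


def vint(n):
    """Encode a non-negative int using 7 bits per byte.

    The high bit of each byte is set when more bytes follow.
    """
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def layout(doc_ids, freqs):
    """Return the sections of one term's postings in on-disk order."""
    # Doc IDs are stored as gaps from the previous doc ID.
    deltas = [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]
    cut = (len(deltas) // BLOCK_SIZE) * BLOCK_SIZE
    sections = []
    # Full blocks: 128 doc deltas, then the matching 128 freqs, repeated.
    for i in range(0, cut, BLOCK_SIZE):
        sections.append(("FOR_DOC_DELTAS", deltas[i:i + BLOCK_SIZE]))
        sections.append(("FOR_FREQS", freqs[i:i + BLOCK_SIZE]))
    # Remainder: doc delta and freq alternate, each written as a vint.
    tail = bytearray()
    for d, f in zip(deltas[cut:], freqs[cut:]):
        tail += vint(d) + vint(f)
    if tail:
        sections.append(("VINT_TAIL", bytes(tail)))
    return sections


if __name__ == "__main__":
    docs = list(range(1, 261))           # 260 postings: 2 blocks + 4 in the tail
    for name, payload in layout(docs, [1] * 260):
        print(name, len(payload))
```

Reading only the counts from the block part means skipping over each
doc-delta block to reach the freq block that follows it, whereas in
the tail every freq sits right next to its doc delta.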
