Re: Attributes, DocConsumer, Flexible Indexing, etc.

Michael McCandless Wed, 05 Aug 2009 15:11:14 -0700

So basically you could build all of this out on top of the existing
payloads extensibility?  It sounds promising!
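
To make the proposal concrete, here is a minimal sketch of an attribute
that (de)serializes itself to a byte array suitable for a payload. It
does not use any Lucene classes; the class name WeightedTagAttributeSketch
and its two fields are hypothetical, chosen only for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical attribute: a small tag id plus a float weight.
// Not a Lucene class; it only illustrates the encode/decode idea.
final class WeightedTagAttributeSketch {
    // Fixed-width layout: 2 bytes for the tag id, 4 bytes for the weight.
    static final int ENCODED_LENGTH = 2 + 4;

    final short tagId;
    final float weight;

    WeightedTagAttributeSketch(short tagId, float weight) {
        this.tagId = tagId;
        this.weight = weight;
    }

    // Serialize to the byte array that would be stored as the payload.
    byte[] encode() {
        return ByteBuffer.allocate(ENCODED_LENGTH)
                .putShort(tagId)
                .putFloat(weight)
                .array();
    }

    // Rebuild the attribute from payload bytes read back at search time.
    static WeightedTagAttributeSketch decode(byte[] payload) {
        ByteBuffer in = ByteBuffer.wrap(payload);
        return new WeightedTagAttributeSketch(in.getShort(), in.getFloat());
    }
}
```

Because the layout is fixed-width, an application that knows what is
stored would not strictly need a length prefix, which is the saving
Michael describes below.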


LUCENE-1458 also enables neat low-level index format changes like
pulsing (rare terms inline their postings directly into the terms
dict, which saves 2 seeks) and [eventually] PForDelta (more
cpu-friendly int array encoding than vInt).
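
For anyone not familiar with vInt, a rough sketch of the encoding follows
(7 data bits per byte, high bit set when more bytes follow). This is a
standalone illustration, not Lucene's own code; the class name VIntSketch
is made up. The per-byte branch in the decode loop is the sort of cost a
block-oriented encoding like PForDelta is meant to avoid.

```java
import java.io.ByteArrayOutputStream;

// Standalone illustration of variable-length int encoding (vInt).
final class VIntSketch {
    // Write 7 bits at a time, low-order bits first; the high bit of each
    // byte signals that another byte follows.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Read bytes until one arrives without the continuation bit.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // data-dependent branch on every byte
            }
            shift += 7;
        }
        return value;
    }
}
```

Small values take one byte and larger ones up to five, so the format is
compact, but the decoder cannot know where the next int starts without
inspecting every byte.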

Mike

On Wed, Aug 5, 2009 at 5:55 PM, Grant Ingersoll<gsing...@apache.org> wrote:
>
> On Aug 5, 2009, at 4:35 PM, Michael Busch wrote:
>
>> On 8/5/09 1:07 PM, Grant Ingersoll wrote:
>>>
>>> Hmmm, OK.
>>>
>>> Random, somewhat uneducated thought:  Why not just define the codecs to
>>> create byte arrays?  Then we can use the existing payload capability much
>>> like I do with the DelimitedPayloadTokenFilter.   We'd probably have to make
>>> sure this still worked with Similarity, but it seems like it could.
>>>  Thinking on this some more, seems like this could work already with an
>>> AttributePayloadEncoder or something like an AttributeToPayloadTokenFilter
>>> (I know, horrible name).  Then, on the Query side, the AttributeTermQuery is
>>> just a glorified BoostingTermQuery with some callback hooks for dealing with
>>> the Attribute (but maybe that isn't even needed), either that or we just
>>> provide helper methods to the Similarity class so that people can easily
>>> decode the byte array into an Attribute.  In fact, maybe all that needs to
>>> happen is the Attributes need to define encode/decode methods that
>>> (de)serialize a byte array.
>>>
>>> Seems like this approach would require very little in the way of changes
>>> to Lucene, but I admit it isn't fully baked in my mind just yet.  It also
>>> has the nice benefit that all the work we did on Payloads isn't wasted.
>>>
>>> This is resonating more and more with me.  What do you think?
>>>
>>
>> Well I think this would be a nice way of using the payloads better.
>>
>> However, the idea behind flexible indexing is that you can customize the
>> on-disk encoding in a way that it is as efficient as it can be for your
>> particular use case. E.g. for payloads we currently have to encode the
>> length. An application might not have to do that if it knows exactly what is
>> stored.
>> Then there's only the Payload API that returns you a byte array. It
>> basically copies the contents of the IndexInput (usually a
>> BufferedIndexInput, which means array copy from the byte buffer to the
>> payload byte array). If the application knows exactly what is stored it can
>> read/decode it more efficiently.
>
> Yeah, but really are you saving that much?  4 bytes per token?  It's not
> like you are saving much in terms of seeks, since you are already there
> anyway.  The only downside I see is a slightly larger index.  Would be
> interesting to try it out and see.
>
>
>
>
>>
>> The latter inefficiency we could solve by improving the payloads API: it
>> could return an IndexInput instead of the byte array and the caller could
>> consume it more efficiently.
>
> This is also interesting, but again requires some changes.  With what I'm
> proposing, I think it could be done very simply w/o any API changes, and we
> just need to expose some of the IndexInput/Output helper classes a bit more
> to make it easier for people to encode/decode their stuff.  Then, just
> documentation and some more Boosting*Query (Peter has already done
> BoostingNearQuery) and I think you have a pretty good flexible indexing AND
> searching capability all in a back compatible way using our existing code.
>
>>
>> So I agree that we could use Attributes to make the payloads feature
>> more usable, but I don't think it will be a replacement for flexible
>> indexing.
>
>
>
>>
>> Michael
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>
