Re: Flexible indexing (was: Re: [jira] Commented: (LUCENE-755) Payloads)

Grant Ingersoll Sat, 10 Mar 2007 15:54:19 -0800

Hi Michael,

This is very good. I know 662 is different, just wasn't sure if Nicolas' patch was meant to be applied after 662, b/c I know we had discussed this before.

I do agree with you about planning this out, but I also know that patches seem to motivate people the best and provide a certain concreteness to it all. I mostly started asking questions on these two issues b/c I wanted to spur some more discussion and see if we can get people motivated to move on it.

I was hoping that I would be able to apply each patch to two different checkouts so I could start seeing where the overlap is and how they could fit together (I also admit I was procrastinating on my ApacheCon talk...). In the new, flexible world, the payloads implementation could be a separate implementation of the indexing or it could be part of the core/existing file format implementation. Sometimes I just need to get my hands on the code to get a real feel for what I feel is the best way to do it.

I agree about the XML storage for Index information. We do that in our in-house wrapper around Lucene, storing info about the language, analyzer used, etc. We may also want a binary index-level storage capability. I know most people usually just create a single document to store binary info about the index, but a binary storage might be good too.

Part of me says to apply the Payloads patch now, as it provides a lot of bang for the buck and I think the FI is going to take a lot longer to hash out. However, I know that it may pin us in or force us to change things for FI. Ultimately, I would love to see both these features for the next release, but that isn't a requirement. Also, on FI, I would love to see two different implementations of whatever API we choose before releasing it, as I always find two implementations of an Interface really work out the API details.


-Grant


On Mar 10, 2007, at 6:27 PM, Michael Busch wrote:

Hi Grant,

LUCENE-662 contains different ideas:
1) introduction of an index format concept
2) extensibility of the store reader/writer
3) New: extensibility of the posting reader/writer
IMO we should split this up, that way it will be easier to develop smaller patches that focus on adding one particular feature. However, it is important to plan the API, so that different patches (like payloads) fit in. On the other hand it will be nearly impossible to plan an API that is perfect and won't change anymore without having the actual implementations. Therefore I suggest the following steps:
a) define the different work items of flexible indexing
b) roughly plan an API that fits all items
c) develop the different items, commit them but with APIs that are either protected or marked as experimental
d) after all items are completed and committed (and hopefully tested by some brave community members ;)) finalize the API and remove experimental comments (or make public)
Let's start with a):
The following items come to my mind (please feel free to add/remove/complain):
- Introduce index-level metadata. Preferably in XML format, so it will be human readable. Later on, we can store information about the index format in this file, like the codecs that are used to store the data. We should also make this public, so that users can store their own index metadata. (Remark: LUCENE-783 is also a neat idea, we can write one xml parser for both items)
- Introduce index format. Nicolas has already written a lot of code in this regard! It will include different interfaces for the different extension points (FieldsFormat, PostingFormat, DictionaryFormat). We can use the xml file to store which actual formats are used in the corresponding index.
- Implement the different extensions. LUCENE-662 includes an extensible FieldsWriter, LUCENE-755 the payloads feature. Doug and Ning already suggested nice interfaces for PostingFormat and DictionaryFormat in the payloads thread on java-dev.
- Write standard implementations for the different formats. The wiki already has a list of desired posting formats.
I suggest we should finalize this list first. Then I will add this list to the wiki under Flexible indexing and gather information from the different discussions on java-dev which I already mentioned. Then we should discuss the different items of this list in greater depth and plan the APIs (step b) ). And then we're already ready for step c) and the fun starts :-).
Michael


Grant Ingersoll wrote:
I think it makes the most sense to get flexible indexing in first, and then make payloads work with it. On the other hand, payloads looked pretty straightforward to me, whereas FI is much more involved (or at least it feels that way).
As it is right now, I would like to at least review the two patches and start thinking about them in greater depth. The payloads patch needs a little more work in that I want to integrate it with the Similarity class so people can customize their scoring.
-Grant

On Mar 10, 2007, at 9:30 AM, Nicolas Lalevée (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479841 ]
Nicolas Lalevée commented on LUCENE-755:
----------------------------------------

Grant>
The patch I have proposed here has no dependency on LUCENE-662, I just "imported" some ideas from it and put them there. Since LUCENE-662 has evolved, the patches will probably conflict. The best one to use here is Michael's. I think it won't conflict with LUCENE-662. And if both are intended to be committed, then the best is to commit both separately and redo the work I have done with the provided patch (I remember that it was quite easy).
Payloads
--------

                Key: LUCENE-755
                URL: https://issues.apache.org/jira/browse/LUCENE-755
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Index
           Reporter: Michael Busch
        Assigned To: Michael Busch
        Attachments: payload.patch, payloads.patch
This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications, that make this new feature easier to use and more efficient.
A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way.
API and Usage
------------------------------
The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
  /** Sets this Token's payload. */
  public void setPayload(Payload payload);

  /** Returns this Token's payload. */
  public Payload getPayload();
In order to retrieve the data from the index the interface TermPositions now offers two new methods:
  /** Returns the payload length of the current term position.
   *  This is invalid until {@link #nextPosition()} is called for
   *  the first time.
   *
   * @return length of the current payload in number of bytes
   */
  int getPayloadLength();

  /** Returns the payload data of the current term position.
   * This is invalid until {@link #nextPosition()} is called for
   * the first time.
   * This method must not be called more than once after each call
   * of {@link #nextPosition()}. However, payloads are loaded lazily,
   * so if the payload data for the current position is not needed,
   * this method may not be called at all for performance reasons.
   *
   * @param data the array into which the data of this payload is to be
   *             stored, if it is big enough; otherwise, a new byte[] array
   *             is allocated for this purpose.
   * @param offset the offset in the array into which the data of this payload
   *               is to be stored.
   * @return a byte[] array containing the data of this payload
   * @throws IOException
   */
  byte[] getPayload(byte[] data, int offset) throws IOException;
Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes()-method without an offset argument.
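The shared-buffer idea behind index.Payload can be sketched roughly as follows. This is only an illustration under assumptions: the class name PayloadSketch, its fields, and toByteArray() are made up here and are not the classes or methods from the patch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the patch's index.Payload; not the actual class. */
final class PayloadSketch {
    final byte[] data;   // shared backing array, referenced rather than copied
    final int offset;    // start of this payload within data
    final int length;    // number of bytes that belong to this payload

    PayloadSketch(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > data.length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    /** Copies just this payload's bytes out of the shared array. */
    byte[] toByteArray() {
        byte[] copy = new byte[length];
        System.arraycopy(data, offset, copy, 0, length);
        return copy;
    }

    public static void main(String[] args) {
        // One buffer holds the payloads of two tokens back to back.
        byte[] shared = "NNVB".getBytes(StandardCharsets.US_ASCII);
        PayloadSketch first = new PayloadSketch(shared, 0, 2);
        PayloadSketch second = new PayloadSketch(shared, 2, 2);
        System.out.println(new String(first.toByteArray(), StandardCharsets.US_ASCII));
        System.out.println(new String(second.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

Both payloads reference the same array, so only the two small wrapper objects are allocated per token; the byte data itself is allocated once per document.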
Implementation details
------------------------------
- One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
  * The DocumentWriter enables payloads for a field, if one or more Tokens carry payloads.
  * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
- Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
- Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
- Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. the bit is false, then the payload has the same length as the payload of the previous term occurrence.
- In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
- Payloads are loaded lazily. When a user calls TermPositions.nextPosition() then only the position and the payload length are loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
Changes of file formats
------------------------------
- FieldInfos (.fnm)
The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field.
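A minimal sketch of testing and setting that bit. The value 0x20 comes from the description above; the class name, constant name, and method names are assumptions made for illustration and need not match the patch.

```java
final class FieldBitsSketch {
    // 0x20 is the sixth lowest-order bit of FieldBits; the constant name is hypothetical.
    static final byte STORE_PAYLOADS = 0x20;

    /** True if the payload bit is set in the given FieldBits byte. */
    static boolean hasPayloads(byte fieldBits) {
        return (fieldBits & STORE_PAYLOADS) != 0;
    }

    /** Returns the FieldBits byte with the payload bit set, other bits untouched. */
    static byte enablePayloads(byte fieldBits) {
        return (byte) (fieldBits | STORE_PAYLOADS);
    }
}
```

Because the other bits are left untouched, a reader that does not know about payloads still sees the same values in the bits it understands.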
- ProxFile (.prx)
ProxFile (.prx) -->  <TermPositions>^TermCount
TermPositions   --> <Positions>^DocFreq
Positions       --> <PositionDelta, Payload?>^Freq
Payload         --> <PayloadLength?, PayloadData>
PositionDelta   --> VInt
PayloadLength   --> VInt
PayloadData     --> byte^PayloadLength
For payloads disabled (unchanged):
PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
For payloads enabled:
PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
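The payload-enabled position encoding can be sketched as below. This is an illustration under assumptions, not the code from the patch: the class and method names are made up, the buffer is an in-memory byte array instead of an IndexOutput, and the "previous length" is assumed to start at -1 so that the first payload length is always written.

```java
import java.io.ByteArrayOutputStream;

final class ProxSketch {
    /** Writes a non-negative int as a VInt: 7 bits per byte, high bit marks continuation. */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** Reads a VInt starting at pos[0] and advances pos[0] past it. */
    static int readVInt(byte[] in, int[] pos) {
        byte b = in[pos[0]++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[pos[0]++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    /** Encodes the positions of one term in one document; payloads[k] belongs to positions[k]. */
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: forces the first length to be written
        for (int k = 0; k < positions.length; k++) {
            int delta = positions[k] - lastPosition;
            int length = payloads[k].length;
            if (length != lastLength) {
                writeVInt(out, (delta << 1) | 1); // odd: PayloadLength follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, delta << 1);       // even: same length as previous payload
            }
            out.write(payloads[k], 0, length);    // PayloadData
            lastPosition = positions[k];
        }
        return out.toByteArray();
    }

    /** Decodes only the positions, skipping the payload bytes without reading them. */
    static int[] decodePositions(byte[] prox, int freq) {
        int[] positions = new int[freq];
        int[] pos = {0};
        int position = 0;
        int length = 0;
        for (int k = 0; k < freq; k++) {
            int code = readVInt(prox, pos);
            position += code >>> 1;           // PositionDelta/2
            if ((code & 1) != 0) {
                length = readVInt(prox, pos); // explicit PayloadLength
            }
            pos[0] += length;                 // payload not requested: skip it
            positions[k] = position;
        }
        return positions;
    }
}
```

In decodePositions the payload bytes are stepped over rather than copied, which mirrors the lazy-loading behaviour described above: the length is always known after reading the position, so the data can be skipped cheaply.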
- FreqFile (.frq)
SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
PayloadLength --> VInt
For payloads disabled (unchanged):
DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
For payloads enabled:
DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
This encoding is space efficient for different use cases:
* If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
* If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
* If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length compression since we only have to store the length zero for the empty payloads once per term.
All unit tests pass.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





