Re: [jira] Updated: (LUCENE-755) Payloads

Grant Ingersoll Sun, 11 Mar 2007 17:13:05 -0800

Cool. I will try and take a look at it tomorrow. Since we have the lazy SegTermPos thing in now, we should be able to integrate this into scoring via the Similarity and merge TermDocs and TermPositions like you suggested.

If I can get the Scoring piece in and people are fine w/ the flushBuffer change then hopefully we can get this in this week. I will try to post a patch that includes your patch and the scoring integration by tomorrow or Tuesday if that is fine with you.


-Grant

On Mar 11, 2007, at 8:35 PM, Michael Busch (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch

I'm attaching the new patch with the following changes:
- applies cleanly on the current trunk
- fixed a bug in FSDirectory which affected payloads with length greater than 1024 bytes and extended testcase TestPayloads to test this fix
- added the following warning comments to the new APIs:
  * Warning: The status of the Payloads feature is experimental. The APIs
  * introduced here might change in the future and will not be supported anymore
  * in such a case. If you want to use this feature in a production environment
  * you should wait for an official release.
Another comment about an API change: In BufferedIndexOutput I changed the method
  protected abstract void flushBuffer(byte[] b, int len) throws IOException;
to
  protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;
which means that subclasses of BufferedIndexOutput won't compile anymore. I made this change for performance reasons: If a payload is longer than 1024 bytes (standard buffer size of BufferedIndexOutput) then it can be flushed efficiently to disk without having to perform array copies.
Is this API change acceptable? Users who have custom subclasses of BufferedIndexOutput would have to change their classes in order to work.
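To make the trade-off concrete, here is a minimal sketch of why an offset argument lets a large write skip the intermediate copy. It does not use Lucene's classes: SketchBufferedOutput and SketchMemoryOutput are hypothetical stand-ins, and the bypass strategy in writeBytes is an assumption for illustration, not the code in the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical stand-in for BufferedIndexOutput; NOT the Lucene class.
abstract class SketchBufferedOutput {
    static final int BUFFER_SIZE = 1024; // buffer size mentioned in the message
    private final byte[] buffer = new byte[BUFFER_SIZE];
    private int bufferPosition = 0;

    // New-style signature: the caller says where in b the data starts.
    protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;

    public void writeBytes(byte[] b, int offset, int length) throws IOException {
        if (length >= BUFFER_SIZE) {
            // Large write (assumed strategy): flush what is buffered, then hand the
            // caller's array straight to flushBuffer, with no intermediate copy.
            flush();
            flushBuffer(b, offset, length);
        } else {
            if (length > BUFFER_SIZE - bufferPosition) {
                flush(); // not enough room left in the buffer
            }
            System.arraycopy(b, offset, buffer, bufferPosition, length);
            bufferPosition += length;
        }
    }

    public void flush() throws IOException {
        flushBuffer(buffer, 0, bufferPosition);
        bufferPosition = 0;
    }
}

// A subclass only has to honor the offset when it writes.
class SketchMemoryOutput extends SketchBufferedOutput {
    final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    @Override
    protected void flushBuffer(byte[] b, int offset, int len) throws IOException {
        sink.write(b, offset, len);
    }
}
```

With the old two-argument signature, a write that starts in the middle of the caller's array would first have to be copied so that the data begins at index 0. For an existing subclass, the migration amounts to adding the offset parameter and passing it through to the underlying write call.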
Payloads
--------

                Key: LUCENE-755
                URL: https://issues.apache.org/jira/browse/LUCENE-755
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Index
           Reporter: Michael Busch
        Assigned To: Michael Busch
        Attachments: payload.patch, payloads.patch, payloads.patch
This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications that make this new feature easier to use and more efficient.

A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way.
API and Usage
------------------------------
The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.

In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
  /** Sets this Token's payload. */
  public void setPayload(Payload payload);

  /** Returns this Token's payload. */
  public Payload getPayload();
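As an illustration of the "one array for all payloads" idea, here is a small sketch. SketchPayload is a hypothetical stand-in for index.Payload (the real constructor and accessors may differ), and SharedPayloadBufferDemo is made up for this example.

```java
// Hypothetical stand-in for index.Payload: a view (array, offset, length).
// The real class's constructor and accessors may differ.
final class SketchPayload {
    final byte[] data;
    final int offset;
    final int length;

    SketchPayload(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > data.length) {
            throw new IllegalArgumentException("view exceeds backing array");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    // Returns the i-th byte of this payload, relative to its own offset.
    byte byteAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        return data[offset + i];
    }
}

class SharedPayloadBufferDemo {
    // Builds one payload per token, all backed by a single byte array.
    static SketchPayload[] payloadsFor(int tokenCount, int bytesPerPayload) {
        byte[] shared = new byte[tokenCount * bytesPerPayload]; // one allocation
        SketchPayload[] result = new SketchPayload[tokenCount];
        for (int i = 0; i < tokenCount; i++) {
            int start = i * bytesPerPayload;
            shared[start] = (byte) i; // illustrative content: the token index
            result[i] = new SketchPayload(shared, start, bytesPerPayload);
        }
        return result;
    }
}
```

In a real TokenStream, each such payload would then be attached to its Token via setPayload.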
In order to retrieve the data from the index the interface TermPositions now offers two new methods:
  /** Returns the payload length of the current term position.
   *  This is invalid until {@link #nextPosition()} is called for
   *  the first time.
   *
   * @return length of the current payload in number of bytes
   */
  int getPayloadLength();

  /** Returns the payload data of the current term position.
   * This is invalid until {@link #nextPosition()} is called for
   * the first time.
   * This method must not be called more than once after each call
   * of {@link #nextPosition()}. However, payloads are loaded lazily,
   * so if the payload data for the current position is not needed,
   * this method may not be called at all for performance reasons.
   *
   * @param data the array into which the data of this payload is to be
   *             stored, if it is big enough; otherwise, a new byte[] array
   *             is allocated for this purpose.
   * @param offset the offset in the array into which the data of this payload
   *               is to be stored.
   * @return a byte[] array containing the data of this payload
   * @throws IOException
   */
  byte[] getPayload(byte[] data, int offset) throws IOException;
Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes() method without an offset argument.
Implementation details
------------------------------
- One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
  * The DocumentWriter enables payloads for a field, if one or more Tokens carry payloads.
  * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
- Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change.
- Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
- Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. the bit is false, then the payload has the same length as the payload of the previous term occurrence.
- In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
- Payloads are loaded lazily. When a user calls TermPositions.nextPosition() then only the position and the payload length are loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
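The lazy loading and same-length compression described above can be sketched roughly as follows. This is an illustration only: LazyPositionsSketch is a hypothetical class that reads from an in-memory byte array holding the entries of one term in one document, it assumes payloads are enabled, and it is not the SegmentTermPositions code from the patch.

```java
import java.util.Arrays;

// Hypothetical reader over an in-memory copy of one term's position data for a
// single document; names and structure are illustrative, not Lucene's code.
class LazyPositionsSketch {
    private final byte[] prox;              // encoded <PositionDelta, Payload?> entries
    private int pointer = 0;                // next unread byte
    private int position = 0;               // current absolute position
    private int payloadLength = 0;          // length of the current payload
    private boolean payloadPending = false; // payload bytes not consumed yet

    LazyPositionsSketch(byte[] prox) {
        this.prox = prox;
    }

    // Variable-length int: 7 data bits per byte, high bit means "more follows".
    private int readVInt() {
        int b = prox[pointer++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = prox[pointer++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Reads only the position and, if present, the payload length.
    int nextPosition() {
        if (payloadPending) {
            pointer += payloadLength; // payload was never requested: skip it
        }
        int delta = readVInt();
        if ((delta & 1) != 0) {
            payloadLength = readVInt(); // odd: an explicit length follows
        }                               // even: same length as previous payload
        position += delta >>> 1;
        payloadPending = true;
        return position;
    }

    int getPayloadLength() {
        return payloadLength;
    }

    // Copies the payload bytes out; allowed once per nextPosition() call.
    byte[] getPayload() {
        if (!payloadPending) {
            throw new IllegalStateException("payload already read or no position");
        }
        byte[] result = Arrays.copyOfRange(prox, pointer, pointer + payloadLength);
        pointer += payloadLength;
        payloadPending = false;
        return result;
    }
}
```

Unlike the proposed getPayload(byte[] data, int offset), this sketch always allocates a fresh array, which keeps the example short but gives up the buffer reuse the real signature allows.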
Changes of file formats
------------------------------
- FieldInfos (.fnm)
The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field.
- ProxFile (.prx)
ProxFile (.prx) -->  <TermPositions>^TermCount
TermPositions   --> <Positions>^DocFreq
Positions       --> <PositionDelta, Payload?>^Freq
Payload         --> <PayloadLength?, PayloadData>
PositionDelta   --> VInt
PayloadLength   --> VInt
PayloadData     --> byte^PayloadLength
For payloads disabled (unchanged):
PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
For payloads enabled:
PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
- FreqFile (.frq)
SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
PayloadLength --> VInt
For payloads disabled (unchanged):
DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
For payloads enabled:
DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
This encoding is space efficient for different use cases:
  * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
  * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
  * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length compression since we only have to store the length zero for the empty payloads once per term.
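For readers who want to see the ProxFile encoding in action, here is a minimal writer sketch for the payloads-enabled case. ProxEncodingSketch is a hypothetical class, and starting with a "previous length" of -1 (so that the first entry always stores its length) is an assumption, not something stated in the issue.

```java
import java.io.ByteArrayOutputStream;

// Illustrative writer for the <PositionDelta, Payload?> entries of one term in
// one document, with payloads enabled. Not the DocumentWriter code.
class ProxEncodingSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int lastPosition = 0;
    private int lastPayloadLength = -1; // assumption: forces an explicit length first

    // Variable-length int: 7 data bits per byte, high bit means "more follows".
    private void writeVInt(int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    void addPosition(int position, byte[] payload) {
        int delta = position - lastPosition;
        lastPosition = position;
        if (payload.length == lastPayloadLength) {
            writeVInt(delta << 1);       // even: PayloadLength omitted
        } else {
            writeVInt((delta << 1) | 1); // odd: PayloadLength follows
            writeVInt(payload.length);
            lastPayloadLength = payload.length;
        }
        out.write(payload, 0, payload.length); // PayloadData
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }
}
```

For positions 3, 5 and 9 with payloads of 2, 2 and 1 bytes, the stored PositionDelta values come out as 7, 4 and 9: the first and third carry an explicit length, while the second reuses the length of 2. With 100 consecutive positions that all carry 4-byte payloads, the length is written only once, so the overhead over the raw payload data is one byte per position plus a single length byte.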
All unit tests pass.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


