[ http://issues.apache.org/jira/browse/LUCENE-755?page=all ]
Michael Busch updated LUCENE-755: --------------------------------- Attachment: payloads.patch > Payloads > -------- > > Key: LUCENE-755 > URL: http://issues.apache.org/jira/browse/LUCENE-755 > Project: Lucene - Java > Issue Type: New Feature > Components: Index > Reporter: Michael Busch > Assigned To: Michael Busch > Attachments: payloads.patch > > > This patch adds the possibility to store arbitrary metadata (payloads) > together with each position of a term in its posting lists. A while ago this > was discussed on the dev mailing list, where I proposed an initial design. > This patch has a much improved design with modifications, that make this new > feature easier to use and more efficient. > A payload is an array of bytes that can be stored inline in the ProxFile > (.prx). Therefore this patch provides low-level APIs to simply store and > retrieve byte arrays in the posting lists in an efficient way. > API and Usage > ------------------------------ > The new class index.Payload is basically just a wrapper around a byte[] array > together with int variables for offset and length. So a user does not have to > create a byte array for every payload, but can rather allocate one array for > all payloads of a document and provide offset and length information. This > reduces object allocations on the application side. > In order to store payloads in the posting lists one has to provide a > TokenStream or TokenFilter that produces Tokens with payloads. I added the > following two methods to the Token class: > /** Sets this Token's payload. */ > public void setPayload(Payload payload); > > /** Returns this Token's payload. */ > public Payload getPayload(); > In order to retrieve the data from the index the interface TermPositions now > offers two new methods: > /** Returns the payload length of the current term position. > * This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for > * the first time. > * > * @return length of the current payload in number of bytes > */ > int getPayloadLength(); > > /** Returns the payload data of the current term position. > * This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for > * the first time. > * This method must not be called more than once after each call > * of [EMAIL PROTECTED] #nextPosition()}. However, payloads are loaded > lazily, > * so if the payload data for the current position is not needed, > * this method may not be called at all for performance reasons. > * > * @param data the array into which the data of this payload is to be > * stored, if it is big enough; otherwise, a new byte[] array > * is allocated for this purpose. > * @param offset the offset in the array into which the data of this payload > * is to be stored. > * @return a byte[] array containing the data of this payload > * @throws IOException > */ > byte[] getPayload(byte[] data, int offset) throws IOException; > Furthermore, this patch indroduces the new method > IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was > only a writeBytes()-method without an offset argument. > Implementation details > ------------------------------ > - One field bit in FieldInfos is used to indicate if payloads are enabled for > a field. The user does not have to enable payloads for a field, this is done > automatically: > * The DocumentWriter enables payloads for a field, if one ore more Tokens > carry payloads. > * The SegmentMerger enables payloads for a field during a merge, if > payloads are enabled for that field in one or more segments. > - Backwards compatible: If payloads are not used, then the formats of the > ProxFile and FreqFile don't change > - Payloads are stored inline in the posting list of a term in the ProxFile. A > payload of a term occurrence is stored right after its PositionDelta. > - Same-length compression: If payloads are enabled for a field, then the > PositionDelta is shifted one bit. The lowest bit is used to indicate whether > the length of the following payload is stored explicitly. If not, i. e. the > bit is false, then the payload has the same length as the payload of the > previous term occurrence. > - In order to support skipping on the ProxFile the length of the payload at > every skip point has to be known. Therefore the payload length is also stored > in the skip list located in the FreqFile. Here the same-length compression is > also used: The lowest bit of DocSkip is used to indicate if the payload > length is stored for a SkipDatum or if the length is the same as in the last > SkipDatum. > - Payloads are loaded lazily. When a user calls TermPositions.nextPosition() > then only the position and the payload length is loaded from the ProxFile. If > the user calls getPayload() then the payload is actually loaded. If > getPayload() is not called before nextPosition() is called again, then the > payload data is just skipped. > > Changes of file formats > ------------------------------ > - FieldInfos (.fnm) > The format of the .fnm file does not change. The only change is the use of > the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then > payloads are enabled for the corresponding field. > - ProxFile (.prx) > ProxFile (.prx) --> <TermPositions>^TermCount > TermPositions --> <Positions>^DocFreq > Positions --> <PositionDelta, Payload?>^Freq > Payload --> <PayloadLength?, PayloadData> > PositionDelta --> VInt > PayloadLength --> VInt > PayloadData --> byte^PayloadLength > For payloads disabled (unchanged): > PositionDelta is the difference between the position of the current > occurrence in the document and the previous occurrence (or zero, if this is > the first occurrence in this document). > > For Payloads enabled: > PositionDelta/2 is the difference between the position of the current > occurrence in the document and the previous occurrence. If PositionDelta is > odd, then PayloadLength is stored. If PositionDelta is even, then the length > of the current payload equals the length of the previous payload and thus > PayloadLength is omitted. > - FreqFile (.frq) > SkipDatum --> DocSkip, PayloadLength?, FreqSkip, ProxSkip > PayloadLength --> VInt > For payloads disabled (unchanged): > DocSkip records the document number before every SkipInterval th document in > TermFreqs. Document numbers are represented as differences from the previous > value in the sequence. > For payloads enabled: > DocSkip/2 records the document number before every SkipInterval th document > in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is > even, then the length of the payload at the current skip point equals the > length of the payload at the last skip point and thus PayloadLength is > omitted. > This encoding is space efficient for different use cases: > * If only some fields of an index have payloads, then there's no space > overhead for the fields with payloads disabled. > * If the payloads of consecutive term positions have the same length, then > the length only has to be stored once for every term. This should be a common > case, because users probably use the same format for all payloads. > * If only a few terms of a field have payloads, then we don't waste much > space because we benefit again from the same-length-compression since we only > have to store the length zero for the empty payloads once per term. > All unit tests pass. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]