Payloads
--------
Key: LUCENE-755
URL: https://issues.apache.org/jira/browse/
LUCENE-755
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Michael Busch
Assigned To: Michael Busch
Attachments: payload.patch, payloads.patch
This patch adds the possibility to store arbitrary metadata
(payloads) together with each position of a term in its posting
lists. A while ago this was discussed on the dev mailing list,
where I proposed an initial design. This patch has a much
improved design with modifications, that make this new feature
easier to use and more efficient.
A payload is an array of bytes that can be stored inline in the
ProxFile (.prx). Therefore this patch provides low-level APIs to
simply store and retrieve byte arrays in the posting lists in an
efficient way.
API and Usage
------------------------------
The new class index.Payload is basically just a wrapper around a
byte[] array together with int variables for offset and length.
So a user does not have to create a byte array for every
payload, but can rather allocate one array for all payloads of a
document and provide offset and length information. This reduces
object allocations on the application side.
In order to store payloads in the posting lists one has to
provide a TokenStream or TokenFilter that produces Tokens with
payloads. I added the following two methods to the Token class:
/** Sets this Token's payload. */
public void setPayload(Payload payload);
/** Returns this Token's payload. */
public Payload getPayload();
In order to retrieve the data from the index the interface
TermPositions now offers two new methods:
/** Returns the payload length of the current term position.
* This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for
* the first time.
*
* @return length of the current payload in number of bytes
*/
int getPayloadLength();
/** Returns the payload data of the current term position.
* This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for
* the first time.
* This method must not be called more than once after each call
* of [EMAIL PROTECTED] #nextPosition()}. However, payloads are loaded
lazily,
* so if the payload data for the current position is not needed,
* this method may not be called at all for performance reasons.
*
* @param data the array into which the data of this payload
is to be
* stored, if it is big enough; otherwise, a new
byte[] array
* is allocated for this purpose.
* @param offset the offset in the array into which the data
of this payload
* is to be stored.
* @return a byte[] array containing the data of this payload
* @throws IOException
*/
byte[] getPayload(byte[] data, int offset) throws IOException;
Furthermore, this patch indroduces the new method
IndexOutput.writeBytes(byte[] b, int offset, int length). So far
there was only a writeBytes()-method without an offset argument.
Implementation details
------------------------------
- One field bit in FieldInfos is used to indicate if payloads
are enabled for a field. The user does not have to enable
payloads for a field, this is done automatically:
* The DocumentWriter enables payloads for a field, if one ore
more Tokens carry payloads.
* The SegmentMerger enables payloads for a field during a
merge, if payloads are enabled for that field in one or more
segments.
- Backwards compatible: If payloads are not used, then the
formats of the ProxFile and FreqFile don't change
- Payloads are stored inline in the posting list of a term in
the ProxFile. A payload of a term occurrence is stored right
after its PositionDelta.
- Same-length compression: If payloads are enabled for a field,
then the PositionDelta is shifted one bit. The lowest bit is
used to indicate whether the length of the following payload is
stored explicitly. If not, i. e. the bit is false, then the
payload has the same length as the payload of the previous term
occurrence.
- In order to support skipping on the ProxFile the length of the
payload at every skip point has to be known. Therefore the
payload length is also stored in the skip list located in the
FreqFile. Here the same-length compression is also used: The
lowest bit of DocSkip is used to indicate if the payload length
is stored for a SkipDatum or if the length is the same as in the
last SkipDatum.
- Payloads are loaded lazily. When a user calls
TermPositions.nextPosition() then only the position and the
payload length is loaded from the ProxFile. If the user calls
getPayload() then the payload is actually loaded. If getPayload
() is not called before nextPosition() is called again, then the
payload data is just skipped.
Changes of file formats
------------------------------
- FieldInfos (.fnm)
The format of the .fnm file does not change. The only change is
the use of the sixth lowest-order bit (0x20) of the FieldBits.
If this bit is set, then payloads are enabled for the
corresponding field.
- ProxFile (.prx)
ProxFile (.prx) --> <TermPositions>^TermCount
TermPositions --> <Positions>^DocFreq
Positions --> <PositionDelta, Payload?>^Freq
Payload --> <PayloadLength?, PayloadData>
PositionDelta --> VInt
PayloadLength --> VInt
PayloadData --> byte^PayloadLength
For payloads disabled (unchanged):
PositionDelta is the difference between the position of the
current occurrence in the document and the previous occurrence
(or zero, if this is the first occurrence in this document).
For Payloads enabled:
PositionDelta/2 is the difference between the position of the
current occurrence in the document and the previous occurrence.
If PositionDelta is odd, then PayloadLength is stored. If
PositionDelta is even, then the length of the current payload
equals the length of the previous payload and thus PayloadLength
is omitted.
- FreqFile (.frq)
SkipDatum --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
PayloadLength --> VInt
For payloads disabled (unchanged):
DocSkip records the document number before every SkipInterval th
document in TermFreqs. Document numbers are represented as
differences from the previous value in the sequence.
For payloads enabled:
DocSkip/2 records the document number before every SkipInterval
th document in TermFreqs. If DocSkip is odd, then PayloadLength
follows. If DocSkip is even, then the length of the payload at
the current skip point equals the length of the payload at the
last skip point and thus PayloadLength is omitted.
This encoding is space efficient for different use cases:
* If only some fields of an index have payloads, then there's
no space overhead for the fields with payloads disabled.
* If the payloads of consecutive term positions have the same
length, then the length only has to be stored once for every
term. This should be a common case, because users probably use
the same format for all payloads.
* If only a few terms of a field have payloads, then we don't
waste much space because we benefit again from the same-length-
compression since we only have to store the length zero for the
empty payloads once per term.
All unit tests pass.