[
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481003
]
Grant Ingersoll commented on LUCENE-755:
----------------------------------------
OK, I've applied the patch. All tests pass for me. I think it looks
good. Have you run any benchmarks on it? I ran the standard one on
the patched version and on trunk, in a totally unscientific test. In
theory, the case with no payloads should perform very close to the
existing code, and this seems to be borne out by my run of the
micro-standard benchmark (ant run-task in contrib/benchmark). Once we
have this committed, someone can take a crack at adding support to the
benchmarker for payloads.
Payload should probably be serializable.
All in all, I think we could commit this, then add the search/
scoring capabilities like we've talked about. I like the
documentation/comments you have added, very useful. (One of these
days I will take on documenting the index package like I intend to,
so what you've added will be quite helpful!) We will/may want to
add, for example, a PayloadQuery and derivatives and a QueryParser
operator that supports searching in the payload, or possibly
boosting if a certain term has a certain type of payload (not that I
want anything to do with the QueryParser). Even beyond that,
SpanPayloadQuery, etc. I may have some cycles to actually
write some code for these next week.
Just throwing this out there, I'm not sure I really mean it or
not :-), but:
do you think it would be useful to consider restricting the size of
the payload? I know, I know, as soon as we put a limit on it,
someone will want to expand it, but I was thinking if we knew the
size had a limit we could better control the performance and caching,
etc. on the scoring/search side. I guess it is buyer beware, maybe
we put some javadocs on this.
Also, I started http://wiki.apache.org/lucene-java/Payloads as I
think we will want to have some docs explaining why Payloads are
useful in non-javadoc format.
On a side note, have a look at
http://wiki.apache.org/lucene-java/PatchCheckList to see if there is
anything you feel you can add.
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at
http://wiki.apache.org/jakarta-lucene/LuceneFAQ
> Payloads
> --------
>
> Key: LUCENE-755
> URL: https://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payload.patch, payloads.patch, payloads.patch,
> payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads)
> together with each position of a term in its posting lists. A while ago this
> was discussed on the dev mailing list, where I proposed an initial design.
> This patch has a much improved design, with modifications that make this new
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile
> (.prx). Therefore this patch provides low-level APIs to simply store and
> retrieve byte arrays in the posting lists in an efficient way.
> API and Usage
> ------------------------------
> The new class index.Payload is basically just a wrapper around a byte[] array
> together with int variables for offset and length. So a user does not have to
> create a byte array for every payload, but can rather allocate one array for
> all payloads of a document and provide offset and length information. This
> reduces object allocations on the application side.
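
The shared-array idea above can be illustrated with a minimal sketch. This is
not the class from the patch: PayloadSketch and its members are hypothetical
names, and the only thing taken from the description is the shape (a byte[]
plus an int offset and an int length).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for index.Payload: a (offset, length) view onto a shared byte[]. */
final class PayloadSketch {
    final byte[] data;
    final int offset;
    final int length;

    PayloadSketch(byte[] data, int offset, int length) {
        // Written this way to avoid int overflow in offset + length.
        if (offset < 0 || length < 0 || length > data.length - offset) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    /** Copies the slice out; only needed when a caller wants its own array. */
    byte[] toByteArray() {
        byte[] copy = new byte[length];
        System.arraycopy(data, offset, copy, 0, length);
        return copy;
    }

    public static void main(String[] args) {
        // One buffer allocated once for all payloads of a document
        // (the two-letter tags are made-up example data).
        byte[] shared = "NNVBDT".getBytes(StandardCharsets.US_ASCII);
        PayloadSketch first = new PayloadSketch(shared, 0, 2);
        PayloadSketch second = new PayloadSketch(shared, 2, 2);
        System.out.println(new String(first.toByteArray(), StandardCharsets.US_ASCII));  // prints NN
        System.out.println(new String(second.toByteArray(), StandardCharsets.US_ASCII)); // prints VB
    }
}
```

Both views point into the same array, so the application allocates one byte[]
per document rather than one per token.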
> In order to store payloads in the posting lists one has to provide a
> TokenStream or TokenFilter that produces Tokens with payloads. I added the
> following two methods to the Token class:
> /** Sets this Token's payload. */
> public void setPayload(Payload payload);
>
> /** Returns this Token's payload. */
> public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now
> offers two new methods:
> /** Returns the payload length of the current term position.
> * This is invalid until {@link #nextPosition()} is called for
> * the first time.
> *
> * @return length of the current payload in number of bytes
> */
> int getPayloadLength();
>
> /** Returns the payload data of the current term position.
> * This is invalid until {@link #nextPosition()} is called for
> * the first time.
> * This method must not be called more than once after each call
> * of {@link #nextPosition()}. However, payloads are loaded lazily,
> * so if the payload data for the current position is not needed,
> * this method may not be called at all for performance reasons.
> *
> * @param data the array into which the data of this payload is to be
> * stored, if it is big enough; otherwise, a new byte[] array
> * is allocated for this purpose.
> * @param offset the offset in the array into which the data of this payload
> * is to be stored.
> * @return a byte[] array containing the data of this payload
> * @throws IOException
> */
> byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was
> only a writeBytes() method without an offset argument.
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for
> a field. The user does not have to enable payloads for a field, this is done
> automatically:
> * The DocumentWriter enables payloads for a field, if one or more Tokens
> carry payloads.
> * The SegmentMerger enables payloads for a field during a merge, if
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the
> ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A
> payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the
> PositionDelta is shifted left by one bit. The lowest bit is used to indicate
> whether the length of the following payload is stored explicitly. If not,
> i.e. the bit is 0, then the payload has the same length as the payload of
> the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at
> every skip point has to be known. Therefore the payload length is also stored
> in the skip list located in the FreqFile. Here the same-length compression is
> also used: The lowest bit of DocSkip is used to indicate if the payload
> length is stored for a SkipDatum or if the length is the same as in the last
> SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition()
> then only the position and the payload length are loaded from the ProxFile.
> If the user calls getPayload() then the payload is actually loaded. If
> getPayload() is not called before nextPosition() is called again, then the
> payload data is just skipped.
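
The lazy-loading behaviour in the last item can be sketched roughly as
follows. This is a simplified model over an in-memory byte[], not the code in
the patch: LazyPayloadReaderSketch is a hypothetical name, the method names
only mirror the TermPositions methods quoted above, and two details are
assumptions of the sketch (the first payload length is always stored
explicitly, and a freshly allocated result array is filled from index 0).

```java
/** Hypothetical reader over one document's positions in a payload-enabled .prx stream. */
final class LazyPayloadReaderSketch {
    private final byte[] prox;
    private int cursor = 0;
    private int position = 0;
    private int payloadLength = -1;        // assumption: first length is always explicit
    private boolean payloadPending = false;

    LazyPayloadReaderSketch(byte[] prox) {
        this.prox = prox;
    }

    /** Reads only PositionDelta and, if flagged, PayloadLength; payload bytes stay unread. */
    int nextPosition() {
        if (payloadPending) {
            cursor += payloadLength;       // previous payload was never requested: skip it
            payloadPending = false;
        }
        int code = readVInt();
        position += code >>> 1;            // upper bits carry the position delta
        if ((code & 1) != 0) {
            payloadLength = readVInt();    // lowest bit set: an explicit length follows
        }
        payloadPending = true;
        return position;
    }

    int getPayloadLength() {
        return payloadLength;
    }

    /** Copies the payload now; may be called at most once per nextPosition(). */
    byte[] getPayload(byte[] data, int offset) {
        if (!payloadPending) {
            throw new IllegalStateException("no unread payload at the current position");
        }
        byte[] target = data;
        int targetOffset = offset;
        if (data == null || data.length - offset < payloadLength) {
            target = new byte[payloadLength]; // assumption: fresh array, filled from index 0
            targetOffset = 0;
        }
        System.arraycopy(prox, cursor, target, targetOffset, payloadLength);
        cursor += payloadLength;
        payloadPending = false;
        return target;
    }

    /** VInt: 7 data bits per byte, low-order group first, high bit marks continuation. */
    private int readVInt() {
        int b = prox[cursor++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = prox[cursor++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // Positions 3, 7, 12 with payloads {1,2}, {3,4}, {5}.
        byte[] prox = {7, 2, 1, 2, 8, 3, 4, 11, 1, 5};
        LazyPayloadReaderSketch reader = new LazyPayloadReaderSketch(prox);
        System.out.println(reader.nextPosition());          // prints 3 (payload is skipped later)
        System.out.println(reader.nextPosition());          // prints 7
        System.out.println(reader.getPayload(null, 0)[0]);  // prints 3
    }
}
```

The cost of a payload that is never requested is a single cursor increment,
which is the point of reading the length eagerly and the data lazily.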
>
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of
> the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then
> payloads are enabled for the corresponding field.
> - ProxFile (.prx)
> ProxFile (.prx) --> <TermPositions>^TermCount
> TermPositions --> <Positions>^DocFreq
> Positions --> <PositionDelta, Payload?>^Freq
> Payload --> <PayloadLength?, PayloadData>
> PositionDelta --> VInt
> PayloadLength --> VInt
> PayloadData --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current
> occurrence in the document and the previous occurrence (or zero, if this is
> the first occurrence in this document).
>
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current
> occurrence in the document and the previous occurrence. If PositionDelta is
> odd, then PayloadLength is stored. If PositionDelta is even, then the length
> of the current payload equals the length of the previous payload and thus
> PayloadLength is omitted.
> - FreqFile (.frq)
> SkipDatum --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in
> TermFreqs. Document numbers are represented as differences from the previous
> value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document
> in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is
> even, then the length of the payload at the current skip point equals the
> length of the payload at the last skip point and thus PayloadLength is
> omitted.
> This encoding is space efficient for different use cases:
> * If only some fields of an index have payloads, then there's no space
> overhead for the fields with payloads disabled.
> * If the payloads of consecutive term positions have the same length, then
> the length only has to be stored once for every term. This should be a common
> case, because users probably use the same format for all payloads.
> * If only a few terms of a field have payloads, then we don't waste much
> space because we benefit again from the same-length-compression since we only
> have to store the length zero for the empty payloads once per term.
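
To make the ProxFile layout and the same-length saving concrete, here is a
rough round-trip sketch of the payload-enabled Positions encoding for a single
document. It is not taken from the patch: ProxPayloadCodecSketch and its
methods are hypothetical, it writes to an in-memory buffer instead of an
IndexOutput, and it assumes the first PayloadLength is always stored
explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical encoder/decoder for <PositionDelta, Payload?>^Freq with payloads enabled. */
final class ProxPayloadCodecSketch {

    /** VInt: 7 data bits per byte, low-order group first, high bit marks continuation. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(byte[] in, int[] cursor) {
        int b = in[cursor[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[cursor[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1;                       // assumption: first length is always explicit
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            int length = payloads[i].length;
            if (length != lastLength) {
                writeVInt(out, (delta << 1) | 1);  // odd: PayloadLength follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, delta << 1);        // even: same length as previous payload
            }
            out.write(payloads[i], 0, length);
            lastPosition = positions[i];
        }
        return out.toByteArray();
    }

    /** Decodes count entries; returns the positions and appends the payloads to payloadsOut. */
    static int[] decode(byte[] in, int count, List<byte[]> payloadsOut) {
        int[] cursor = {0};
        int[] positions = new int[count];
        int position = 0;
        int length = -1;
        for (int i = 0; i < count; i++) {
            int code = readVInt(in, cursor);
            position += code >>> 1;
            if ((code & 1) != 0) {
                length = readVInt(in, cursor);
            }
            byte[] payload = new byte[length];
            System.arraycopy(in, cursor[0], payload, 0, length);
            cursor[0] += length;
            positions[i] = position;
            payloadsOut.add(payload);
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] positions = {3, 7, 12};
        byte[][] payloads = {{1, 2}, {3, 4}, {5, 6}};   // all the same length
        byte[] encoded = encode(positions, payloads);
        System.out.println(encoded.length);             // prints 10

        List<byte[]> decoded = new ArrayList<>();
        int[] roundTrip = decode(encoded, positions.length, decoded);
        System.out.println(roundTrip[2]);               // prints 12
    }
}
```

In the example the length 2 is written once, after the first PositionDelta;
the second and third entries cost one byte of PositionDelta plus the payload
bytes. Writing the length explicitly for every position would take 12 bytes
instead of 10.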
> All unit tests pass.