[ 
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481003
 ] 

Grant Ingersoll commented on LUCENE-755:
----------------------------------------

OK, I've applied the patch.  All tests pass for me.  I think it looks
good.  Have you run any benchmarks on it?  I ran the standard one on
the patched version and on trunk, in a totally unscientific test.  In
theory, the case with no payloads should perform very close to the
existing code, and this seems to be borne out by my run of the
micro-standard (ant run-task in contrib/benchmark).  Once we have this
committed, someone can take a crack at adding payload support to the
benchmarker.

Payload should probably be serializable.

All in all, I think we could commit this, then add the search/scoring
capabilities like we've talked about.  I like the
documentation/comments you have added, very useful.  (One of these
days I will take on documenting the index package like I intend to,
so what you've added will be quite helpful!)  We will/may want to
add, for example, a PayloadQuery and derivatives, and a QueryParser
operator that supports searching in the payload, or possibly
boosting if a certain term has a certain type of payload (not that I
want anything to do with the QueryParser).  Even beyond that,
SpanPayloadQuery, etc.  I may have some cycles to actually
write some code for these next week.

Just throwing this out there, I'm not sure I really mean it or  
not :-), but:
do you think it would be useful to consider restricting the size of  
the payload?  I know, I know, as soon as we put a limit on it,  
someone will want to expand it, but I was thinking if we knew the  
size had a limit we could better control the performance and caching,  
etc. on the scoring/search side.    I guess it is buyer beware, maybe  
we put some javadocs on this.

Also, I started http://wiki.apache.org/lucene-java/Payloads as I  
think we will want to have some docs explaining why Payloads are  
useful in non-javadoc format.

On a side note, have a look at
http://wiki.apache.org/lucene-java/PatchCheckList to see if there is
anything you feel you can add.



--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at
http://wiki.apache.org/jakarta-lucene/LuceneFAQ




> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch, 
> payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design, with modifications that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> ------------------------------   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
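A minimal sketch of that allocation pattern, assuming nothing about the real index.Payload beyond the description above. PayloadSlice and sliceEvenly are hypothetical names used only for illustration; they are not classes or methods from the patch. One buffer is allocated per document and each payload is a view onto part of it.

```java
class PayloadSliceSketch {
    /** Hypothetical stand-in for index.Payload: a view onto part of a shared byte[]. */
    static final class PayloadSlice {
        final byte[] data;
        final int offset;
        final int length;

        PayloadSlice(byte[] data, int offset, int length) {
            if (offset < 0 || length < 0 || offset + length > data.length) {
                throw new IllegalArgumentException("slice out of bounds");
            }
            this.data = data;
            this.offset = offset;
            this.length = length;
        }

        /** Copies the slice out; only needed when a caller wants its own array. */
        byte[] toByteArray() {
            byte[] copy = new byte[length];
            System.arraycopy(data, offset, copy, 0, length);
            return copy;
        }
    }

    /** Cuts one shared buffer into consecutive fixed-length payload views without copying. */
    static PayloadSlice[] sliceEvenly(byte[] shared, int payloadLength) {
        int count = shared.length / payloadLength;
        PayloadSlice[] slices = new PayloadSlice[count];
        for (int i = 0; i < count; i++) {
            slices[i] = new PayloadSlice(shared, i * payloadLength, payloadLength);
        }
        return slices;
    }
}
```

All slices returned by sliceEvenly share the same backing array, so the only per-payload allocation is the small wrapper object itself.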
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    * 
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    * 
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose. 
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
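The calling contract of getPayload(byte[] data, int offset) might look roughly like the following in-memory stand-in. FakePositions is hypothetical and not part of the patch; it only mimics the documented rules (reuse the caller's array when it is big enough, allocate otherwise, at most one call per nextPosition()). That a freshly allocated array is filled starting at index 0 is an assumption.

```java
class LazyPayloadSketch {
    /** Hypothetical in-memory stand-in for one term's positions and payloads in one document. */
    static final class FakePositions {
        private final int[] positions;
        private final byte[][] payloads;
        private int cursor = -1;
        private boolean payloadConsumed;

        FakePositions(int[] positions, byte[][] payloads) {
            this.positions = positions;
            this.payloads = payloads;
        }

        /** Advances to the next position; an unread payload is simply skipped. */
        int nextPosition() {
            cursor++;
            payloadConsumed = false;
            return positions[cursor];
        }

        int getPayloadLength() {
            return payloads[cursor].length;
        }

        /** Copies the current payload into data at offset, or into a new array if data is too small. */
        byte[] getPayload(byte[] data, int offset) {
            if (cursor < 0) {
                throw new IllegalStateException("nextPosition() has not been called yet");
            }
            if (payloadConsumed) {
                throw new IllegalStateException("payload already read for this position");
            }
            byte[] source = payloads[cursor];
            byte[] target = data;
            int targetOffset = offset;
            if (data == null || data.length - offset < source.length) {
                // Assumption: a fresh array is exactly payload-sized and filled from index 0.
                target = new byte[source.length];
                targetOffset = 0;
            }
            System.arraycopy(source, 0, target, targetOffset, source.length);
            payloadConsumed = true;
            return target;
        }
    }
}
```

A caller would typically keep one buffer across positions and check whether the returned array is the same object before reading from the chosen offset.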
> Furthermore, this patch introduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes() method without an offset argument. 
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:
>    * The DocumentWriter enables payloads for a field if one or more Tokens 
> carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if 
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the 
> ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A 
> payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the 
> PositionDelta is shifted one bit to the left. The lowest bit is used to 
> indicate whether the length of the following payload is stored explicitly. 
> If not, i.e. the bit is not set, then the payload has the same length as the 
> payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at 
> every skip point has to be known. Therefore the payload length is also stored 
> in the skip list located in the FreqFile. Here the same-length compression is 
> also used: The lowest bit of DocSkip is used to indicate if the payload 
> length is stored for a SkipDatum or if the length is the same as in the last 
> SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), 
> only the position and the payload length are loaded from the ProxFile. If 
> the user calls getPayload() then the payload is actually loaded. If 
> getPayload() is not called before nextPosition() is called again, then the 
> payload data is just skipped.
>   
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of 
> the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then 
> payloads are enabled for the corresponding field. 
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt 
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current 
> occurrence in the document and the previous occurrence (or zero, if this is 
> the first occurrence in this document).
>   
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current 
> occurrence in the document and the previous occurrence. If PositionDelta is 
> odd, then PayloadLength is stored. If PositionDelta is even, then the length 
> of the current payload equals the length of the previous payload and thus 
> PayloadLength is omitted.
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in 
> TermFreqs. Document numbers are represented as differences from the previous 
> value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document 
> in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is 
> even, then the length of the payload at the current skip point equals the 
> length of the payload at the last skip point and thus PayloadLength is 
> omitted.
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space 
> overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then 
> the length only has to be stored once for every term. This should be a common 
> case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much 
> space because we benefit again from the same-length-compression since we only 
> have to store the length zero for the empty payloads once per term.
> All unit tests pass.
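
To make the payload-enabled .prx layout in the quoted description concrete, here is a rough sketch of the same-length compression for the positions of a single document. Class and method names are hypothetical and the code is not taken from the patch; it assumes that the first payload always stores its length explicitly.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ProxPayloadCodecSketch {
    /** Writes a non-negative int as a VInt: 7 data bits per byte, high bit set on all but the last. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt starting at cursor[0] and advances the cursor past it. */
    static int readVInt(byte[] in, int[] cursor) {
        byte b = in[cursor[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[cursor[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes ascending positions with one payload each: <PositionDelta, PayloadLength?, PayloadData>. */
    static byte[] encode(int[] positions, byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: forces the first payload to store its length
        for (int i = 0; i < positions.length; i++) {
            int delta = positions[i] - lastPosition;
            int length = payloads[i].length;
            if (length != lastLength) {
                writeVInt(out, (delta << 1) | 1); // odd: PayloadLength follows
                writeVInt(out, length);
                lastLength = length;
            } else {
                writeVInt(out, delta << 1);       // even: same length as previous payload
            }
            out.write(payloads[i], 0, length);
            lastPosition = positions[i];
        }
        return out.toByteArray();
    }

    /** Decodes count entries, filling positionsOut and returning the payloads in order. */
    static List<byte[]> decode(byte[] in, int count, int[] positionsOut) {
        List<byte[]> payloads = new ArrayList<>();
        int[] cursor = {0};
        int position = 0;
        int length = 0;
        for (int i = 0; i < count; i++) {
            int code = readVInt(in, cursor);
            position += code >>> 1;
            if ((code & 1) != 0) {
                length = readVInt(in, cursor);
            }
            payloads.add(Arrays.copyOfRange(in, cursor[0], cursor[0] + length));
            cursor[0] += length;
            positionsOut[i] = position;
        }
        return payloads;
    }
}
```

With three positions that each carry a two-byte payload, the length is written once, so the encoding takes 10 bytes instead of the 12 that an explicit length per position would need.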
