[ http://issues.apache.org/jira/browse/LUCENE-755?page=comments#action_12460647 ]

Michael Busch commented on LUCENE-755:
--------------------------------------
> Great patch, Michael, and something that will come in handy for a lot of
> people. I can vouch it applies cleanly and all the tests pass.

Cool, thanks for trying it out, Grant! :-)

> Now I am not sure I am totally understanding everything just yet so the
> following is thinking aloud, but bear with me.
> One of the big unanswered questions (besides how this fits into the whole
> flexible indexing scheme as discussed on the Payloads and
> Flexible indexing threads on java-dev) at this point for me is: how do we
> expose/integrate this into the scoring side of the equation? It seems
> we would need some interfaces that hook into the scoring mechanism so that
> people can define what all these payloads are actually used
> for, or am I missing something? Yet the TermScorer takes in the TermDocs, so
> it doesn't yet have access to the payloads (although this is
> easily remedied since we have access to the TermPositions when we construct
> TermScorer.) Span Queries could easily be extended to
> include payload information since they use the TermPositions, which would be
> useful for post-processing algorithms.

I would say it really depends on the use case of the payloads. Take XML
search, for example: here payloads can be used to store the depth of each
term. An extended Span class could then take the depth information into
account for query evaluation. As you pointed out, the span classes already
have easy access to the payloads since they use TermPositions, so
implementing such a subclass should be fairly simple.

> I can imagine an interface that would have to be set on the Query/Scorer
> (and inherited unless otherwise set???). The default
> implementation would be to ignore any payload, I suppose.
> We could also add a callback in the Similarity mechanism, something like:
>
>   float calculatePayloadFactor(byte[] payload);
> or
>   float calculatePayloadFactor(Term term, byte[] payload);
>
> Then this factor could be added/multiplied into the term score or whatever
> other scorers use it??????
>
> Is this making any sense?

I believe the case you're describing here is per-term norms/boosts? Yeah, I
think this would work, and you are right: the Scorers have to have access to
TermPositions; TermDocs is not sufficient. So yes, it would be nice if
TermScorer used TermPositions instead of TermDocs. I just opened LUCENE-761,
which changes SegmentTermPositions to clone the proxStream lazily, the first
time nextPosition() is called. Then the costs of creating TermDocs and
TermPositions are the same, and together with lazy prox skipping (LUCENE-687)
there is no longer any reason not to use TermPositions.

However, as currently discussed on java-dev, per-term boosts could also be
part of a new posting format in the flexible indexing scheme and thus not be
stored in the payloads.

So in general this patch doesn't yet add a new search feature to Lucene;
rather, it opens the door for new features in the future. The way to add such
a new feature is then:

1) Write an analyzer that provides the data necessary for the new feature and
   produces Tokens with payloads containing this data.
2) Write/extend a Scorer that has access to TermPositions and makes use of
   the data in the payloads for matching or scoring or both.

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: http://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads)
> together with each position of a term in its posting lists.
> A while ago this was discussed on the dev mailing list, where I proposed an
> initial design. This patch has a much improved design with modifications
> that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile
> (.prx). Therefore this patch provides low-level APIs to simply store and
> retrieve byte arrays in the posting lists in an efficient way.
>
> API and Usage
> ------------------------------
> The new class index.Payload is basically just a wrapper around a byte[] array
> together with int variables for offset and length. So a user does not have to
> create a byte array for every payload, but can rather allocate one array for
> all payloads of a document and provide offset and length information. This
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a
> TokenStream or TokenFilter that produces Tokens with payloads. I added the
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now
> offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    *
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method does not need to be called at all, which saves the
>    * cost of loading the data.
>    *
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose.
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was
> only a writeBytes() method without an offset argument.
>
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for
>   a field. The user does not have to enable payloads for a field; this is done
>   automatically:
>    * The DocumentWriter enables payloads for a field if one or more Tokens
>      carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge if
>      payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the
>   ProxFile and FreqFile don't change.
> - Payloads are stored inline in the posting list of a term in the ProxFile. A
>   payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the
>   PositionDelta is shifted one bit. The lowest bit is used to indicate whether
>   the length of the following payload is stored explicitly. If not, i.e. the
>   bit is false, then the payload has the same length as the payload of the
>   previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at
>   every skip point has to be known. Therefore the payload length is also stored
>   in the skip list located in the FreqFile.
>   Here the same-length compression is also used: The lowest bit of DocSkip
>   is used to indicate if the payload length is stored for a SkipDatum or if
>   the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition()
>   then only the position and the payload length are loaded from the ProxFile.
>   If the user calls getPayload() then the payload is actually loaded. If
>   getPayload() is not called before nextPosition() is called again, then the
>   payload data is just skipped.
>
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
>   The format of the .fnm file does not change. The only change is the use of
>   the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then
>   payloads are enabled for the corresponding field.
> - ProxFile (.prx)
>     ProxFile (.prx) --> <TermPositions>^TermCount
>     TermPositions   --> <Positions>^DocFreq
>     Positions       --> <PositionDelta, Payload?>^Freq
>     Payload         --> <PayloadLength?, PayloadData>
>     PositionDelta   --> VInt
>     PayloadLength   --> VInt
>     PayloadData     --> byte^PayloadLength
>   For payloads disabled (unchanged):
>   PositionDelta is the difference between the position of the current
>   occurrence in the document and the previous occurrence (or zero, if this is
>   the first occurrence in this document).
>
>   For payloads enabled:
>   PositionDelta/2 is the difference between the position of the current
>   occurrence in the document and the previous occurrence. If PositionDelta is
>   odd, then PayloadLength is stored. If PositionDelta is even, then the length
>   of the current payload equals the length of the previous payload and thus
>   PayloadLength is omitted.
> - FreqFile (.frq)
>     SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
>     PayloadLength --> VInt
>   For payloads disabled (unchanged):
>   DocSkip records the document number before every SkipInterval-th document in
>   TermFreqs.
>   Document numbers are represented as differences from the previous
>   value in the sequence.
>   For payloads enabled:
>   DocSkip/2 records the document number before every SkipInterval-th document
>   in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is
>   even, then the length of the payload at the current skip point equals the
>   length of the payload at the last skip point and thus PayloadLength is
>   omitted.
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space
>      overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then
>      the length only has to be stored once for every term. This should be a
>      common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much
>      space because we benefit again from the same-length compression since we
>      only have to store the length zero for the empty payloads once per term.
> All unit tests pass.
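
To make the .prx layout for payload-enabled fields a bit more concrete, here
is a minimal sketch of the same-length compression for the positions of one
term in one document. It is not code from the patch: the class
PayloadPositionCodec, its Occurrence type and the encode/decode/readVInt/
writeVInt helpers are names invented for illustration, it works on a plain
byte array rather than IndexOutput/IndexInput, and starting the "previous
length" at -1 so that the first length is always written is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; hypothetical helper, not part of payloads.patch. */
final class PayloadPositionCodec {

    /** One term occurrence: absolute position plus payload bytes (may be empty). */
    static final class Occurrence {
        final int position;
        final byte[] payload;

        Occurrence(int position, byte[] payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    /** Writes a VInt: 7 bits per byte, low-order bits first, high bit = "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a VInt starting at cursor[0] and advances the cursor past it. */
    static int readVInt(byte[] data, int[] cursor) {
        byte b = data[cursor[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[cursor[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encodes <PositionDelta, PayloadLength?, PayloadData> for each occurrence. */
    static byte[] encode(List<Occurrence> occurrences) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastPosition = 0;
        int lastLength = -1; // assumption: forces the first length to be stored
        for (Occurrence occ : occurrences) {
            int delta = occ.position - lastPosition;
            if (occ.payload.length != lastLength) {
                // Odd PositionDelta: an explicit PayloadLength follows.
                writeVInt(out, (delta << 1) | 1);
                writeVInt(out, occ.payload.length);
                lastLength = occ.payload.length;
            } else {
                // Even PositionDelta: same length as the previous payload.
                writeVInt(out, delta << 1);
            }
            out.write(occ.payload, 0, occ.payload.length);
            lastPosition = occ.position;
        }
        return out.toByteArray();
    }

    /** Decodes freq occurrences written by encode(). */
    static List<Occurrence> decode(byte[] data, int freq) {
        List<Occurrence> result = new ArrayList<>();
        int[] cursor = {0};
        int position = 0;
        int length = 0;
        for (int i = 0; i < freq; i++) {
            int code = readVInt(data, cursor);
            position += code >>> 1;          // PositionDelta/2 is the real delta
            if ((code & 1) != 0) {
                length = readVInt(data, cursor);
            }
            byte[] payload = Arrays.copyOfRange(data, cursor[0], cursor[0] + length);
            cursor[0] += length;
            result.add(new Occurrence(position, payload));
        }
        return result;
    }
}
```

For positions 3, 7 and 8 with payloads {1, 2}, {3, 4} and {5}, encode()
produces the ten bytes 7 2 1 2 8 3 4 3 1 5: the length is written for the
first and third occurrence only, because the second payload has the same
length as the first.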