[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

Doug Cutting (JIRA) Wed, 26 Mar 2008 14:53:55 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582442#action_12582442
 ]


Doug Cutting commented on LUCENE-1231:
--------------------------------------

So there are a number of features these fields would have that differ from 
other fields:
 - no freq
 - no positions
 - non-sparse representation
 - binary values (is this different from payloads?)
 - updateable

My question is whether it is best to bundle these together as a new kind of 
field, or add these as optional features of ordinary fields, or some 
combination.  There are a certain bundles that may work well together: e.g., a 
dense array of fixed-size, updateable binary values w/o freqs or positions.  
And not all combinations may be sensible or easy to implement.  But most of 
these would also be useful ala carte too, e.g., no-freqs, no-positions and 
(perhaps) updateable.

BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a 
reasonable API for updating sparse fields.

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as 
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is an sequential I/O operation.
> If you however want to load the data from one field for a large number of 
> documents, then stored fields perform quite badly, because lot's of I/O seeks 
> might have to be performed. 
> A better way to do this is using payloads. By creating a "special" posting 
> list
> that has one posting with payload for each document you can "simulate" a 
> column-
> stride field. The performance is significantly better compared to stored 
> fields,
> however still not optimal. The reason is that for each document the freq 
> value,
> which is in this particular case always 1, has to be decoded, also one 
> position
> value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for column-
> stride data, once we decide for a final name for this feature we can change 
> this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data 
> structure. 
> This could work like this: When the DocumentsWriter writes a segment it 
> checks 
> whether all values of a field have the same length. If yes, it stores them as 
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger 
> merges two or more segments it checks if all segments have a FixedLengthList 
> with the same length for a column-stride field. If not, it writes a 
> VariableLengthList to the new segment. 
> Once this feature is implemented, we should think about making the column-
> stride fields updateable, similar to the norms. This will be a very powerful
> feature that can for example be used for low-latency tagging of documents.
> Other use cases:
> - replace norms
> - allow to store boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide for a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these 
> fields are supposed to be updateable.
> - Define datastructures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

Reply via email to