[
https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582442#action_12582442
]
Doug Cutting commented on LUCENE-1231:
--------------------------------------
So there are a number of features these fields would have that differ from
other fields:
- no freq
- no positions
- non-sparse representation
- binary values (is this different from payloads?)
- updateable
My question is whether it is best to bundle these together as a new kind of
field, or add these as optional features of ordinary fields, or some
combination. There are certain bundles that may work well together: e.g., a
dense array of fixed-size, updateable binary values w/o freqs or positions.
And not all combinations may be sensible or easy to implement. But most of
these would also be useful à la carte, e.g., no-freqs, no-positions and
(perhaps) updateable.
BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a
reasonable API for updating sparse fields.
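
As an illustration of the "dense array of fixed-size, updateable binary values"
bundle mentioned above, here is a minimal in-memory sketch. The class name
DenseFixedColumn and its methods are hypothetical, not part of any Lucene API;
it only shows that a dense, fixed-size layout allows in-place updates by
document number, with no freqs or positions stored.

```java
/** Hypothetical dense column: one fixed-size binary value per document. */
final class DenseFixedColumn {
    private final int valueSize; // bytes per document (assumed fixed)
    private final byte[] data;   // docCount * valueSize bytes, nothing else stored

    DenseFixedColumn(int docCount, int valueSize) {
        this.valueSize = valueSize;
        this.data = new byte[docCount * valueSize];
    }

    /** Overwrites the value of one document in place; the length must match. */
    void set(int docId, byte[] value) {
        if (value.length != valueSize) {
            throw new IllegalArgumentException("expected " + valueSize + " bytes");
        }
        System.arraycopy(value, 0, data, docId * valueSize, valueSize);
    }

    /** Returns a copy of the value stored for one document. */
    byte[] get(int docId) {
        byte[] result = new byte[valueSize];
        System.arraycopy(data, docId * valueSize, result, 0, valueSize);
        return result;
    }
}
```

Because every value sits at offset docId * valueSize, an update touches only
that slot, which is what makes this particular bundle easy to make updateable.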
> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
> Key: LUCENE-1231
> URL: https://issues.apache.org/jira/browse/LUCENE-1231
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.4
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If, however, you want to load the data from one field for a large number of
> documents, then stored fields perform quite badly, because lots of I/O seeks
> might have to be performed.
> A better way to do this is using payloads. By creating a "special" posting
> list that has one posting with payload for each document you can "simulate" a
> column-stride field. The performance is significantly better compared to
> stored fields, but still not optimal. The reason is that for each document
> the freq value, which in this particular case is always 1, has to be decoded,
> and one position value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for
> column-stride data; once we decide on a final name for this feature we can
> change this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList>
> FixedLengthList --> <Payload>^SegSize
> VariableLengthList --> <DocDelta, PayloadLength?, Payload>
> Payload --> Byte^PayloadLength
> PayloadLength --> VInt
> SkipList --> see frq.file
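
To make the VariableLengthList entry layout above concrete, here is a rough
sketch of a writer. The class VariableLengthListWriter is hypothetical.
It assumes DocDelta is written as a VInt (as in the existing postings files)
and, for simplicity, always writes PayloadLength rather than treating it as
optional; the skip list is left out.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical writer for <DocDelta, PayloadLength, Payload> entries. */
final class VariableLengthListWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int lastDocId = 0;

    /** Writes a non-negative int as a VInt: 7 bits per byte, high bit set while more follow. */
    private void writeVInt(int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Appends one entry; doc IDs must be added in increasing order. */
    void add(int docId, byte[] payload) {
        writeVInt(docId - lastDocId); // DocDelta (assumed to be a VInt)
        writeVInt(payload.length);    // PayloadLength (always written in this sketch)
        out.write(payload, 0, payload.length);
        lastDocId = docId;
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }
}
```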
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data
> structure. This could work like this: When the DocumentsWriter writes a
> segment it checks whether all values of a field have the same length. If
> yes, it stores them as FixedLengthList, if not, then as VariableLengthList.
> When the SegmentMerger merges two or more segments it checks if all segments
> have a FixedLengthList with the same length for a column-stride field. If
> not, it writes a VariableLengthList to the new segment.
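
The selection rule described above could be sketched roughly as follows. The
class CsdFormatChooser, its enum, and the convention of using -1 for a segment
that already holds a VariableLengthList are assumptions made for illustration;
none of them come from the proposal or from Lucene itself.

```java
import java.util.List;

/** Hypothetical helper that picks the layout at flush time and at merge time. */
final class CsdFormatChooser {
    enum Format { FIXED_LENGTH, VARIABLE_LENGTH }

    /** Flush: fixed-length only if every value of the field has the same length. */
    static Format chooseAtFlush(List<byte[]> values) {
        for (byte[] v : values) {
            if (v.length != values.get(0).length) {
                return Format.VARIABLE_LENGTH;
            }
        }
        return Format.FIXED_LENGTH;
    }

    /**
     * Merge: each entry is the fixed value length of one source segment,
     * or -1 (assumed marker) if that segment uses a VariableLengthList.
     */
    static Format chooseAtMerge(int[] segmentValueLengths) {
        for (int len : segmentValueLengths) {
            if (len < 0 || len != segmentValueLengths[0]) {
                return Format.VARIABLE_LENGTH;
            }
        }
        return Format.FIXED_LENGTH;
    }
}
```

Note that in this sketch a merge can only keep or widen the format: once any
source segment is variable-length, the merged segment is variable-length too.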
> Once this feature is implemented, we should think about making the column-
> stride fields updateable, similar to the norms. This will be a very powerful
> feature that can for example be used for low-latency tagging of documents.
> Other use cases:
> - replace norms
> - allow boost values to be stored separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide on a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these
> fields are supposed to be updateable.
> - Define data structures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start
> implementing.