[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697470#action_12697470 ]
Michael McCandless commented on LUCENE-1231:
--------------------------------------------

{quote}
To my mind, column stride fields are more of a search tool - an extension of the field caches and the "flexible indexing" concept
{quote}

I don't think of CSFs as an extension to FieldCache; I think of them as an alternate (much more efficient than uninversion) underlying store that FieldCache can use to retrieve values. I.e., the way you'll access a CSF in Lucene will be through the [new in LUCENE-831] FieldCache API.

{quote}
However, I think column-stride fields may be more of an imperative for Lucy than Lucene. In Lucy, our sort caches will be either column-stride or variable width and will be written out at index-time - mmap a one-value-per-doc column-stride file, and voila: instant sort cache.
{quote}

I think column or row stride, vs fixed or variable width storage, are orthogonal issues? E.g., column-stride fields (in Lucene) could very well use variable width storage (e.g. for storing String values).

{quote}
I can see that. However, the gain won't be all that significant for systems where the index fits into RAM, or when the persistent storage device is an SSD.
{quote}

SSDs, as fast as they are, are still orders of magnitude slower than RAM (assuming of course the OS hasn't swapped your RAM out to your SSD ;).

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If, however, you want to load the data from one field for a large number of
> documents, then stored fields perform quite badly, because lots of I/O seeks
> might have to be performed.
> A better way to do this is using payloads. By creating a "special" posting list
> that has one posting with payload for each document you can "simulate" a
> column-stride field. The performance is significantly better compared to stored
> fields, but still not optimal. The reason is that for each document the freq
> value, which in this particular case is always 1, has to be decoded, and one
> position value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for
> column-stride data; once we decide on a final name for this feature we can
> change this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList>
> FixedLengthList --> <Payload>^SegSize
> VariableLengthList --> <DocDelta, PayloadLength?, Payload>
> Payload --> Byte^PayloadLength
> PayloadLength --> VInt
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure.
> This could work like this: When the DocumentsWriter writes a segment it checks
> whether all values of a field have the same length. If yes, it stores them as
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger
> merges two or more segments it checks if all segments have a FixedLengthList
> with the same length for a column-stride field. If not, it writes a
> VariableLengthList to the new segment.
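
The selection rule quoted above could be sketched roughly as follows in Java. This is only an illustration under stated assumptions: CsdLayoutChooser, chooseForSegment, chooseForMerge and the VARIABLE sentinel are hypothetical names, not existing Lucene APIs, and the sketch assumes every document in a segment carries a value for the field.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class CsdLayoutChooser {
    /** Sentinel: values differ in length, so a VariableLengthList would be written. */
    static final int VARIABLE = -1;

    /**
     * Flush-time check: returns the shared byte length if all values of the
     * field have the same length (FixedLengthList), otherwise VARIABLE.
     */
    static int chooseForSegment(List<byte[]> values) {
        if (values.isEmpty()) {
            return VARIABLE; // assumption: no values means no width to fix on
        }
        int len = values.get(0).length;
        for (byte[] v : values) {
            if (v.length != len) {
                return VARIABLE;
            }
        }
        return len;
    }

    /**
     * Merge-time check: each entry is the result of chooseForSegment for one
     * input segment. The merged segment stays fixed-length only if every
     * input segment is fixed-length with the same length.
     */
    static int chooseForMerge(int[] segmentLengths) {
        if (segmentLengths.length == 0) {
            return VARIABLE;
        }
        int len = segmentLengths[0];
        for (int l : segmentLengths) {
            if (l == VARIABLE || l != len) {
                return VARIABLE;
            }
        }
        return len;
    }
}
```

With a fixed length n, the value for a document could then be located at byte offset docId * n without decoding anything else, which is where the fixed-length case would gain over the payload-based simulation.
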
> Once this feature is implemented, we should think about making the
> column-stride fields updateable, similar to the norms. This will be a very
> powerful feature that can for example be used for low-latency tagging of
> documents.
> Other use cases:
> - replace norms
> - allow storing boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide on a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these
> fields are supposed to be updateable.
> - Define data structures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start
> implementing.