[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

Marvin Humphrey (JIRA) Wed, 08 Apr 2009 09:08:41 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697076#action_12697076 ]


Marvin Humphrey commented on LUCENE-1231:
-----------------------------------------

FWIW, I think the priority for document fetching should be to optimize for search
clusters where you have separate index servers and document/excerpt servers.
Under such a system, it's clear that you'd want to use the standard stored
fields.

To my mind, column stride fields are more of a search tool -- an extension of
the field caches and the "flexible indexing" concept -- and I don't think the
design should be encumbered or deployment slowed so that they can perform
double duty.

However, I think column-stride fields may be more of an imperative for Lucy
than Lucene.  In Lucy, our sort caches will be either column-stride or
variable width and will be written out at index-time -- mmap a
one-value-per-doc column-stride file, and voila: instant sort cache.
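
As a rough illustration of the "mmap and go" idea (this is not Lucy or Lucene
code; the class name, the file layout of one big-endian 32-bit sort ordinal per
document, and the helper methods are all assumptions made for this sketch):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a fixed-width column file holding one int per doc,
// memory-mapped and used directly as a sort cache.
final class MmapSortCache {
    private final IntBuffer ords;

    MmapSortCache(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            // A single mapping is limited to 2 GB, which this sketch assumes is enough.
            this.ords = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
        }
    }

    // Value for a document is found by plain offset arithmetic: no decoding step.
    int ordinal(int docId) {
        return ords.get(docId);
    }

    // Compare two documents by their stored sort ordinal.
    int compare(int docA, int docB) {
        return Integer.compare(ords.get(docA), ords.get(docB));
    }

    // Index-time side: write one big-endian int per document, in doc-id order.
    static void write(Path file, int[] values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        Files.write(file, buf.array());
    }
}
```

Nothing is parsed or copied at open time; the OS pages the column in on demand,
which is what makes the cache "instant" in this sketch.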

> Loading a mix of cached/uncached fields is massive win

I don't quite understand why that would be the case.  Presuming a cold OS
cache, the big cost is the .fdt file disk seek.  Once you're there, how much
of a difference is it to read the field off disk as opposed to reading it from
the cache?

> it becomes even more massive if all required fields happen to be cached.

I can see that.  However, the gain won't be all that significant for systems
where the index fits into RAM, or when the persistent storage device is an
SSD.  And of course a different caching strategy altogether (popular document
caching) is best for dedicated doc servers.


> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If, however, you want to load the data from one field for a large number of
> documents, then stored fields perform quite badly, because lots of I/O seeks
> might have to be performed.
> A better way to do this is using payloads. By creating a "special" posting list
> that has one posting with a payload for each document you can "simulate" a
> column-stride field. The performance is significantly better compared to stored
> fields, but still not optimal. The reason is that for each document the freq
> value, which in this particular case is always 1, has to be decoded, and one
> position value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for
> column-stride data; once we decide on a final name for this feature we can
> change this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data structure.
> This could work like this: When the DocumentsWriter writes a segment it checks
> whether all values of a field have the same length. If yes, it stores them as
> FixedLengthList, if not, then as VariableLengthList. When the SegmentMerger
> merges two or more segments it checks if all segments have a FixedLengthList
> with the same length for a column-stride field. If not, it writes a
> VariableLengthList to the new segment.
> Once this feature is implemented, we should think about making the
> column-stride fields updateable, similar to the norms. This will be a very
> powerful feature that can for example be used for low-latency tagging of
> documents.
> Other use cases:
> - replace norms
> - allow storing boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - decide on a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these
> fields are supposed to be updateable.
> - Define data structures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.
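
The fixed-versus-variable selection described in the quoted issue text could be
sketched roughly as follows. This is only an illustration: the class, enum, and
method names are hypothetical, and it assumes every document in a segment has a
value and that an empty segment falls back to the variable-length layout.

```java
import java.util.List;

// Hypothetical sketch of choosing the column-stride layout at flush and merge time.
final class CsdLayoutChooser {
    enum Layout { FIXED_LENGTH, VARIABLE_LENGTH }

    // Returns the shared byte length if all values have the same length, else -1.
    static int fixedLengthOf(List<byte[]> values) {
        if (values.isEmpty()) {
            return -1; // assumption: nothing to measure, treat as variable
        }
        int len = values.get(0).length;
        for (byte[] v : values) {
            if (v.length != len) {
                return -1;
            }
        }
        return len;
    }

    // Flush time: a new segment is fixed-length only if every value agrees.
    static Layout forSegment(List<byte[]> values) {
        return fixedLengthOf(values) >= 0 ? Layout.FIXED_LENGTH : Layout.VARIABLE_LENGTH;
    }

    // Merge time: each entry is a source segment's fixed length, or -1 if that
    // segment is variable-length. The merged segment stays fixed only if all
    // sources are fixed with the same length.
    static Layout forMerge(int[] segmentFixedLengths) {
        if (segmentFixedLengths.length == 0 || segmentFixedLengths[0] < 0) {
            return Layout.VARIABLE_LENGTH;
        }
        int first = segmentFixedLengths[0];
        for (int len : segmentFixedLengths) {
            if (len != first) {
                return Layout.VARIABLE_LENGTH;
            }
        }
        return Layout.FIXED_LENGTH;
    }
}
```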

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
