[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

Earwin Burrfoot (JIRA) Wed, 08 Apr 2009 04:00:46 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696977#action_12696977
 ]


Earwin Burrfoot commented on LUCENE-1231:
-----------------------------------------

I can share my design for doc loading, if anybody needs it:

public interface FieldCache {
  DocLoader loader(FieldInfo<?>... fields);
  ....
}

public interface DocLoader {
  void load(Doc doc);
  <T> T value(FieldInfo<T> field);
}

Doc is my analog of ScoreDoc; for these purposes it is completely identical.
FieldInfos are constants defined like UserFields.EMAIL; they hold the field's
type, its name, indexing method, whether it is cached, and the way it is
cached. Two synthetic fields exist, LUCENE_ID and SCORE; they allow the
same API to be used for anything field-related.
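
A minimal sketch of what such a FieldInfo constant might look like. Only FieldInfo, UserFields.EMAIL, LUCENE_ID and SCORE come from the description above; the Caching enum, the member names, the boolean standing in for the "indexing method", and the SyntheticFields holder class are assumptions made for illustration.

```java
// Illustrative sketch only: member names and the Caching enum are assumptions,
// not taken from the design described above.
final class FieldInfo<T> {
    // Assumed ways a field can be cached.
    enum Caching { NONE, IN_MEMORY }

    final String name;      // field name in the index
    final Class<T> type;    // value type, so lookups can be type-safe
    final boolean indexed;  // simplified stand-in for "indexing method"
    final Caching caching;  // whether and how the field is cached

    FieldInfo(String name, Class<T> type, boolean indexed, Caching caching) {
        this.name = name;
        this.type = type;
        this.indexed = indexed;
        this.caching = caching;
    }

    boolean isCached() {
        return caching != Caching.NONE;
    }
}

// Application-level constants, in the style of UserFields.EMAIL.
final class UserFields {
    static final FieldInfo<String> EMAIL =
        new FieldInfo<>("email", String.class, true, FieldInfo.Caching.NONE);

    private UserFields() {}
}

// Hypothetical holder for the two synthetic fields.
final class SyntheticFields {
    static final FieldInfo<Integer> LUCENE_ID =
        new FieldInfo<>("LUCENE_ID", Integer.class, false, FieldInfo.Caching.NONE);
    static final FieldInfo<Float> SCORE =
        new FieldInfo<>("SCORE", Float.class, false, FieldInfo.Caching.NONE);

    private SyntheticFields() {}
}
```

Carrying the value type in the constant is what lets a call like value(UserFields.EMAIL) return a String without a cast at the call site.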

Workflow looks like this:

// I create a loader. Fields are checked against the cache; for those that
// aren't cached I create a FieldSelector.
loader = searcher.fieldCache().loader(concat(payloadFields, ID, DOCUMENT_TYPE,
    sortBy.field));

// Then, for each document I'm going to send in response to a search request,
// I select this document. An indexReader.document(fieldSelector) happens here
// if there are any uncached fields.
loader.load(doc);

// Then I extract the values I need. Cached ones arrive from the cache;
// uncached ones are decoded from the Document retrieved in the previous step.
hit = new Hit(loader.value(ID), loader.value(DOCUMENT_TYPE),
    loader.value(sortBy.field)) // etc, etc


Having a single API to retrieve values regardless of the way they are
stored/cached is very handy. Loading a mix of stored/column-stride (if I
correctly understand what they are) fields is pointless; you're more likely to
lose performance than to gain it. Loading a mix of cached/uncached fields is
a massive win, and it becomes even more massive if all required fields happen
to be cached.
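
A rough sketch of how a loader could serve a mix of cached and uncached fields behind one value() call. This is not my actual implementation: MixedDocLoader, the per-field Object[] cache, the int doc id (instead of Doc), and the BiFunction standing in for indexReader.document(fieldSelector) are all assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Minimal stand-in for the FieldInfo constants (name and value type only).
final class FieldInfo<T> {
    final String name;
    final Class<T> type;

    FieldInfo(String name, Class<T> type) {
        this.name = name;
        this.type = type;
    }
}

// Hypothetical loader: cached fields are read from per-field arrays indexed by
// doc id; all uncached fields are fetched together, once per load() call.
final class MixedDocLoader {
    private final Map<String, Object[]> cache;  // field name -> value per doc id
    // Stand-in for indexReader.document(fieldSelector): (docId, field names) -> values.
    private final BiFunction<Integer, List<String>, Map<String, Object>> storedFetch;
    private final List<String> uncached = new ArrayList<>();
    private Map<String, Object> stored = Collections.emptyMap();
    private int docId = -1;
    int storedFetches = 0;  // exposed only to make the saving visible

    MixedDocLoader(Map<String, Object[]> cache,
                   BiFunction<Integer, List<String>, Map<String, Object>> storedFetch,
                   FieldInfo<?>... fields) {
        this.cache = cache;
        this.storedFetch = storedFetch;
        // Check the requested fields against the cache up front.
        for (FieldInfo<?> f : fields) {
            if (!cache.containsKey(f.name)) {
                uncached.add(f.name);
            }
        }
    }

    void load(int docId) {
        this.docId = docId;
        if (uncached.isEmpty()) {
            return;  // everything is cached: no stored-field access at all
        }
        stored = storedFetch.apply(docId, uncached);
        storedFetches++;
    }

    <T> T value(FieldInfo<T> field) {
        Object[] column = cache.get(field.name);
        Object raw = (column != null) ? column[docId] : stored.get(field.name);
        return field.type.cast(raw);
    }
}
```

When every requested field is cached, load() never touches stored fields, which is where the biggest saving comes from.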

> Column-stride fields (aka per-document Payloads)
> ------------------------------------------------
>
>                 Key: LUCENE-1231
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1231
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.0
>
>
> This new feature has been proposed and discussed here:
> http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
> Currently it is possible in Lucene to store data as stored fields or as
> payloads.
> Stored fields provide good performance if you want to load all fields for one
> document, because this is a sequential I/O operation.
> If, however, you want to load the data from one field for a large number of
> documents, then stored fields perform quite badly, because lots of I/O seeks
> might have to be performed.
> A better way to do this is using payloads. By creating a "special" posting
> list that has one posting with payload for each document you can "simulate"
> a column-stride field. The performance is significantly better compared to
> stored fields, however still not optimal. The reason is that for each
> document the freq value, which is in this particular case always 1, has to
> be decoded, also one position value, which is always 0, has to be loaded.
> As a solution we want to add real column-stride fields to Lucene. A possible
> format for the new data structure could look like this (CSD stands for
> column-stride data; once we decide on a final name for this feature we can
> change this):
> CSDList --> FixedLengthList | <VariableLengthList, SkipList> 
> FixedLengthList --> <Payload>^SegSize 
> VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
> Payload --> Byte^PayloadLength 
> PayloadLength --> VInt 
> SkipList --> see frq.file
> We distinguish here between the fixed length and the variable length cases. To
> allow flexibility, Lucene could automatically pick the "right" data
> structure.
> This could work like this: When the DocumentsWriter writes a segment it
> checks whether all values of a field have the same length. If yes, it stores
> them as FixedLengthList, if not, then as VariableLengthList. When the
> SegmentMerger merges two or more segments it checks if all segments have a
> FixedLengthList with the same length for a column-stride field. If not, it
> writes a VariableLengthList to the new segment.
> Once this feature is implemented, we should think about making the
> column-stride fields updateable, similar to the norms. This will be a very
> powerful feature that can for example be used for low-latency tagging of
> documents.
> Other use cases:
> - replace norms
> - allow storing boost values separately from norms
> - as input for the FieldCache, thus providing significantly improved loading
> performance (see LUCENE-831)
> Things that need to be done here:
> - Decide on a name for this feature :) - I think "column-stride fields" was
> liked better than "per-document payloads"
> - Design an API for this feature. We should keep in mind here that these
> fields are supposed to be updateable.
> - Define data structures.
> I would like to get this feature into 2.4. Feedback about the open questions
> is very welcome so that we can finalize the design soon and start 
> implementing.
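
The fixed-length versus variable-length choice in the quoted description can be sketched roughly as follows. This is only an illustration of the stated rule, not code from any patch: CsdFormatChooser, both method names, the VARIABLE marker, and the handling of the empty case are assumptions.

```java
import java.util.List;

// Hypothetical helper illustrating the rule in the issue description.
final class CsdFormatChooser {
    enum Format { FIXED_LENGTH, VARIABLE_LENGTH }

    // Assumed marker for a segment whose values were written as a VariableLengthList.
    static final int VARIABLE = -1;

    // Flush-time rule: FixedLengthList only if all values of the field have
    // the same length. An empty field is treated as fixed (assumption).
    static Format chooseOnFlush(List<byte[]> values) {
        if (values.isEmpty()) {
            return Format.FIXED_LENGTH;
        }
        int length = values.get(0).length;
        for (byte[] value : values) {
            if (value.length != length) {
                return Format.VARIABLE_LENGTH;
            }
        }
        return Format.FIXED_LENGTH;
    }

    // Merge-time rule: FixedLengthList only if every segment being merged has
    // a FixedLengthList with the same value length. Each entry is a segment's
    // fixed value length, or VARIABLE for a variable-length segment.
    static Format chooseOnMerge(int[] segmentValueLengths) {
        if (segmentValueLengths.length == 0) {
            return Format.FIXED_LENGTH;
        }
        int first = segmentValueLengths[0];
        if (first == VARIABLE) {
            return Format.VARIABLE_LENGTH;
        }
        for (int length : segmentValueLengths) {
            if (length != first) {
                return Format.VARIABLE_LENGTH;
            }
        }
        return Format.FIXED_LENGTH;
    }
}
```

Note that under this rule a merge can only keep or lose the fixed-length property; a merged segment never becomes fixed-length if any input segment was variable-length.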

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


