[ 
https://issues.apache.org/jira/browse/LUCENE-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2742:
------------------------------------

    Attachment: LUCENE-2742.patch

Here is a first patch - all tests pass. I changed the CodecProvider interface 
slightly so that it can hold per-field codecs as well as a default per-field 
codec. For simplicity users can not register their codec directly through the 
Fieldable interface. Internally I added a CodecInfo which handles all the 
ordering and registration per segment / field. For consistency I bound 
CodecInfo to FieldInfos since we are now operating per field. A codec can only 
be assigned once, the first time we see the codec during FieldInfos 
creation. 

There is a nocommit on Fieldable since it doesn't have javadoc, but let's 
iterate first to see if we want to go down that path - it seems close. 


> Enable native per-field codec support 
> --------------------------------------
>
>                 Key: LUCENE-2742
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2742
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index, Store
>    Affects Versions: 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>
>         Attachments: LUCENE-2742.patch
>
>
> Currently the codec name is stored for every segment, and PerFieldCodecWrapper 
> is used to enable per-field codecs, which has recently brought up some issues 
> (LUCENE-2740 and LUCENE-2741). When a codec name is stored, Lucene does not 
> record the actual codec used to encode a field's postings but rather the 
> "top-level" codec; in such a case the name of the top-level codec is 
> "PerField" instead of "Pulsing" or "Standard" etc. The way this composite 
> pattern works makes the indexing part of codecs simpler but also limits its 
> capabilities. By recording the top-level codec in the segments file we rely on 
> the user to "configure" the PerFieldCodecWrapper correctly to open a 
> SegmentReader. If a field's codec has changed in the meantime we won't be 
> able to open the segment.
> The issues LUCENE-2741 and LUCENE-2740 are actually closely related to the 
> way PFCW is implemented right now. PFCW blindly creates codecs per field on 
> request and at the same time has no control over the file naming, nor over 
> whether two codec instances are created for two distinct fields even if the 
> codec is the same. If so, FieldsConsumer will throw an exception 
> since the files it relies on are already created.
> Having PerFieldCodecWrapper AND a CodecProvider overcomplicates things IMO. 
> In order to use per-field codecs a user has to register their custom codecs 
> AND build a PFCW, which needs to be maintained in "user-land" and must not 
> change incompatibly once a new IW or IR is created. 
> What I would expect from Lucene is to enable me to register a codec in a 
> provider and then tell the Field which codec it should use for indexing. For 
> reading, Lucene should determine the codec automatically once a segment is 
> opened. If the codec is not available in the provider, that is a different 
> story. Once we instantiate the composite codec in SegmentReader, we get 
> only the codecs which are really used in this segment for free, which in turn 
> solves LUCENE-2740. 
> Yet, instead of relying on the user to configure PFCW, I suggest moving the 
> composite codec functionality inside the core and recording the distinct 
> codecs per segment in the segments info. We only really need the distinct 
> codecs used in that segment since the codec instance should be reused to 
> prevent additional files from being created. Let's say we have the following 
> codec mapping:
> {noformat}
> field_a:Pulsing
> field_b:Standard
> field_c:Pulsing
> {noformat}
> Then we create the following mapping:
> {noformat}
> SegmentInfo:
> [Pulsing, Standard]
> PerField:
> [field_a:0, field_b:1, field_c:0]
> {noformat}
> That way we can easily determine which codec is used for which field and 
> build the composite codec internally on opening SegmentReader. This ordering 
> has another advantage: if, as in this case, Pulsing and Standard use the 
> same types of files, we need a way to distinguish the files used per codec 
> within a segment. We can in turn pass the codec's ord (implicit in the 
> SegmentInfo) to the FieldsConsumer on creation to create files named 
> segmentname_ord.ext (or something similar). This solves LUCENE-2741. 
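
The ord mapping and file naming described above can be sketched roughly as 
follows. This is only an illustration, not the code from the attached patch: 
the class PerFieldCodecOrds and its methods are hypothetical names, the 
first-binding-wins rule in assign() is an assumption, and the segment name 
"_0" and extension "frq" are just example values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the per-segment codec bookkeeping; not the patch's CodecInfo. */
public class PerFieldCodecOrds {
    // Distinct codec names in first-seen order; the list index is the codec's ord.
    private final List<String> codecs = new ArrayList<>();
    // Field name -> ord into the codecs list.
    private final Map<String, Integer> fieldToOrd = new LinkedHashMap<>();

    /** Binds a field to a codec and returns the ord. Assumption: the first binding wins. */
    public int assign(String field, String codecName) {
        Integer existing = fieldToOrd.get(field);
        if (existing != null) {
            return existing;
        }
        int ord = codecs.indexOf(codecName);
        if (ord < 0) {
            // First time this codec is seen in the segment: give it the next ord.
            ord = codecs.size();
            codecs.add(codecName);
        }
        fieldToOrd.put(field, ord);
        return ord;
    }

    /** The distinct codecs that would be recorded for the segment. */
    public List<String> codecs() {
        return Collections.unmodifiableList(codecs);
    }

    /** The per-field ord mapping. */
    public Map<String, Integer> fieldOrds() {
        return Collections.unmodifiableMap(fieldToOrd);
    }

    /** Resolves a field back to its codec name, as a reader would on opening a segment. */
    public String codecFor(String field) {
        Integer ord = fieldToOrd.get(field);
        if (ord == null) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return codecs.get(ord);
    }

    /** Builds a per-codec file name of the form segmentname_ord.ext. */
    public static String fileName(String segment, int ord, String ext) {
        return segment + "_" + ord + "." + ext;
    }

    public static void main(String[] args) {
        PerFieldCodecOrds info = new PerFieldCodecOrds();
        info.assign("field_a", "Pulsing");
        info.assign("field_b", "Standard");
        info.assign("field_c", "Pulsing");
        System.out.println(info.codecs());    // [Pulsing, Standard]
        System.out.println(info.fieldOrds()); // {field_a=0, field_b=1, field_c=0}
        System.out.println(fileName("_0", info.assign("field_b", "Standard"), "frq")); // _0_1.frq
    }
}
```

Because field_a and field_c resolve to the same ord, they would share one 
codec instance and one set of files, while the ord in the file name keeps 
the Pulsing and Standard files apart within the segment.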

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
