[ https://issues.apache.org/jira/browse/LUCENE-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Simon Willnauer updated LUCENE-2742: ------------------------------------ Attachment: LUCENE-2742.patch Here is a first patch - all tests pass. I changed the CodecProvider interface slightly to be able to hold perField codecs as well as a default perField codec. For simplicity users can not register their codec directly though the Fieldable interface. Internally I added a CodecInfo which handles all the ordering and registration per segment / field. For consistency I bound CodecInfo to FieldInfos since we are now operating per field. A codec can only be assigned once, at the first time we see the codec during FieldInfos creation. there is a nocommit on Fieldable since it doesn't have javadoc but lets iterate first to see if we wanna go that path - it seems close. > Enable native per-field codec support > -------------------------------------- > > Key: LUCENE-2742 > URL: https://issues.apache.org/jira/browse/LUCENE-2742 > Project: Lucene - Java > Issue Type: Improvement > Components: Index, Store > Affects Versions: 4.0 > Reporter: Simon Willnauer > Assignee: Simon Willnauer > Fix For: 4.0 > > Attachments: LUCENE-2742.patch > > > Currently the codec name is stored for every segment and PerFieldCodecWrapper > is used to enable codecs per fields which has recently brought up some issues > (LUCENE-2740 and LUCENE-2741). When a codec name is stored lucene does not > respect the actual codec used to encode a fields postings but rather the > "top-level" Codec in such a case the name of the top-level codec is > "PerField" instead of "Pulsing" or "Standard" etc. The way this composite > pattern works make the indexing part of codecs simpler but also limits its > capabilities. By recoding the top-level codec in the segments file we rely on > the user to "configure" the PerFieldCodecWrapper correctly to open a > SegmentReader. If a fields codec has changed in the meanwhile we won't be > able to open the segment. > The issues LUCENE-2741 and LUCENE-2740 are actually closely related to the > way PFCW is implemented right now. PFCW blindly creates codecs per field on > request and at the same time doesn't have any control over the file naming > nor if a two codec instances are created for two distinct fields even if the > codec instance is the same. If so FieldsConsumer will throw an exception > since the files it relies on are already created. > Having PerFieldCodecWrapper AND a CodecProvider overcomplicates things IMO. > In order to use per field codec a user should on the one hand register its > custom codecs AND needs to build a PFCW which needs to be maintained in the > "user-land" an must not change incompatible once a new IW of IR is created. > What I would expect from Lucene is to enable me to register a codec in a > provider and then tell the Field which codec it should use for indexing. For > reading lucene should determ the codec automatically once a segment is > opened. if the codec is not available in the provider that is a different > story. Once we instantiate the composite codec in SegmentsReader we only have > the codecs which are really used in this segment for free which in turn > solves LUCENE-2740. > Yet, instead of relying on the user to configure PFCW I suggest to move > composite codec functionality inside the core an record the distinct codecs > per segment in the segments info. We only really need the distinct codecs > used in that segment since the codec instance should be reused to prevent > additional files to be created. Lets say we have the follwing codec mapping : > {noformat} > field_a:Pulsing > field_b:Standard > field_c:Pulsing > {noformat} > then we create the following mapping: > {noformat} > SegmentInfo: > [Pulsing, Standard] > PerField: > [field_a:0, field_b:1, field_c:0] > {noformat} > that way we can easily determ which codec is used for which field an build > the composite - codec internally on opening SegmentsReader. This ordering has > another advantage, if like in that case pulsing and standard use really the > same type of files we need a way to distinguish the used files per codec > within a segment. We can in turn pass the codec's ord (implicit in the > SegmentInfo) to the FieldConsumer on creation to create files with > segmentname_ord.ext (or something similar). This solvel LUCENE-2741). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org