[jira] [Commented] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

Han Jiang (JIRA) Fri, 16 Aug 2013 18:44:15 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742787#comment-13742787
 ]


Han Jiang commented on LUCENE-5179:
-----------------------------------

bq. Is it for real back compat or for "impersonation" ?
bq. Real back-compat (reader can read the old index format using the new APIs) 
should work fine, I think?

Yes, this should be 'impersonation', but actually the back-compat I mentioned 
is a weak requirement.
I'm not happy with this revert either, so let's see if we can do something to 
hack it! :)

The strong requirement is that, if we need pulsing to work with the new API, there 
should be something to tell pulsing how to encode each term.

Ideally pulsing should tell the term dict longsSize=0, while maintaining the wrapped 
PF's longsSize.

The calling chain is:

{noformat}
 termdict ~~finishTermA(long[0], byte[]...)~> pulsing ~~finishTermB(long[3], 
byte[]...)~> wrappedPF
{noformat}
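
A minimal Java sketch of that chain, under stated assumptions: `TermMetaWriter`, `WrappedWriter` and `PulsingLikeWriter` are hypothetical stand-ins, not Lucene's PostingsWriterBase or PulsingPostingsWriter, and the file-pointer growth is invented. It only shows pulsing advertising longsSize=0 to the term dict while still handing long[3] to the wrapped PF.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical stand-in for a postings writer; NOT Lucene's PostingsWriterBase. */
interface TermMetaWriter {
    int longsSize(); // number of longs the caller must supply to finishTerm
    void finishTerm(long[] longs, ByteArrayOutputStream bytes, int docFreq);
}

/** Plays the wrapped PF: reports three absolute file pointers per term. */
final class WrappedWriter implements TermMetaWriter {
    private long docFP, posFP, payFP;

    public int longsSize() { return 3; }

    public void finishTerm(long[] longs, ByteArrayOutputStream bytes, int docFreq) {
        // Invented growth rates, only so that the pointers move.
        docFP += 10L * docFreq;
        posFP += 20L * docFreq;
        payFP += 5L * docFreq;
        longs[0] = docFP;
        longs[1] = posFP;
        longs[2] = payFP;
    }
}

/** Plays pulsing: longsSize=0 towards the term dict, long[3] towards the wrapped PF. */
final class PulsingLikeWriter implements TermMetaWriter {
    private final TermMetaWriter wrapped;
    private final int maxInlinedDocs;
    final long[] wrappedLongs; // what the wrapped PF reported for the last non-inlined term

    PulsingLikeWriter(TermMetaWriter wrapped, int maxInlinedDocs) {
        this.wrapped = wrapped;
        this.maxInlinedDocs = maxInlinedDocs;
        this.wrappedLongs = new long[wrapped.longsSize()];
    }

    public int longsSize() { return 0; }

    public void finishTerm(long[] longs, ByteArrayOutputStream bytes, int docFreq) {
        if (docFreq <= maxInlinedDocs) {
            bytes.write(1); // marker: postings are inlined into byte[]
            return;
        }
        bytes.write(0); // marker: postings live in the wrapped PF
        wrapped.finishTerm(wrappedLongs, bytes, docFreq);
        // Open question from this comment: wrappedLongs still has to reach disk,
        // and pulsing does not know which term it should be delta-encoded against.
    }
}
```

In this sketch the term dict passes `new long[0]` to the pulsing-like writer, so the three longs from the wrapped writer never become visible at the term dict level.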

Take the terms=[ "a" "a1" ... ] example. When term "b" is finished:

the wrappedPF will fill long[] and byte[] with its metadata, and pulsing will 
instead fill byte[]
as its 'fake' metadata. When the term is not inlined, pulsing will have to encode 
the wrapped PF's long[] into byte[],
but it's too early! Term "b" should be delta-encoded against term "a", and 
pulsing will never know this...
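
To make the "too early" problem concrete, here is a small Java sketch. The file-pointer values and the `DeltaBaseDemo` class are invented for illustration. It assumes the block layout from the issue description, where the reader sees "a" as the predecessor of "b", while the writer finished "a2" just before "b".

```java
final class DeltaBaseDemo {
    // Invented absolute file pointers, in the order the terms are finished.
    static final String[] TERMS = {"a", "a1", "a2", "b"};
    static final long[] DOC_FP = {100L, 180L, 250L, 400L};

    static long fpOf(String term) {
        for (int i = 0; i < TERMS.length; i++) {
            if (TERMS[i].equals(term)) return DOC_FP[i];
        }
        throw new IllegalArgumentException("unknown term: " + term);
    }

    /** Delta stored on disk when 'term' is encoded against 'baseTerm'. */
    static long encode(String term, String baseTerm) {
        return fpOf(term) - fpOf(baseTerm);
    }

    /** Value a reader reconstructs when it adds the stored delta to the base it sees. */
    static long decode(long storedDelta, String readerBaseTerm) {
        return fpOf(readerBaseTerm) + storedDelta;
    }
}
```

Encoding "b" at finishTerm time gives `encode("b", "a2")`, and a reader that adds this delta to "a" does not get back the pointer of "b". Only `encode("b", "a")` round-trips, and that base is known only once the term dict has decided the block layout.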

If we only need pulsing to work, there is a trade-off: pulsing returns the 
wrapped PF's longsSize,
and the term dict can do the buffering. For Lucene41Pulsing with position+payloads, 
we'll have to write 3 zero VLongs
along with the pulsing byte[] for an inlined term... and it's not actually 
'pulsing' then.
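
The cost of that trade-off can be sketched in Java as follows. `VLongOverheadDemo` is a hypothetical helper, and `writeVLong` is a re-implementation of the usual 7-bits-per-byte variable-length scheme written for this illustration, not code copied from Lucene. The assumption is that an inlined term leaves the wrapped PF's file pointers unchanged, so each of its longs is written as a zero.

```java
import java.io.ByteArrayOutputStream;

final class VLongOverheadDemo {
    /** Variable-length encoding: 7 payload bits per byte, high bit set if more bytes follow. */
    static void writeVLong(ByteArrayOutputStream out, long v) {
        if (v < 0) throw new IllegalArgumentException("negative values not supported: " + v);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7FL) | 0x80L));
            v >>>= 7;
        }
        out.write((int) v);
    }

    /** Metadata for an inlined term: longsSize zero VLongs followed by the pulsing bytes. */
    static byte[] encodeInlinedTerm(int longsSize, byte[] pulsingBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < longsSize; i++) {
            writeVLong(out, 0L); // one byte each, carrying no information
        }
        out.write(pulsingBytes, 0, pulsingBytes.length);
        return out.toByteArray();
    }
}
```

With longsSize=3 every inlined term carries three extra bytes in front of its pulsing byte[]; with longsSize=0 only the pulsing bytes are written.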




                
> Refactoring on PostingsWriterBase for delta-encoding
> ----------------------------------------------------
>
>                 Key: LUCENE-5179
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5179
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>             Fix For: 5.0, 4.5
>
>         Attachments: LUCENE-5179.patch
>
>
> A further step from LUCENE-5029.
> The short story is, the previous API change brings two problems:
> * it somewhat breaks backward compatibility: although we can still read the old 
> format,
>   we can no longer reproduce it;
> * the pulsing codec has a problem with it.
> And the long story...
> With the change, the current PostingsBase API will be like this:
> * term dict tells PBF we start a new term (via startTerm());
> * PBF adds docs, positions and other postings data;
> * term dict tells PBF all the data for the current term is completed (via 
> finishTerm()),
>   then PBF returns the metadata for the current term (as long[] and byte[]);
> * term dict might buffer all the metadata in an ArrayList. When all the terms 
> are collected,
>   it then decides how the metadata will be laid out on disk.
> So after the API change, PBF no longer has that annoying 'flushTermBlock', 
> and instead
> the term dict maintains the <term, metadata> list.
> However, for each term we'll now write the long[] blob before byte[], so the 
> index format is not consistent with pre-4.5.
> For example, in Lucene41 the metadata can be written as longA,bytesA,longB, but now 
> we have to write it as longA,longB,bytesA.
> Another problem is that the pulsing codec cannot tell the wrapped PBF how the metadata is 
> delta-encoded; after all,
> PulsingPostingsWriter is only a PBF.
> For example, we have terms=["a", "a1", "a2", "b", "b1", "b2"] and 
> itemsInBlock=2, so theoretically
> we'll finally have three blocks in BTTR: ["a" "b"]  ["a1" "a2"]  ["b1" "b2"]. 
> With this
> approach, the metadata of term "b" is delta-encoded based on the metadata of "a", 
> but when term dict tells
> PBF to finishTerm("b"), it might naively delta-encode based on term "a2".
> So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput 
> out, FieldInfo, TermState, boolean absolute)',
> so that during metadata flush, we can control how the current term is written? 
> And the term dict will buffer TermState, which
> implicitly holds metadata like we do on the PBReader side.
> For example, if we want to reproduce the old lucene41 format, we can simply set 
> longsSize==0, then PBF
> writes the old format (longA,bytesA,longB) to DataOutput, and the compatibility 
> issue is solved.
> For the pulsing codec, it will also be able to tell the lower level how to encode 
> metadata.

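The encodeTerm idea from the quoted description can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the proposed Lucene signature: `EncodeTermSketch` and `State` are hypothetical, the per-term state is reduced to a single file pointer, and the encoded values are returned as a list instead of being written to a DataOutput.

```java
import java.util.ArrayList;
import java.util.List;

final class EncodeTermSketch {
    /** Hypothetical buffered per-term state: one absolute file pointer. */
    static final class State {
        final String term;
        final long docFP;

        State(String term, long docFP) {
            this.term = term;
            this.docFP = docFP;
        }
    }

    private long lastDocFP; // base for delta-encoding inside the current block

    /** Simplified analogue of encodeTerm(..., boolean absolute). */
    long encodeTerm(State state, boolean absolute) {
        if (absolute) {
            lastDocFP = 0L; // start of a block: forget the previous base
        }
        long encoded = state.docFP - lastDocFP;
        lastDocFP = state.docFP;
        return encoded;
    }

    /** Called by the term dict at flush time, once the block contents are known. */
    List<Long> flushBlock(List<State> block) {
        List<Long> encoded = new ArrayList<>();
        for (int i = 0; i < block.size(); i++) {
            encoded.add(encodeTerm(block.get(i), i == 0));
        }
        return encoded;
    }
}
```

Because the term dict buffers the states and calls `flushBlock` per block, "b" in the block ["a" "b"] is encoded against "a" rather than against "a2", whatever order the terms were finished in.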
