[ https://issues.apache.org/jira/browse/LUCENE-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262179#comment-13262179 ]

Robert Muir commented on LUCENE-2946:
-------------------------------------

We can go into detail, but we can't do bit-for-bit even with StandardCodec...
it's simply not feasible.

For the simple metadata files, and even stored fields and postings, it's fine
(for now), but e.g. going bit-for-bit with the packed integer compression of
docvalues isn't very realistic, nor is trying to explain how the FSTs for
blocktree are serialized.

Hell, at my current pace I'll just be happy if we can even document all the
different docvalues types, give some general idea of how they are encoded, or
give a high-level explanation of the terms dictionary.

Even the existing "simple" metadata files are a pretty serious effort, because
most of the existing docs are wildly out of date.

I figure all of this is OK (I'm heavy committing) since we essentially have
nothing today: just out-of-date, useless docs :)

                
> change file format documentation from "bit-for-bit" to highlevel
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2946
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2946
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: general/website
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>
> While reviewing website docs in LUCENE-2924,
> I noticed that the existing fileformats documentation is going to be pretty
> hopeless for 4.0.
> Before, it described the format "bit-for-bit", but with flexible indexing this
> is somewhat silly (and who really wants a bit-for-bit explanation of some of
> the new formats!)
> I think it would be much better to give a high-level overview, perhaps 
> linking to javadocs or
> even source code for the low-level details. 
> We probably should delay this until 4.0 is really close in sight (since 
> things are changing so fast) but we can go ahead and think about it some now.
> For example:
> * high level explanation of what a codec is, and the various subsystems one 
> is usually composed of (terms index, terms data, skiplist impl, postings 
> impl, etc). We can reiterate that you can make your own, and hopefully this 
> kind of documentation will actually encourage that.
> * high level explanation of what StandardCodec is "composed of". For example,
> assume it's Variable Terms Index, Block Terms Reader, PForDelta docs and freqs,
> and Simple64 positions. I think this is really the only codec we should try
> to "diagram" in any way.
> * high level explanation (probably with links) of some of the components. For 
> example we could explain what the purpose of a Terms Index is, and that this 
> implementation uses a finite state transducer to find the terms block for a 
> given term. In this case maybe we have an image now that Dawid made the toDot 
> useful.
> * high level explanation (probably with links) of some of the compression 
> algorithms. For example, we could explain the basics of the available 
> algorithms we have (vbyte/simple/for/pfor/...) and what their advantages and 
> disadvantages are.
> Some of the things I mentioned here are probably optional; for instance, I
> think it's "enough" to give a high-level overview of StandardCodec, but I
> can't help but think that explaining some of the architecture will be useful
> for new developers.


        
