Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Michael Busch Tue, 13 Oct 2009 09:44:01 -0700

Shall we first remove the remaining deprecations from the indexer package? There are not many left, so it shouldn't be much work.


 Michael


On 10/13/09 5:47 AM, Michael McCandless wrote:

OK I will cut a branch & commit Mark's last patch onto it, unless
anyone has objections soonish...

I'll also branch (twig?) the back compat branch so we can commit the
patch there as well.

Mike

On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller <[email protected]> wrote:

SVN is about as good at merging branches as any of us are with a patch
and trunk unfortunately. But that can still be somewhat more convenient
than all these huge patches, with different people at different stages.

Depends on how many people end up working on this though. Any more than
2, and I think the branch has got to be worth it.

From my perspective, it doesn't make any of the merging process any
easier - but it can be easier than juggling all these patches - you have
a central code base that can always be targeted for current merging.

Michael Busch wrote:

I think it's supposed to work pretty well - though I have no personal
experience with merging branches with svn.

I think we should try it - then we'll know! :)

  Michael

On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:

      [
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799
]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. Shall we create a flexible-indexing branch and commit this?

I think this is a good idea.

But I haven't played heavily w/ svn & branching.  EG if we branch
now, and trunk moves fast (which it still is w/ deprecation
removals), are we going to have conflicts?  Or... is svn good about
merging branches?

Further steps towards flexible indexing
---------------------------------------

                  Key: LUCENE-1458
                  URL: https://issues.apache.org/jira/browse/LUCENE-1458
              Project: Lucene - Java
           Issue Type: New Feature
           Components: Index
     Affects Versions: 2.9
             Reporter: Michael McCandless
             Assignee: Michael McCandless
             Priority: Minor
          Attachments: LUCENE-1458-back-compat.patch,
LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
LUCENE-1458.tar.bz2


I attached a very rough checkpoint of my current patch, to get early
feedback.  All tests pass, though back compat tests don't pass due to
changes to package-private APIs plus certain bugs in tests that
happened to work (eg call TermPositions.nextPosition() too many times,
which the new API asserts against).
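To make the test bug above concrete, here is a minimal sketch of a positions reader that refuses to be read past the term frequency. The class `CheckedPositions` and its methods are hypothetical names for illustration only, not the classes in the patch, and an exception stands in for the assertion so the check fires even when Java assertions are disabled.

```java
/**
 * Hypothetical sketch, not Lucene's actual API: a positions reader for one
 * term in one document that fails fast when read too many times.
 */
final class CheckedPositions {
    private final int[] positions; // positions of the term in the current doc
    private int next = 0;          // number of positions consumed so far

    CheckedPositions(int[] positions) {
        this.positions = positions.clone();
    }

    /** Term frequency in the current doc: the number of valid nextPosition() calls. */
    int freq() {
        return positions.length;
    }

    /** Returns the next position; throws if called more than freq() times. */
    int nextPosition() {
        if (next >= positions.length) {
            throw new IllegalStateException(
                "nextPosition() called " + (next + 1)
                    + " times but freq() is " + positions.length);
        }
        return positions[next++];
    }
}
```

A caller that loops `freq()` times is fine; a test that calls `nextPosition()` once more than that now fails instead of silently reading garbage.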
[Aside: I think, when we commit changes to package-private APIs such
that back-compat tests don't pass, we could go back, make a branch on
the back-compat tag, commit changes to the tests to use the new
package private APIs on that branch, then fix nightly build to use the
tip of that branch?]
There's still plenty to do before this is committable! This is a
rather large change:
    * Switches to a new more efficient terms dict format.  This still
      uses tii/tis files, but the tii only stores term & long offset
      (not a TermInfo).  At seek points, tis encodes term & freq/prox
      offsets absolutely instead of with deltas.  Also, tis/tii
      are structured by field, so we don't have to record field number
      in every term.

      On the first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
      -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).

      RAM usage when loading terms dict index is significantly less
      since we only load an array of offsets and an array of String (no
      more TermInfo array).  It should be faster to init too.

      This part is basically done.
    * Introduces modular reader codec that strongly decouples terms dict
      from docs/positions readers.  EG there is no more TermInfo used
      when reading the new format.

      There's nice symmetry now between reading & writing in the codec
      chain -- the current docs/prox format is captured in:
{code}
FormatPostingsTermsDictWriter/Reader
FormatPostingsDocsWriter/Reader (.frq file) and
FormatPostingsPositionsWriter/Reader (.prx file).
{code}
      This part is basically done.
    * Introduces a new "flex" API for iterating through the fields,
      terms, docs and positions:
{code}
FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
{code}
      This replaces TermEnum/TermDocs/TermPositions.  SegmentReader emulates the
      old API on top of the new API to keep back-compat.
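As a rough illustration of the terms dict point in the first bullet (offsets written absolutely at seek points and as deltas everywhere else), a sketch might look like the following. The class name, the `INTERVAL` value and the flat `long[]` layout are assumptions made for the example; the actual on-disk format in the patch is not shown here.

```java
/**
 * Hypothetical sketch of seek-point encoding: every INTERVAL-th entry holds
 * an absolute file offset, all other entries hold the delta from the
 * previous entry. Not the patch's real file format.
 */
final class SeekPointOffsets {
    /** Assumed index interval: every INTERVAL-th term is a seek point. */
    static final int INTERVAL = 4;

    /** Encodes offsets: absolute at seek points, delta from the previous entry elsewhere. */
    static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INTERVAL == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /** Recovers the offset of term number ord, starting from the nearest preceding seek point. */
    static long offsetAt(long[] encoded, int ord) {
        int seek = (ord / INTERVAL) * INTERVAL;
        long offset = encoded[seek]; // absolute value: no earlier state is needed
        for (int i = seek + 1; i <= ord; i++) {
            offset += encoded[i];    // apply deltas up to the requested term
        }
        return offset;
    }
}
```

Because the value at a seek point is absolute, a reader can jump straight to it from the index and start decoding there, without replaying any deltas that came before.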

Next steps:
    * Plug in new codecs (pulsing, pfor) to exercise the modularity /
      fix any hidden assumptions.
    * Expose new API out of IndexReader, deprecate old API but emulate
      old API on top of new one, switch all core/contrib users to the
      new API.
    * Maybe switch to AttributeSources as the base class for TermsEnum,
      DocsEnum, PostingsEnum -- this would give readers API flexibility
      (not just index-file-format flexibility).  EG if someone wanted
      to store payload at the term-doc level instead of
      term-doc-position level, you could just add a new attribute.
    * Test performance & iterate.
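The attribute idea in the third item above could be sketched roughly as follows. This is a toy stand-in written for illustration, not Lucene's AttributeSource API: `ToyAttributeSource`, `DocPayloadAttribute` and `ToyDocsEnum` are hypothetical names, and the registry is just a map keyed by attribute class.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for an attribute source; NOT Lucene's AttributeSource API. */
class ToyAttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    /** Registers an attribute instance under its class. */
    <T> void addAttribute(Class<T> type, T instance) {
        attributes.put(type, instance);
    }

    /** Returns the attribute registered for type, or null if none was added. */
    <T> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }
}

/** Hypothetical attribute carrying a payload stored once per term-doc pair. */
final class DocPayloadAttribute {
    byte[] payload = new byte[0];
}

/** Hypothetical docs enumerator that exposes extra data only through attributes. */
final class ToyDocsEnum extends ToyAttributeSource {
    private final int[] docs;
    private int upto = -1;

    ToyDocsEnum(int[] docs) {
        this.docs = docs.clone();
    }

    /** Advances to the next doc; returns -1 when exhausted. */
    int nextDoc() {
        upto++;
        return upto < docs.length ? docs[upto] : -1;
    }
}
```

In this sketch a codec that stores a payload per term-doc would register a `DocPayloadAttribute` on the enumerator, and consumers that know about it would ask for it by class, while the enumerator's own method signatures stay unchanged.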


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


--
- Mark

http://www.lucidimagination.com




