RE: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Uwe Schindler Tue, 13 Oct 2009 08:06:44 -0700

I think the big changes in the o.a.l.search package are over... :-) I worked
on it the whole day.


Merging branches with TortoiseSVN works really well; you can even edit the
conflicts directly in the diff view. I used it when fixing the IR/IW
deprecation hell in the BW branch.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Michael McCandless [mailto:[email protected]]
> Sent: Tuesday, October 13, 2009 5:01 PM
> To: [email protected]
> Subject: Re: [jira] Commented: (LUCENE-1458) Further steps towards
> flexible indexing
> 
> Yes please!
> 
> Mike
> 
> On Tue, Oct 13, 2009 at 10:40 AM, Mark Miller <[email protected]> wrote:
> > I can trunk it once more if you'd like - it's already pretty out of date :)
> >
> > If you haven't started anyway ...
> >
> >
> > Michael McCandless wrote:
> >> OK I will cut a branch & commit Mark's last patch onto it, unless
> >> anyone has objections soonish...
> >>
> >> I'll also branch (twig?) the back compat branch so we can commit the
> >> patch there as well.
> >>
> >> Mike
> >>
> >> On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller <[email protected]> wrote:
> >>
> >>> SVN is about as good at merging branches as any of us are with a patch
> >>> and trunk unfortunately. But that can still be somewhat more convenient
> >>> than all these huge patches, with different people at different stages.
> >>>
> >>> Depends on how many people end up working on this though. Any more than
> >>> 2, and I think the branch has got to be worth it.
> >>>
> >>> From my perspective, it doesn't make any of the merging process any
> >>> easier - but it can be easier than juggling all these patches - you have
> >>> a central code base that can always be targeted for current merging.
> >>>
> >>> Michael Busch wrote:
> >>>
> >>>> I think it's supposed to work pretty well - though I have no personal
> >>>> experience with merging branches with svn.
> >>>>
> >>>> I think we should try it - then we'll know! :)
> >>>>
> >>>>  Michael
> >>>>
> >>>> On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:
> >>>>
> >>>>>      [
> >>>>> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799
> >>>>> ]
> >>>>>
> >>>>> Michael McCandless commented on LUCENE-1458:
> >>>>> --------------------------------------------
> >>>>>
> >>>>> bq. Shall we create a flexible-indexing branch and commit this?
> >>>>>
> >>>>> I think this is a good idea.
> >>>>>
> >>>>> But I haven't played heavily w/ svn & branching.  EG if we branch
> >>>>> now, and trunk moves fast (which it still is w/ deprecation
> >>>>> removals), are we going to have conflicts?  Or... is svn good about
> >>>>> merging branches?
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Further steps towards flexible indexing
> >>>>>> ---------------------------------------
> >>>>>>
> >>>>>>                  Key: LUCENE-1458
> >>>>>>                  URL: https://issues.apache.org/jira/browse/LUCENE-1458
> >>>>>>              Project: Lucene - Java
> >>>>>>           Issue Type: New Feature
> >>>>>>           Components: Index
> >>>>>>     Affects Versions: 2.9
> >>>>>>             Reporter: Michael McCandless
> >>>>>>             Assignee: Michael McCandless
> >>>>>>             Priority: Minor
> >>>>>>          Attachments: LUCENE-1458-back-compat.patch,
> >>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> >>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> >>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> >>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> >>>>>> LUCENE-1458.tar.bz2
> >>>>>>
> >>>>>>
> >>>>>> I attached a very rough checkpoint of my current patch, to get early
> >>>>>> feedback.  All tests pass, though back compat tests don't pass due to
> >>>>>> changes to package-private APIs plus certain bugs in tests that
> >>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
> >>>>>> which the new API asserts against).
> >>>>>>
> >>>>>> [Aside: I think, when we commit changes to package-private APIs such
> >>>>>> that back-compat tests don't pass, we could go back, make a branch on
> >>>>>> the back-compat tag, commit changes to the tests to use the new
> >>>>>> package private APIs on that branch, then fix nightly build to use the
> >>>>>> tip of that branch?]
> >>>>>>
> >>>>>> There's still plenty to do before this is committable! This is a
> >>>>>> rather large change:
> >>>>>>    * Switches to a new more efficient terms dict format.  This still
> >>>>>>      uses tii/tis files, but the tii only stores term & long offset
> >>>>>>      (not a TermInfo).  At seek points, tis encodes term & freq/prox
> >>>>>>      offsets absolutely instead of with deltas.  Also, tis/tii
> >>>>>>      are structured by field, so we don't have to record field number
> >>>>>>      in every term.
> >>>>>>
> >>>>>>      On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> >>>>>>      -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> >>>>>>
> >>>>>>      RAM usage when loading terms dict index is significantly less
> >>>>>>      since we only load an array of offsets and an array of String (no
> >>>>>>      more TermInfo array).  It should be faster to init too.
> >>>>>>
> >>>>>>      This part is basically done.
> >>>>>>    * Introduces modular reader codec that strongly decouples terms dict
> >>>>>>      from docs/positions readers.  EG there is no more TermInfo used
> >>>>>>      when reading the new format.
> >>>>>>
> >>>>>>      There's nice symmetry now between reading & writing in the codec
> >>>>>>      chain -- the current docs/prox format is captured in:
> >>>>>> {code}
> >>>>>> FormatPostingsTermsDictWriter/Reader
> >>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
> >>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
> >>>>>> {code}
> >>>>>>      This part is basically done.
> >>>>>>    * Introduces a new "flex" API for iterating through the fields,
> >>>>>>      terms, docs and positions:
> >>>>>> {code}
> >>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> >>>>>> {code}
> >>>>>>      This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> >>>>>>      old API on top of the new API to keep back-compat.
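
To make the enum chain above concrete, here is a minimal Java sketch of
walking fields, terms, docs and positions. The interface names follow the
chain in the description, but every method signature, the in-memory index,
and the -1/null end markers are assumptions for illustration only; this is
not the API in the patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: names mirror the chain in the description,
// but the signatures are assumptions, not the patch's API.
final class FlexSketch {
    interface PostingsEnum { int nextPosition(); }                  // -1 when exhausted
    interface DocsEnum { int nextDoc(); PostingsEnum positions(); } // -1 when exhausted
    interface TermsEnum { String next(); DocsEnum docs(); }         // null when exhausted
    interface FieldProducer { Iterable<String> fields(); TermsEnum terms(String field); }

    // Toy in-memory index: field -> term -> doc id -> positions.
    static FieldProducer inMemory(final Map<String, Map<String, Map<Integer, int[]>>> index) {
        return new FieldProducer() {
            public Iterable<String> fields() { return index.keySet(); }
            public TermsEnum terms(String field) {
                final Iterator<Map.Entry<String, Map<Integer, int[]>>> termIt =
                        index.get(field).entrySet().iterator();
                return new TermsEnum() {
                    private Map<Integer, int[]> currentDocs;
                    public String next() {
                        if (!termIt.hasNext()) return null;
                        Map.Entry<String, Map<Integer, int[]>> e = termIt.next();
                        currentDocs = e.getValue();
                        return e.getKey();
                    }
                    public DocsEnum docs() { return docsOf(currentDocs); }
                };
            }
        };
    }

    private static DocsEnum docsOf(Map<Integer, int[]> docs) {
        final Iterator<Map.Entry<Integer, int[]>> docIt = docs.entrySet().iterator();
        return new DocsEnum() {
            private int[] currentPositions;
            public int nextDoc() {
                if (!docIt.hasNext()) return -1;
                Map.Entry<Integer, int[]> e = docIt.next();
                currentPositions = e.getValue();
                return e.getKey();
            }
            public PostingsEnum positions() {
                final int[] p = currentPositions;
                return new PostingsEnum() {
                    private int i = 0;
                    public int nextPosition() { return i < p.length ? p[i++] : -1; }
                };
            }
        };
    }

    // Walk the whole chain and record one line per (field, term, doc, position).
    static List<String> walk(FieldProducer fp) {
        List<String> out = new ArrayList<>();
        for (String field : fp.fields()) {
            TermsEnum terms = fp.terms(field);
            for (String term = terms.next(); term != null; term = terms.next()) {
                DocsEnum docs = terms.docs();
                for (int doc = docs.nextDoc(); doc != -1; doc = docs.nextDoc()) {
                    PostingsEnum positions = docs.positions();
                    for (int p = positions.nextPosition(); p != -1; p = positions.nextPosition()) {
                        out.add(field + ":" + term + " doc=" + doc + " pos=" + p);
                    }
                }
            }
        }
        return out;
    }

    // Made-up sample data: one field with two terms.
    static Map<String, Map<String, Map<Integer, int[]>>> sampleIndex() {
        Map<Integer, int[]> flex = new TreeMap<>();
        flex.put(0, new int[] {1, 4});
        flex.put(2, new int[] {7});
        Map<Integer, int[]> index = new TreeMap<>();
        index.put(0, new int[] {2});
        Map<String, Map<Integer, int[]>> body = new TreeMap<>();
        body.put("flex", flex);
        body.put("index", index);
        Map<String, Map<String, Map<Integer, int[]>>> fields = new TreeMap<>();
        fields.put("body", body);
        return fields;
    }

    public static void main(String[] args) {
        for (String line : walk(inMemory(sampleIndex()))) {
            System.out.println(line);
        }
    }
}
```

The point of the nesting is that each level only hands out the next level
down, so a codec could swap the docs or positions reader without the terms
level knowing.
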
> >>>>>>
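
On the first bullet above (absolute offsets at seek points instead of
deltas): a rough Java sketch of why that helps. It assumes a fixed index
interval and plain long arrays in place of the real on-disk encoding; the
class and method names are made up for illustration and are not from the
patch.

```java
// Hypothetical sketch: every `interval`-th entry is stored absolutely, the
// rest as deltas from the previous entry. The real tis/tii layout differs.
final class SeekPointSketch {
    // Encode ascending file offsets with an absolute value at each seek point.
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    // Decode one entry starting from the nearest preceding seek point, so no
    // state from earlier entries is needed.
    static long decodeAt(long[] encoded, int interval, int target) {
        int seek = (target / interval) * interval;
        long value = encoded[seek];
        for (int i = seek + 1; i <= target; i++) {
            value += encoded[i];
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {10, 25, 40, 100, 130, 131, 500};
        long[] encoded = encode(offsets, 3);
        System.out.println(decodeAt(encoded, 3, 5));
    }
}
```

With pure delta coding a reader jumping to a seek point would still need
the running total from everything before it; storing the absolute value
there removes that dependency.
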
> >>>>>> Next steps:
> >>>>>>    * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> >>>>>>      fix any hidden assumptions.
> >>>>>>    * Expose new API out of IndexReader, deprecate old API but emulate
> >>>>>>      old API on top of new one, switch all core/contrib users to the
> >>>>>>      new API.
> >>>>>>    * Maybe switch to AttributeSources as the base class for TermsEnum,
> >>>>>>      DocsEnum, PostingsEnum -- this would give readers API flexibility
> >>>>>>      (not just index-file-format flexibility).  EG if someone wanted
> >>>>>>      to store payload at the term-doc level instead of
> >>>>>>      term-doc-position level, you could just add a new attribute.
> >>>>>>    * Test performance & iterate.
> >>>>>>
> >>>>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [email protected]
> >>>> For additional commands, e-mail: [email protected]
> >>>>
> >>>>
> >>> --
> >>> - Mark
> >>>
> >>> http://www.lucidimagination.com
> >
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
> 



