I've added the missing enum classes, but everything else is looking good so far.
Michael McCandless (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765234#action_12765234 ]
>
> Michael McCandless commented on LUCENE-1458:
> --------------------------------------------
>
> OK I think I've committed Mark's last patch onto this branch:
>
> https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458
>
> and I also branched the 2.9 back-compat branch and committed the last back
> compat patch:
>
> https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458_2_9_back_compat_tests
>
> Mark can you check it out & see if I missed anything?
>
>
>> Further steps towards flexible indexing
>> ---------------------------------------
>>
>> Key: LUCENE-1458
>> URL: https://issues.apache.org/jira/browse/LUCENE-1458
>> Project: Lucene - Java
>> Issue Type: New Feature
>> Components: Index
>> Affects Versions: 2.9
>> Reporter: Michael McCandless
>> Assignee: Michael McCandless
>> Priority: Minor
>> Attachments: LUCENE-1458-back-compat.patch,
>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>> LUCENE-1458.tar.bz2
>>
>>
>> I attached a very rough checkpoint of my current patch, to get early
>> feedback. All tests pass, though back compat tests don't pass due to
>> changes to package-private APIs plus certain bugs in tests that
>> happened to work (e.g. calling TermPositions.nextPosition() too many
>> times, which the new API asserts against).
>> [Aside: I think, when we commit changes to package-private APIs such
>> that back-compat tests don't pass, we could go back, make a branch on
>> the back-compat tag, commit changes to the tests to use the new
>> package-private APIs on that branch, then fix the nightly build to use
>> the tip of that branch?]
>> There's still plenty to do before this is committable! This is a
>> rather large change:
>> * Switches to a new more efficient terms dict format. This still
>>   uses tii/tis files, but the tii only stores term & long offset
>>   (not a TermInfo). At seek points, tis encodes term & freq/prox
>>   offsets absolutely instead of with deltas. Also, tis/tii
>>   are structured by field, so we don't have to record field number
>>   in every term.
>>   .
>>   On the first 1M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>   -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>   .
>>   RAM usage when loading terms dict index is significantly less
>>   since we only load an array of offsets and an array of String (no
>>   more TermInfo array). It should be faster to init too.
>>   .
>>   This part is basically done.
>> * Introduces a modular reader codec that strongly decouples terms dict
>>   from docs/positions readers. EG there is no more TermInfo used
>>   when reading the new format.
>>   .
>>   There's nice symmetry now between reading & writing in the codec
>>   chain -- the current docs/prox format is captured in:
>> {code}
>> FormatPostingsTermsDictWriter/Reader
>> FormatPostingsDocsWriter/Reader (.frq file) and
>> FormatPostingsPositionsWriter/Reader (.prx file).
>> {code}
>>   This part is basically done.
>> * Introduces a new "flex" API for iterating through the fields,
>>   terms, docs and positions:
>> {code}
>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>> {code}
>>   This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>>   old API on top of the new API to keep back-compat.
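
To make the terms index description in the first bullet above concrete, here is a minimal Java sketch. It is not code from the patch. It assumes the per-field index is held as a sorted String[] of index terms plus a parallel long[] of offsets into the tis file, and that a lookup is a binary search for the last index term at or before the target. The class name TermsIndexSketch, the method seekOffset, and the sample terms and offsets are all made up for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only: not a class from the LUCENE-1458 patch.
final class TermsIndexSketch {
    // Index terms for one field, sorted in String.compareTo order (an assumption).
    private final String[] indexTerms;
    // tisOffsets[i] is the assumed file offset in the tis file for indexTerms[i].
    private final long[] tisOffsets;

    TermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
        if (indexTerms.length != tisOffsets.length) {
            throw new IllegalArgumentException("terms and offsets must be parallel arrays");
        }
        this.indexTerms = indexTerms;
        this.tisOffsets = tisOffsets;
    }

    // Returns the offset of the last index term <= target,
    // or -1 if target sorts before every index term.
    long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1,
            // so the preceding entry is at insertionPoint - 1.
            pos = -pos - 2;
        }
        return pos < 0 ? -1L : tisOffsets[pos];
    }

    public static void main(String[] args) {
        TermsIndexSketch index = new TermsIndexSketch(
            new String[] {"apple", "mango", "zebra"},
            new long[] {0L, 1024L, 4096L});
        for (String term : new String[] {"aardvark", "banana", "mango", "zzz"}) {
            System.out.println(term + " -> " + index.seekOffset(term));
        }
    }
}
```

A reader would seek the tis file to the returned offset and scan forward from there. Because the entry at a seek point is encoded absolutely rather than as a delta, such a scan would not need any state from earlier entries.
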
>>
>> Next steps:
>> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>   fix any hidden assumptions.
>> * Expose new API out of IndexReader, deprecate old API but emulate
>>   old API on top of new one, switch all core/contrib users to the
>>   new API.
>> * Maybe switch to AttributeSources as the base class for TermsEnum,
>>   DocsEnum, PostingsEnum -- this would give readers API flexibility
>>   (not just index-file-format flexibility). EG if someone wanted
>>   to store payload at the term-doc level instead of
>>   term-doc-position level, you could just add a new attribute.
>> * Test performance & iterate.
>>
>

-- 
- Mark

http://www.lucidimagination.com