[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794557#action_12794557 ]

Michael McCandless commented on LUCENE-1458:

This issue is continuing under LUCENE-2111.

Further steps towards flexible indexing
---------------------------------------
Key: LUCENE-1458
URL: https://issues.apache.org/jira/browse/LUCENE-1458
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: Flex Branch
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: Flex Branch
Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, UnicodeTestCase.patch

I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though the back-compat tests don't pass due to changes to package-private APIs, plus certain bugs in tests that happened to work (e.g. calling TermPositions.nextPosition() too many times, which the new API asserts against).

[Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]

There's still plenty to do before this is committable! This is a rather large change:

* Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores the term's long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
. On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
. RAM usage when loading the terms dict index is significantly less, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init, too.
. This part is basically done.
* Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format.
. There's nice symmetry now between reading and writing in the codec chain -- the current docs/prox format is captured in:
{code}
FormatPostingsTermsDictWriter/Reader
FormatPostingsDocsWriter/Reader (.frq file)
FormatPostingsPositionsWriter/Reader (.prx file)
{code}
. This part is basically done.
* Introduces a new flex API for iterating through the fields, terms, docs and positions:
{code}
FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
{code}
This replaces TermEnum/TermDocs/TermPositions. SegmentReader emulates the old API on top of the new API to keep back-compat.

Next steps:

* Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
* Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, and switch all core/contrib users to the new API.
* Maybe switch to AttributeSource as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store a payload at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
* Test performance; iterate.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
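The enum chain described above (FieldProducer -> TermsEnum -> DocsEnum) can be sketched as a pair of nested cursors. This is a toy in-memory sketch, not the patch's actual API: the method names (`next`, `docs`, `nextDoc`) and the backing map are assumptions for illustration; the real enums read from the tis/tii/frq files.

```java
// Toy sketch of the flex enum chain: a TermsEnum-style cursor over one
// field's terms, each yielding a DocsEnum-style cursor over its postings.
import java.util.*;

final class FlexSketch {
    // One field's postings: term -> sorted doc ids (illustrative data).
    static final SortedMap<String, int[]> FIELD =
        new TreeMap<>(Map.of("apache", new int[]{0, 3}, "lucene", new int[]{1, 3, 7}));

    /** TermsEnum-style cursor: step through the field's terms in order. */
    static final class TermsEnum {
        private final Iterator<Map.Entry<String, int[]>> it = FIELD.entrySet().iterator();
        private Map.Entry<String, int[]> current;
        String next() {
            current = it.hasNext() ? it.next() : null;
            return current == null ? null : current.getKey();
        }
        DocsEnum docs() { return new DocsEnum(current.getValue()); }
    }

    /** DocsEnum-style cursor: step through one term's doc ids. */
    static final class DocsEnum {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docs;
        private int upto = -1;
        DocsEnum(int[] docs) { this.docs = docs; }
        int nextDoc() { return ++upto < docs.length ? docs[upto] : NO_MORE_DOCS; }
    }

    /** Walk both enums, collecting every (term, doc) pair. */
    static List<String> walk() {
        List<String> out = new ArrayList<>();
        TermsEnum terms = new TermsEnum();
        for (String t = terms.next(); t != null; t = terms.next()) {
            DocsEnum docs = terms.docs();
            for (int d = docs.nextDoc(); d != DocsEnum.NO_MORE_DOCS; d = docs.nextDoc())
                out.add(t + ":" + d);
        }
        return out;
    }
}
```

The point of the layered cursors is that a codec can swap out any single layer (e.g. a different postings decoder behind DocsEnum) without touching the terms dict layer above it.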
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785992#action_12785992 ]

Michael McCandless commented on LUCENE-1458:

Hmm, somehow in the last merge we lost the fixes for LUCENE-1558 (defaulting readOnly=true for IndexReader)... IndexSearcher looks like it didn't lose the change, though.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785998#action_12785998 ]

Mark Miller commented on LUCENE-1458:

It's not surprising - the merge command sucks from what I can tell :) Which is why I had to go line by line a merge or two ago to catch everything that had been dropped. I expected I'd have to do it again, but it's a lot of effort to do every time.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786015#action_12786015 ]

Michael McCandless commented on LUCENE-1458:

Thanks Mark! IndexReader.open looks good now.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785299#action_12785299 ]

Michael McCandless commented on LUCENE-1458:

bq. Interesting ... after many, many runs without seeing that testreopen gc overhead limit exceeded, I just hit it again randomly. Sheesh this one is annoying :)

Oh, I see -- we still need to cut the standard codec's terms dict cache over to use DBLRU instead of LinkedHashMap; that should fix it. And actually, after we do that, we should re-run the perf tests of the MTQs -- LinkedHashMap caused serious GC problems when I was testing automaton query.
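The "double barrel" LRU idea mentioned above can be sketched with two plain hash maps: a hit never mutates shared linked-list state (unlike an access-ordered LinkedHashMap), and eviction happens in bulk by swapping the maps. This is an illustrative sketch, not Lucene's actual class; the names `DoubleBarrelCache`, `get`, and `put` are assumptions.

```java
// Sketch of a double-barrel LRU cache: entries live in a "primary" map;
// when it fills, the maps swap and the old primary becomes a second-chance
// "secondary" barrel. A hit in the secondary barrel re-promotes the entry.
// Entries untouched across two swaps are dropped wholesale, so there is no
// per-access bookkeeping and far less garbage than a LinkedHashMap-based LRU.
import java.util.HashMap;
import java.util.Map;

final class DoubleBarrelCache<K, V> {
    private final int maxPerBarrel;
    private Map<K, V> primary = new HashMap<>();
    private Map<K, V> secondary = new HashMap<>();

    DoubleBarrelCache(int maxPerBarrel) { this.maxPerBarrel = maxPerBarrel; }

    V get(K key) {
        V v = primary.get(key);
        if (v == null) {
            v = secondary.get(key);
            if (v != null) put(key, v);  // promote a recently used entry
        }
        return v;
    }

    void put(K key, V value) {
        if (primary.size() >= maxPerBarrel) {
            // Swap barrels: everything not touched since the last swap is dropped.
            secondary = primary;
            primary = new HashMap<>();
        }
        primary.put(key, value);
    }
}
```

For example, with `maxPerBarrel = 2`, inserting a, b, c swaps {a, b} into the secondary barrel; reading a promotes it back, while b survives only until the next swap.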
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785301#action_12785301 ]

Mark Miller commented on LUCENE-1458:

Cool - was actually thinking about looking if you had done that yet last night (unrelatedly)
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785303#action_12785303 ]

Uwe Schindler commented on LUCENE-1458:

One thing I came across a long time ago, but now with the new API it gets interesting again: DocsEnum should extend DocIdSetIterator; that would make it simpler to use and implement, e.g. in MatchAllDocsQuery's Scorer, FieldCacheRangeFilter, and so on. You could, e.g., write a filter for all documents that simply returns the docs enumeration from IndexReader. So it should be an abstract class that extends DocIdSetIterator. It has the same methods; only some must be renamed a little. The problem is that because Java does not support multiple inheritance, we cannot also extend AttributeSource :-( If DocIdSetIterator were an interface, it would work (this is one of the cases where interfaces for really simple patterns can be used, like iterators).
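The class hierarchy Uwe proposes can be sketched as follows. The Lucene classes are stubbed here with the 2.9-era DocIdSetIterator contract (docID/nextDoc/advance, NO_MORE_DOCS); the `ArrayDocsEnum` implementation and the `freq` method placement are purely illustrative assumptions, not the patch's code.

```java
// Sketch: DocsEnum as an abstract class extending DocIdSetIterator, so any
// postings enumeration can be consumed directly wherever a DocIdSetIterator
// is expected (scorers, filters, ...). Stubbed stand-ins for the real classes.
abstract class DocIdSetIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    abstract int docID();                // current doc, or -1 before iteration
    abstract int nextDoc();              // advance to the next doc
    int advance(int target) {            // naive default: scan forward
        int doc;
        do { doc = nextDoc(); } while (doc < target);
        return doc;
    }
}

abstract class DocsEnum extends DocIdSetIterator {
    abstract int freq();                 // postings additionally expose term freq
}

/** Illustrative DocsEnum over a sorted array of doc ids (freq fixed at 1). */
final class ArrayDocsEnum extends DocsEnum {
    private final int[] docs;
    private int upto = -1;
    ArrayDocsEnum(int[] docs) { this.docs = docs; }
    int docID()   { return upto < 0 ? -1 : upto < docs.length ? docs[upto] : NO_MORE_DOCS; }
    int nextDoc() { return ++upto < docs.length ? docs[upto] : NO_MORE_DOCS; }
    int freq()    { return 1; }
}
```

With this shape, a "match all docs" filter really could just hand back the reader's own postings enumeration, since callers only see the DocIdSetIterator contract. The multiple-inheritance problem Uwe notes remains: this DocsEnum cannot also extend an AttributeSource base class.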
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785305#action_12785305 ] Michael McCandless commented on LUCENE-1458: bq. Cool - was actually thinking about looking if you had done that yet last night (unrelatedly) Feel free to fix it! Further steps towards flexible indexing --- Key: LUCENE-1458 URL: https://issues.apache.org/jira/browse/LUCENE-1458 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-MTQ-BW.patch, LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, UnicodeTestCase.patch I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPostions.nextPosition() too many times, which the new API asserts against). 
-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online. To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org; for additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785308#action_12785308 ] Michael McCandless commented on LUCENE-1458: bq. DocsEnum should extend DocIdSetIterator It'd be great if we could find a way to do this without a big hairball of back compat code ;) They are basically the same, except DocsEnum lets you get freq() for each doc, get the PositionsEnum via positions(), and also provides a bulk read API (w/ default impl).
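The overlap the comment describes can be sketched like this. DocIdSetIteratorSketch mimics the DocIdSetIterator contract (nextDoc() returning NO_MORE_DOCS when exhausted), and the subclass adds the extra accessor named in the comment, freq(). All names here are illustrative stand-ins, not the real classes:

```java
// Sketch (illustrative names): a DocsEnum-like class that *is* a
// DocIdSetIterator, adding a per-doc freq() on top of doc iteration.
abstract class DocIdSetIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    abstract int nextDoc();   // advance to the next doc, or NO_MORE_DOCS
}

// DocsEnum analogue over parallel arrays: same iteration contract,
// plus freq() for the current doc.
class ArrayDocsEnum extends DocIdSetIteratorSketch {
    private final int[] docs, freqs;
    private int upto = -1;

    ArrayDocsEnum(int[] docs, int[] freqs) { this.docs = docs; this.freqs = freqs; }

    @Override int nextDoc() {
        return ++upto < docs.length ? docs[upto] : NO_MORE_DOCS;
    }

    int freq() { return freqs[upto]; }   // term frequency within the current doc
}
```

Because the subclass only adds methods, anything that consumes a plain DocIdSetIterator (filters, scorers) could consume a DocsEnum unchanged, which is the appeal of making the extension explicit from the start.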
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785310#action_12785310 ] Michael McCandless commented on LUCENE-1458: bq. getAttributes() returning it and dynamically instantiating would be an idea. The same applies for TermsEnum, it should be separated for lazy init. That's a good point (avoid cost of creating the AttributeSource) -- that makes complete sense.
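The lazy-init idea under discussion is simple to sketch: allocate the attribute source only on the first call to attributes(), so enums that never touch attributes pay no construction cost. The class names below, including the AttributeSource stand-in, are illustrative assumptions:

```java
// Sketch of lazy init behind attributes(): the (stand-in) attribute
// source is only allocated on first use. Names are illustrative.
class LazyAttributesEnum {
    static int constructions = 0;   // instrumentation for this sketch only

    static class AttributeSourceStandIn {
        AttributeSourceStandIn() { constructions++; }
    }

    private AttributeSourceStandIn attributes;   // null until first requested

    AttributeSourceStandIn attributes() {
        if (attributes == null) {
            attributes = new AttributeSourceStandIn();   // lazy init
        }
        return attributes;
    }
}
```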
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785312#action_12785312 ] Mark Miller commented on LUCENE-1458: RE: the terms cache -- Should we still try to do the reuse stuff (eg reusing the object that is removed, if one is removed), or should we just drop it and use the cache as it is now? Looks like that would be harder to get done now.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785314#action_12785314 ] Uwe Schindler commented on LUCENE-1458: bq. It'd be great if we could find a way to do this without a big hairball of back compat code DocsEnum is a new class, so why not make it a DocIdSetIterator from the beginning? In my opinion, as pointed out above, the AttributeSource stuff should go in as a lazy-init member behind getAttributes() / attributes().
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785356#action_12785356 ] Michael McCandless commented on LUCENE-1458: bq. Should we still try and do the reuse stuff, or should we just drop it and use the cache as it is now? How about starting w/o reuse but leave a TODO saying we could/should investigate?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785360#action_12785360 ] Michael McCandless commented on LUCENE-1458: Patch looks good, Uwe! bq. MatchAllDocsQuery is very simple to implement now as a ConstantScoreQuery on top of a Filter that returns the DocsEnum of the supplied IndexReader as iterator. Really cool. Sweet! Wait, using AllDocsEnum you mean?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785362#action_12785362 ] Michael McCandless commented on LUCENE-1458: bq. How about starting w/o reuse but leave a TODO saying we could/should investigate? Actually, scratch that -- reuse is too hard in DBLRU -- I would say just no reuse now. Trunk doesn't reuse either...
[Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package private APIs on that branch, then fix nightly build to use the tip of that branch?o] There's still plenty to do before this is committable! This is a rather large change: * Switches to a new more efficient terms dict format. This still uses tii/tis files, but the tii only stores term long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas delta. Also, tis/tii are structured by field, so we don't have to record field number in every term. . On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB - 0.64 MB) and tis file is 9% smaller (75.5 MB - 68.5 MB). . RAM usage when loading terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. . This part is basically done. * Introduces modular reader codec that strongly decouples terms dict from docs/positions readers. EG there is no more TermInfo used when reading the new format. . There's nice symmetry now between reading writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done. * Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer - TermsEnum - DocsEnum - PostingsEnum {code} This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. 
* Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, and switch all core/contrib users to the new API.
* Maybe switch to AttributeSource as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). E.g. if someone wanted to store a payload at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
* Test performance, iterate.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
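The space savings described above come from the seek-point trick: offsets are delta-coded between seek points, but written absolutely at each seek point, so the small tii index can jump straight into the middle of the tis file without replaying earlier deltas. The following is a toy sketch of that idea, not Lucene code; the class name, the interval constant, and the plain `long[]` representation are illustrative assumptions.

```java
// Toy sketch (not Lucene code): delta-coding offsets between seek points,
// but storing an absolute value at every seek point, so decoding can start
// cold at any seek point without replaying all earlier deltas.
class SeekPointSketch {
    // Hypothetical interval, in the spirit of Lucene's term index interval.
    static final int INDEX_INTERVAL = 128;

    // Encode: absolute value at each seek point, delta from previous elsewhere.
    static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        long prev = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (i % INDEX_INTERVAL == 0) {
                out[i] = offsets[i];        // absolute at seek point
            } else {
                out[i] = offsets[i] - prev; // small delta otherwise
            }
            prev = offsets[i];
        }
        return out;
    }

    // Decode the value n entries past a seek point, starting with no
    // state from earlier in the stream -- this is what makes seeking cheap.
    static long decodeAt(long[] encoded, int seekPoint, int n) {
        long value = encoded[seekPoint]; // absolute, so we can start here
        for (int i = seekPoint + 1; i <= seekPoint + n; i++) {
            value += encoded[i];
        }
        return value;
    }
}
```

Deltas stay small (and so VInt-encode compactly) while the index only ever needs to point at seek-point entries, which is consistent with the tii shrinking much more than the tis.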
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785365#action_12785365 ]

Uwe Schindler commented on LUCENE-1458:

bq. Sweet! Wait, using AllDocsEnum you mean?

Yes, but this class is package-private and unused! AllTermDocs is used by SegmentReader to support termDocs(null), but not AllDocsEnum. There is no method in IndexReader that returns all docs? The matchAllDocs was just an example; there are more use cases.
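The behavior under discussion -- an enumerator over every non-deleted document id, as SegmentReader's termDocs(null) provides for matchAllDocs-style iteration -- can be sketched without Lucene. This is a toy illustration, not the actual AllDocsEnum/AllTermDocs code; the class name and the BitSet-based deletion set are assumptions for the example.

```java
// Toy sketch (not Lucene's AllDocsEnum): iterate all doc ids in
// [0, maxDoc), skipping any id marked deleted in the bitset.
class AllDocsSketch {
    private final int maxDoc;
    private final java.util.BitSet deleted; // assumed deletion bitset
    private int doc = -1;

    AllDocsSketch(int maxDoc, java.util.BitSet deleted) {
        this.maxDoc = maxDoc;
        this.deleted = deleted;
    }

    /** Advance to the next live doc; return -1 when exhausted. */
    int nextDoc() {
        doc++;
        while (doc < maxDoc && deleted.get(doc)) {
            doc++; // skip deleted docs
        }
        return doc < maxDoc ? doc : -1;
    }
}
```

The point of the comment is that nothing in the public IndexReader API hands back such an enumerator directly; callers reach it only indirectly via termDocs(null).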
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784811#action_12784811 ]

Mark Miller commented on LUCENE-1458:

I've put the merge on hold for a bit - will try and come back to it tonight. I've got to figure out why this BW compat test is failing, and haven't seen an obvious reason yet:
{code}
junit.framework.AssertionFailedError: expected: but was:
at org.apache.lucene.search.TestWildcard.testEmptyTerm(TestWildcard.java:108)
at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:208)
{code}
Pipe in if you know. It's hard to debug or run this test on its own in Eclipse (because of how BW compat tests work), so it's a slow slog to troubleshoot and I haven't had time yet.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784818#action_12784818 ]

Michael McCandless commented on LUCENE-1458:

I think that test failure was from my fix of BooleanQuery to take coord into account in equals/hashCode (LUCENE-2092)? I hit exactly that same failure, and it required a fix on the back-compat branch to just pass true to the new BooleanQuery() constructed just before the assert. Does that explain it?
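The failure mode being described -- a value object gaining a new field in equals()/hashCode(), so previously "equal" instances stop comparing equal -- can be sketched in miniature. This is a hedged illustration of the LUCENE-2092 change, not the real BooleanQuery; the class name, the boolean flag, and the String clause list are stand-ins.

```java
// Toy sketch of the LUCENE-2092-style change: once a coord-disabling flag
// participates in equals()/hashCode(), two queries built with different
// coord settings no longer compare equal -- which is exactly what breaks
// a test comparing against a freshly constructed instance.
class BQSketch {
    final boolean disableCoord;
    final java.util.List<String> clauses = new java.util.ArrayList<>();

    BQSketch(boolean disableCoord) {
        this.disableCoord = disableCoord;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof BQSketch)) return false;
        BQSketch other = (BQSketch) o;
        // The flag now takes part in equality, alongside the clauses.
        return disableCoord == other.disableCoord
            && clauses.equals(other.clauses);
    }

    @Override
    public int hashCode() {
        return java.util.Objects.hash(disableCoord, clauses);
    }
}
```

This is why the back-compat fix was simply to construct the expected query with the same flag value as the actual one.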
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784819#action_12784819 ]

Michael McCandless commented on LUCENE-1458:

And, thanks for taking over on merging trunk down! I'm especially looking forward to getting the faster unit tests (LUCENE-1844).
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784823#action_12784823 ]

Uwe Schindler commented on LUCENE-1458:

I have seen your change in the tests, too. The test just checks that no clauses are generated. In my opinion, it should not compare to an empty BQ instance; instead it should just assert bq.clauses().size()==0.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784825#action_12784825 ]

Michael McCandless commented on LUCENE-1458:

bq. In my opinion, it should not compare to an empty BQ instance, instead just assert bq.clauses().size()==0.

+1, that'd be a good improvement -- I'll do that.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784829#action_12784829 ]

Uwe Schindler commented on LUCENE-1458:

I rewrote it to:
{code}
public void testEmptyTerm() throws IOException {
  RAMDirectory indexStore = getIndexStore(field, new String[]{"nowildcard", "nowildcardx"});
  IndexSearcher searcher = new IndexSearcher(indexStore, true);
  MultiTermQuery wq = new WildcardQuery(new Term(field, ""));
  wq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
  assertMatches(searcher, wq, 0);
  Query q = searcher.rewrite(wq);
  assertTrue(q instanceof BooleanQuery);
  assertEquals(0, ((BooleanQuery) q).clauses().size());
}
{code}
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784873#action_12784873 ] Michael McCandless commented on LUCENE-1458: Looks great -- can/did you commit?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784876#action_12784876 ] Mark Miller commented on LUCENE-1458:
bq. Does that explain it?
That was my initial guess and try - but neither true nor false fixed it. Looks like Uwe's fix will side-step the issue, though? Sounds good to me :)
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784878#action_12784878 ] Uwe Schindler commented on LUCENE-1458: I can do this, but according to Mark, only with a new issue and patch... Just joking :-)
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784903#action_12784903 ] Mark Miller commented on LUCENE-1458: Interesting ... after many, many runs without seeing that testreopen "GC overhead limit exceeded" failure, I just hit it again randomly.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784904#action_12784904 ] Mark Miller commented on LUCENE-1458:
bq. I can do this, but according to Mark, only with a new issue and patch... Just joking
I put it in the BW branch, but not the flex branch yet. Yeah, I'm a hardass, but I'm not in charge - just giving my opinion :) And I like how most things are fairly loose - I just worry about going too far down a road it will be hard to come back from. Usually it's so easy to get consensus that it's easy to ignore it - but I think that's dangerous. And yes, I get that you're just kidding, but for good reason - I don't mean to come off as the abrasive one, but sometimes I think someone has to, and since I'm already in that hole anyway ...
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785052#action_12785052 ] Uwe Schindler commented on LUCENE-1458: I put the better test into trunk/trunk BW. I could also put it into 3.0 and 2.9, but I do not think that is needed :)
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12785058#action_12785058 ] Uwe Schindler commented on LUCENE-1458: Mike: When fixing the NRQ test Mark merged, I found a problem/inconsistency with FilteredTermsEnum: Normal usage of a TermsEnum is that it is positioned on the first term (e.g. after calling getTermsEnum()). Normally you have a do-while loop and call next() at the end, which is fine. Most code using TermsEnums first checks if (term() == null) and then breaks (incorrectly positioned or exhausted TermsEnum). As the call to term() does not check the returned term, it may contain a term that should normally be filtered out. The same happens if you call term() after the enum is exhausted. The FilteredTermsEnum should return null for term() and docFreq() if the enum is empty or exhausted. I have seen that you added empty() to it, but for consistency the FilteredTermsEnum should return null/-1. I fixed the test to check for empty() (sorry for two commits, the assertNull check was wrong; I changed it before committing). Opinions?
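The do-while consumer pattern described in the comment above can be made concrete with a small sketch. FakeFilteredEnum is an invented stand-in (not the real FilteredTermsEnum); it follows the contract being proposed: positioned on the first term after construction, with term() answering null when the enum is empty or exhausted, so the consumer's up-front term() == null check works:

```java
// Invented stand-in for the described enum contract, not actual Lucene code:
// positioned on the first term after construction; term() returns null once
// the enum is empty or exhausted, never a stale or filtered-out term.
public class DoWhilePattern {

    static class FakeFilteredEnum {
        private final String[] terms;
        private int upto = 0;  // already positioned on the first term

        FakeFilteredEnum(String... terms) { this.terms = terms; }

        // Null signals "unpositioned or exhausted".
        String term() { return upto < terms.length ? terms[upto] : null; }

        boolean next() { return ++upto < terms.length; }
    }

    // The typical consumer: check term() == null up front, then a do-while
    // with next() at the end, as described in the comment above.
    static int countTerms(FakeFilteredEnum te) {
        if (te.term() == null) return 0;  // empty or exhausted enum
        int count = 0;
        do {
            count++;
        } while (te.next());
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTerms(new FakeFilteredEnum("a", "b", "c"))); // 3
        System.out.println(countTerms(new FakeFilteredEnum()));              // 0
    }
}
```

If term() instead returned a stale value on an empty enum, the up-front check would miss it and the do-while body would run once on a term that should have been filtered, which is exactly the inconsistency the comment is pointing at.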
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784509#action_12784509 ] Mark Miller commented on LUCENE-1458: - I'm going to commit the latest merge to trunk in a bit. In a recent commit, NumericRangeQuery was changed to return UnsupportedOperationException for getEnum - I think thats going to be a back compat break? For now I've commented out the back compat test and put a nocommit comment: {code} @Override // nocommit: I think this needs to be implemented for back compat? When done, // the back compat test for it in TestNumericRangeQuery32 should be uncommented. protected FilteredTermEnum getEnum(final IndexReader reader) throws IOException { throw new UnsupportedOperationException(not implemented); } {code} I think we need to go back to returning the Enum? But I'm not sure why this change was made, so ... Further steps towards flexible indexing --- Key: LUCENE-1458 URL: https://issues.apache.org/jira/browse/LUCENE-1458 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, UnicodeTestCase.patch I attached a very rough checkpoint of my 
current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (e.g. calling TermPositions.nextPosition() too many times, which the new API asserts against). [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?] There's still plenty to do before this is committable! This is a rather large change: * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores the term's long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term. . On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB to 0.64 MB) and the tis file is 9% smaller (75.5 MB to 68.5 MB). . RAM usage when loading the terms dict index is significantly less, since we only load an array of offsets and an array of Strings (no more TermInfo array). It should be faster to init, too. . This part is basically done. * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. E.g. there is no more TermInfo used when reading the new format. . There's nice symmetry now between reading and writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done. * Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum {code} This replaces TermEnum/Docs/Positions. 
SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. * Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, and switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). E.g. if someone wanted to store payloads at the term-doc level instead of the term-doc-position level, you could just add a new attribute. * Test performance; iterate.
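The field -> term -> doc iteration order that the new FieldProducer/TermsEnum/DocsEnum chain exposes can be sketched with a toy in-memory index. These minimal classes are illustrative stand-ins, not the actual Lucene APIs:

```java
// A toy sketch (not the real Lucene classes) of the nested iteration
// order the flex chain exposes: fields, then each field's terms, then
// each term's doc ids.
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class FlexSketch {
    static String dump(Map<String, SortedMap<String, int[]>> index) {
        StringBuilder out = new StringBuilder();
        // outer loop ~ FieldProducer: enumerate fields in order
        for (Map.Entry<String, SortedMap<String, int[]>> field : index.entrySet()) {
            // middle loop ~ TermsEnum: enumerate the field's terms in order
            for (Map.Entry<String, int[]> term : field.getValue().entrySet()) {
                out.append(field.getKey()).append(':').append(term.getKey()).append(" ->");
                // inner loop ~ DocsEnum: enumerate the term's doc ids
                for (int doc : term.getValue()) {
                    out.append(' ').append(doc);
                }
                out.append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> terms = new TreeMap<>();
        terms.put("apache", new int[] {0, 2});
        terms.put("lucene", new int[] {0, 1, 2});
        Map<String, SortedMap<String, int[]>> index = new TreeMap<>();
        index.put("body", terms);
        System.out.print(dump(index)); // body:apache -> 0 2, body:lucene -> 0 1 2
    }
}
```

The point of the decoupling is that each level of this nesting can come from a different codec implementation.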
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784510#action_12784510 ] Uwe Schindler commented on LUCENE-1458: --- It is not a break: you cannot extend NumericRangeQuery (it's final), so you can never call that method (it's protected). Only if you put your class into the same package, but that's illegal and not backed by backwards compatibility. (I explained that in the commit and Mike already wrote that in the comment.) So please keep the code clean and do not re-add this TE.
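Uwe's point can be illustrated with a hypothetical sketch (the class and methods below are stand-ins, not the real NumericRangeQuery): a protected method on a final class is reachable only from a subclass, which finality forbids, or from the same package, which user code must not occupy - so changing its body cannot break external callers.

```java
// Illustrative stand-in for a final query class whose protected method
// is unreachable from user code: no subclassing (the class is final),
// and user code must not live in the query's package.
final class RangeQueryLike {
    protected Object getEnum() {
        // safe to change: external code has no way to invoke this
        throw new UnsupportedOperationException("not implemented");
    }

    public String rewrite() {
        // the public entry point callers actually use; it never
        // delegates to getEnum() in this sketch
        return "rewritten";
    }
}

class FinalClassDemo {
    public static void main(String[] args) {
        // Only the public API is visible to callers outside the package.
        System.out.println(new RangeQueryLike().rewrite()); // prints "rewritten"
    }
}
```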
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784519#action_12784519 ] Mark Miller commented on LUCENE-1458: - bq. Mike already wrote that in the comment In what comment? It would be helpful to have it in a comment above getEnum. bq. just comment it out in BW branch That's what I'll do. Did the BW branch pass when you did it? If not, it would be helpful to commit that fix too, or call out the break loudly in this thread - it's difficult to keep up on everything and track all of this down for these merges. bq. So please keep the code clean and do not re-add this TE. Oh, I had no plans to do it myself ;) I just commented out the BW compat test and put the comment you see above.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784520#action_12784520 ] Mark Miller commented on LUCENE-1458: - Though I do wonder ... if it's not a break, why do we have the method there throwing UnsupportedOperationException ... why isn't it just removed?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784532#action_12784532 ] Uwe Schindler commented on LUCENE-1458: --- Mark: The updated backwards branch does not pass because of this (I did not update my checkout; the Enum test was added before 3.0). So the test should be commented out there, too (but you said you would do this). Else, I will do it tomorrow; I am tired and would produce too many errors - sorry.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784536#action_12784536 ] Uwe Schindler commented on LUCENE-1458: --- I updated my commit comment above, so it's clear what I have done (copied from the commit log message).
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784537#action_12784537 ] Mark Miller commented on LUCENE-1458: - bq. Else, I will do it tomorrow; I am tired and would produce too many errors - sorry. No problem - I've got it now - just wasn't sure. That's why I brought it up :) bq. It's in the log message, not the comment. Yup - that's fine, no big deal. I was just saying it would be easier on me if there was a comment over it - I've got it now though - I'll just remove that method.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784548#action_12784548 ] Uwe Schindler commented on LUCENE-1458: --- bq. I'll just remove that method. In my opinion the super method should throw UOE. If somebody forgets to override either getTermsEnum() or getEnum(), he will get a good message describing the problem, not just an NPE. The default impl of getTermsEnum() returning null is fine, because rewrite then delegates to getEnum(). If that also returns null, you get an NPE. We had the same problem with Filter.bits() after deprecation in 2.x - it was not solved very well. In the 2.9 TS BW layer / DocIdSetIterator BW layer it was done correctly.
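A minimal sketch of this suggestion (the method names mirror the protected hooks discussed above, but the classes here are hypothetical stand-ins): the deprecated old-API default throws a descriptive UnsupportedOperationException, so a subclass that overrides neither hook fails with a clear message instead of a later NPE.

```java
// Hypothetical sketch: rewrite() prefers the new getTermsEnum() hook and
// falls back to the old getEnum() hook, whose default now fails loudly.
abstract class MtqSketch {
    protected Object getTermsEnum() {
        return null; // new-API default: "not implemented here"
    }

    protected Object getEnum() {
        // old-API default: descriptive failure instead of a later NPE
        throw new UnsupportedOperationException(
            "subclass must override either getTermsEnum() or getEnum()");
    }

    public Object rewrite() {
        Object e = getTermsEnum();
        return e != null ? e : getEnum(); // fall back to the old API
    }
}

class UoeDemo {
    public static void main(String[] args) {
        try {
            new MtqSketch() {}.rewrite(); // overrides neither hook
        } catch (UnsupportedOperationException uoe) {
            System.out.println("got: " + uoe.getMessage());
        }
    }
}
```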
All tests pass, though back compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPostions.nextPosition() too many times, which the new API asserts against). [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package private APIs on that branch, then fix nightly build to use the tip of that branch?o] There's still plenty to do before this is committable! This is a rather large change: * Switches to a new more efficient terms dict format. This still uses tii/tis files, but the tii only stores term long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas delta. Also, tis/tii are structured by field, so we don't have to record field number in every term. . On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB - 0.64 MB) and tis file is 9% smaller (75.5 MB - 68.5 MB). . RAM usage when loading terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. . This part is basically done. * Introduces modular reader codec that strongly decouples terms dict from docs/positions readers. EG there is no more TermInfo used when reading the new format. . There's nice symmetry now between reading writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done. * Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer - TermsEnum - DocsEnum - PostingsEnum {code} This replaces TermEnum/Docs/Positions. 
SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. * Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute. * Test performance iterate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue
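The seek-point scheme above (absolute freq/prox offsets at seek points, deltas in between) can be sketched as follows. This is a minimal illustration, not Lucene's actual code: the class name, the fixed interval, and the List-of-longs representation are all hypothetical; the point is only that a reader can start decoding at any seek point without replaying deltas from the start.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of delta encoding with absolute values at seek points: every
// INTERVAL-th entry stores the raw offset so a reader can seek there
// directly; entries in between store the (smaller) delta from the
// previous offset.
class SeekPointEncoder {
    static final int INTERVAL = 128; // hypothetical index interval

    static List<Long> encode(long[] offsets) {
        List<Long> out = new ArrayList<>();
        long prev = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (i % INTERVAL == 0) {
                out.add(offsets[i]);        // absolute at seek point
            } else {
                out.add(offsets[i] - prev); // delta elsewhere
            }
            prev = offsets[i];
        }
        return out;
    }

    // Decode starting at a seek point (seekPoint % INTERVAL == 0):
    // no prior decoding state is needed.
    static long[] decodeFrom(List<Long> encoded, int seekPoint, int count) {
        long[] out = new long[count];
        long prev = 0;
        for (int i = 0; i < count; i++) {
            long v = encoded.get(seekPoint + i);
            out[i] = ((seekPoint + i) % INTERVAL == 0) ? v : prev + v;
            prev = out[i];
        }
        return out;
    }
}
```

The deltas compress better with vInt-style encoding than absolute longs would, which is why the absolute form is paid for only at seek points.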
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784563#action_12784563 ] Mark Miller commented on LUCENE-1458: - Okay - that sounds like a good idea - I'll leave it for after the merge is done though.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783584#action_12783584 ] Uwe Schindler commented on LUCENE-1458: --- I rewrote the NumericRangeTermsEnum, see revision 885360.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783596#action_12783596 ] Michael McCandless commented on LUCENE-1458: Thanks Uwe!
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783598#action_12783598 ] Michael McCandless commented on LUCENE-1458: {quote} fwiw here is a patch to use the algorithm from the unicode std for utf8 in utf16 sort order. they claim it is fast because there is no conditional branching... who knows {quote} We could try to test to see if we see a difference in practice... For term text without surrogate content, the branch always goes one way, so the CPU ought to predict it well, and it may turn out to be faster using branching. With surrogates, likely the lookup approach is faster, since the branch has a good chance of going either way. However, the lookup approach adds 256 bytes to the CPU's memory cache, which I'm not thrilled about. We have other places that do the same (NORM_TABLE in Similarity, scoreCache in TermScorer) that I think are much more warranted in making the time vs cache-line tradeoff, since they save a decent amount of CPU. Or maybe worrying about cache lines from way up in javaland is just silly ;) I guess at this point I'd lean towards keeping the branch-based comparator.
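The branch-based comparator under discussion can be sketched as follows. This is a self-contained illustration (the class name is hypothetical, not the patch's actual code) of the ICU fix-up it refers to: plain byte-wise comparison of UTF-8 sorts in code point order, which disagrees with UTF-16 order only when one term contains a character in U+E000..U+FFFF (lead byte 0xEE or 0xEF) and the other a supplementary character (lead byte 0xF0..0xF4), because in UTF-16 supplementary characters encode as surrogates 0xD800..0xDFFF and therefore sort *below* U+E000..U+FFFF. Bumping the 0xEE/0xEF leads by 0x10 restores UTF-16 order:

```java
// Compare two UTF-8 byte sequences in UTF-16 code-unit sort order.
// Only when both differing bytes are >= 0xEE can binary order disagree
// with UTF-16 order; shifting lead bytes 0xEE/0xEF (U+E000..U+FFFF)
// above 0xF0..0xF4 (supplementary planes) fixes the disagreement.
final class Utf8AsUtf16 {
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int aByte = a[i] & 0xff;
            int bByte = b[i] & 0xff;
            if (aByte != bByte) {
                if (aByte >= 0xee && bByte >= 0xee) {
                    if ((aByte & 0xfe) == 0xee) aByte += 0x10;
                    if ((bByte & 0xfe) == 0xee) bByte += 0x10;
                }
                return aByte - bByte;
            }
        }
        return a.length - b.length;
    }
}
```

For terms with no characters above U+DFFF the fix-up branch is never taken, which is why the branch predictor argument above applies: on typical text the condition is always false.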
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783654#action_12783654 ] Robert Muir commented on LUCENE-1458: - bq. We could try to test to see if we see a difference in practice... it is also very weird to me that the method you are using is the one being used in ICU... if this one is faster, why isn't ICU using it? it's also sketchy that the table as described in the unicode std doesn't even work as described anyway... so is anyone using it? I like your reasoning, let's leave it alone for now... other things to work on that will surely help.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783471#action_12783471 ] Michael McCandless commented on LUCENE-1458: OK I finally worked out a solution for the UTF16 sort order problem (just committed). I added a TermRef.Comparator class for comparing TermRefs, removed TermRef.compareTo, and fixed all low-level places in Lucene that rely on the sort order of terms to use this new API instead. I changed the Terms/TermsEnum/TermsConsumer API, adding a getTermComparator(), ie, the codec now determines the sort order for terms in each field. For the core codecs (standard, pulsing, intblock) I default to UTF16 sort order, for back compat, but you could easily instantiate one yourself and use a different term sort. I changed TestExternalCodecs to test this new capability, by sorting 2 of its fields in reversed unicode code point order. While this means your codec is now completely free to define the term sort order per field, in general Lucene queries will not behave right if you do this, so it's obviously a very advanced use case. I also changed (yet again!) how DocumentsWriter encodes the term bytes, to record the length (in bytes) of the term up front, followed by the term bytes (vs the trailing 0xff that I had switched to). The length is a 1 or 2 byte vInt, ie if it's < 128 it's 1 byte, else 2 bytes. This approach means the TermRef.Comparator doesn't have to deal with 0xff's (which was messy). I think this also means that, to the flex API, a term is actually opaque -- it's just a series of bytes. It need not be UTF8 bytes. However, all of analysis, and then how TermsHash builds up these byte[]s, and what queries do with these bytes, is clearly still very much Unicode/UTF8. But one could, in theory (I haven't tested this!)
separately use the flex API to build up a segment whose terms are arbitrary byte[]'s, eg maybe you want to use 4 bytes to encode int values, and then interact with those terms at search time using the flex API.
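The 1-or-2-byte length prefix described above can be sketched like this; the class and method names are illustrative, not the patch's actual code. It is a standard vInt (7 data bits per byte, high bit as continuation flag) capped at two bytes, which covers lengths up to 2^14 - 1 and comfortably holds any term length:

```java
// Sketch of a 1-or-2-byte vInt length prefix for term bytes: lengths
// < 0x80 take one byte; otherwise the first byte carries the low 7 bits
// plus the continuation bit, and the second byte carries the rest.
final class TermLengthPrefix {
    // Writes the length at buf[pos]; returns bytes written (1 or 2).
    static int write(byte[] buf, int pos, int len) {
        if (len < 0x80) {
            buf[pos] = (byte) len;
            return 1;
        }
        buf[pos] = (byte) (0x80 | (len & 0x7f));
        buf[pos + 1] = (byte) (len >>> 7);
        return 2;
    }

    // Reads the length back; the first byte's high bit says whether a
    // second byte follows.
    static int read(byte[] buf, int pos) {
        int b = buf[pos] & 0xff;
        if (b < 0x80) {
            return b;
        }
        return (b & 0x7f) | ((buf[pos + 1] & 0xff) << 7);
    }
}
```

Because the length comes first, a comparator can stop at the shorter term's end without scanning for a sentinel, which is the "no more 0xff's" point made above.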
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783475#action_12783475 ] Uwe Schindler commented on LUCENE-1458: --- Hi Mike, I looked into your commit; looks good. You are right with your comment in NRQ, it will only work with UTF-8 or UTF-16. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order. Two things: - The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in the same package, but that's not supported). So the enum with the nocommit mark can be removed. - I changed the logic in the TermEnum in trunk and 3.0 (it no longer works recursively, see LUCENE-2087). We should change this here, too. This also makes the enum simpler (and it looks more like the Automaton one). In trunk/3.0, setEnum() and endEnum() both now throw UOE. I will look into these two changes tomorrow and change the code.
Uwe
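The defensive-default pattern discussed in these comments (throw a descriptive UOE from the base method instead of letting a null flow into an NPE later) can be sketched as follows. The class and method names here are illustrative stand-ins, not Lucene's actual MultiTermQuery signatures:

```java
// A subclass must override at least one of the two accessors. The old-style
// getEnum() defaults to a descriptive UnsupportedOperationException, so a
// subclass that overrides neither fails with a clear message at the call
// site rather than an NPE somewhere downstream.
abstract class RewritableQuery {
    /** Old-style accessor; default deliberately throws a descriptive UOE. */
    protected Object getEnum() {
        throw new UnsupportedOperationException(
            "subclass must override either getEnum() or getTermsEnum()");
    }

    /** New-style accessor; returning null means "fall back to getEnum()". */
    protected Object getTermsEnum() {
        return null;
    }

    final Object rewrite() {
        Object e = getTermsEnum();
        return e != null ? e : getEnum(); // UOE here names the bug directly
    }
}
```

A subclass that overrides only getTermsEnum() works unchanged; one that overrides neither gets the explanatory exception the comments above argue for.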
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783481#action_12783481 ] Robert Muir commented on LUCENE-1458:

bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order.

but isn't this what it does already with the TermsEnum api? the TermRef itself is just byte[], and NRQ precomputes all the TermRefs it needs up front; there is no unicode conversion there.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online. To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
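Robert's point that a TermRef comparison is a raw byte[] comparison (no unicode conversion) can be sketched like this; the class and method names are illustrative, not the actual TermRef API:

```java
// Sketch of a TermRef-style unsigned lexicographic byte[] comparison.
// Terms are compared byte-by-byte as unsigned values, so no String
// conversion is needed anywhere on the hot path.
public class ByteTermCompare {
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF; // unsigned: 0x80..0xFF sort above 0x00..0x7F
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length; // a proper prefix sorts first
    }
}
```

For UTF-8 encoded terms this unsigned byte order coincides with Unicode code point order, which is why the flex branch can compare terms without decoding them.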
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783482#action_12783482 ] Uwe Schindler commented on LUCENE-1458:

Robert: I know; because of that I said it works with the UTF-8/UTF-16 comparator. It would *not* work with a reverse comparator, as Mike uses in the test. With "directly on byte[]" I meant that it would not use chars at all and would directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API were able to handle this, and also the TokenStreams, it would be fine. Only the terms format would change.
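Uwe's distinction between the UTF-8 and UTF-16 comparators matters because comparing UTF-16 code units and comparing UTF-8 bytes can disagree for supplementary characters. A minimal, self-contained illustration (plain JDK, no Lucene code):

```java
import java.nio.charset.StandardCharsets;

// Demonstrates why a UTF-16 (char-by-char) comparator and a binary UTF-8
// comparator can disagree: supplementary characters are encoded as
// surrogate pairs (0xD800..0xDFFF) in UTF-16, which sort *below* high BMP
// characters such as U+FF01, while their UTF-8 bytes sort *above* them.
public class SortOrderDemo {
    public static int utf16Compare(String a, String b) {
        return a.compareTo(b); // String.compareTo compares UTF-16 code units
    }

    public static int utf8Compare(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int cmp = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return x.length - y.length;
    }
}
```

For example, U+FF01 sorts above U+10000 in UTF-16 code-unit order (the surrogate range 0xD800-0xDFFF lies below 0xFF01) but below it in binary UTF-8 order, so the two comparators give opposite answers for that pair.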
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783485#action_12783485 ] Robert Muir commented on LUCENE-1458:

bq. With "directly on byte[]" I meant that it would not use chars at all and would directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API were able to handle this, and also the TokenStreams, it would be fine. Only the terms format would change.

Uwe, it looks like you can do this now (with the exception of tokenstreams). A partial solution which does work with tokenstreams: you could use IndexableBinaryStringTools, which won't change between any unicode sort order (it will not encode into any unicode range where there is a difference between UTF-8/UTF-32 and UTF-16). With this you could just compare bytes also, but you still would not have the full 8 bits per byte.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783488#action_12783488 ] Uwe Schindler commented on LUCENE-1458:

bq. A partial solution which does work with tokenstreams: you could use IndexableBinaryStringTools, which won't change between any unicode sort order (it will not encode into any unicode range where there is a difference between UTF-8/UTF-32 and UTF-16). With this you could just compare bytes also, but you still would not have the full 8 bits per byte.

This would not change anything; it would only make the format incompatible. With 7 bits/char, the current UTF-8 coded index is the smallest possible one. (Even IndexableBinaryStringTools would cost more bytes in the index: if you used 14 of the 16 bits/char, most chars would take 3 bytes in the index because of UTF-8, vs. 2 bytes with the current encoding. Only the char[]/String representation would take less space than currently. See the discussion with Yonik about this and why we have chosen 7 bits/char. Also, en-/decoding is much faster.) For the TokenStreams: the idea is to create an additional Attribute, BinaryTermAttribute, that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte.
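Uwe's size argument can be checked with back-of-the-envelope arithmetic. The model below is an assumption for illustration (1 UTF-8 byte per 7-bit ASCII char, 3 bytes per 14-bit char), not the actual NumericUtils or IndexableBinaryStringTools code:

```java
// Back-of-the-envelope model of the 7 bits/char vs. 14 bits/char tradeoff:
// packing payload bits at 7 bits per char keeps every char in US-ASCII
// (1 UTF-8 byte), while 14 bits per char pushes most chars to >= U+0800
// (3 UTF-8 bytes), so the on-disk UTF-8 term gets *larger* even though
// the char[] representation is shorter.
public class EncodingSize {
    // chars needed to carry `payloadBits` at `bitsPerChar`
    public static int chars(int payloadBits, int bitsPerChar) {
        return (payloadBits + bitsPerChar - 1) / bitsPerChar; // ceiling division
    }

    // UTF-8 bytes on disk for those chars, at a fixed bytes-per-char cost
    public static int utf8Bytes(int payloadBits, int bitsPerChar, int bytesPerChar) {
        return chars(payloadBits, bitsPerChar) * bytesPerChar;
    }
}
```

For a 64-bit long this gives 10 bytes on disk for the 7-bit scheme versus 15 for the 14-bit scheme, matching the comment's conclusion that IndexableBinaryStringTools would cost more in the index despite the shorter char[].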
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783490#action_12783490 ] Uwe Schindler commented on LUCENE-1458:

As the codec is per field, we could also add an Attribute to TokenStream that holds the codec (the default is Standard). The indexer would just use the codec for the field from the TokenStream. NTS would use a NumericCodec (just thinking...) - will go sleeping now.
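The "codec per field" idea could look roughly like the registry below; the Codec interface and lookup are a hypothetical sketch, not the flex-branch API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-field codec selection: the indexer consults a
// registry keyed by field name and falls back to a standard codec when no
// specific one was registered. Names here are illustrative only.
public class PerFieldCodecs {
    interface Codec { String name(); }

    private final Map<String, Codec> byField = new HashMap<>();
    private final Codec standard;

    public PerFieldCodecs(Codec standard) { this.standard = standard; }

    public void register(String field, Codec codec) { byField.put(field, codec); }

    // the indexer would call this when it starts writing a field
    public Codec codecFor(String field) {
        return byField.getOrDefault(field, standard);
    }
}
```

A numeric field could then be registered with a hypothetical numeric codec while every other field keeps the standard one, which is the per-field behavior Uwe is describing.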
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783489#action_12783489 ] Robert Muir commented on LUCENE-1458:

Uwe, you are right that the terms would be larger, but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one is most important to NRQ really. Yeah, I agree that encoding directly to byte[] is the way to go though; this would be nice for collation too...
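Robert's collation aside can be illustrated with the JDK's Collator, which already produces binary-comparable sort keys; this is a plain-JDK sketch of the idea, not Lucene code:

```java
import java.text.Collator;

// Sketch: a CollationKey's byte[] form compares bit-wise in exactly the
// order the Collator compares the source strings, so locale-aware order
// could in principle be baked into opaque byte[] terms at index time.
public class CollationBytes {
    public static byte[] keyBytes(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF); // unsigned byte compare
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

Because the key bytes carry the collation order, a reader comparing raw byte[] terms (as the flex TermRef discussion above describes) would get locale-correct ordering for free.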
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783491#action_12783491 ] Uwe Schindler commented on LUCENE-1458:

bq. Uwe, you are right that the terms would be larger, but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one is most important to NRQ really.

The new TermsEnum directly compares the byte[] arrays. Why should they compare faster when encoded by IndexableBinaryStringTools? Fewer bytes are faster to compare (an optimized native x86/x64 loop does it in very few instructions). It might be faster if we had to decode to char[], but that's not the case (in the flex branch).
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783492#action_12783492 ] Michael McCandless commented on LUCENE-1458:

bq. I changed the logic in the TermEnum in trunk and 3.0 (it no longer works recursively, see LUCENE-2087). We should change this here, too.

Mark has been periodically re-syncing changes down from trunk... we should probably just let this change come in through his process (else I think we'll cause more conflicts).

bq. The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in the same package, but that's not supported). So the enum with the nocommit mark can be removed.

Ahh, excellent. Wanna commit that when you get a chance?

bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order.

That'd be great!

bq. With "directly on byte[]" I meant that it would not use chars at all and would directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API and the TokenStreams could handle this, it would be fine. Only the terms format would change.

Right, this is a change in analysis -> DocumentsWriter -- somehow we have to allow a Token to carry a byte[] that is directly indexed as the opaque term. At search time NRQ is all byte[] already (unlike other queries, which are new-String()'ing for every term on the enum).
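The "full 8 bits per byte" encoding discussed above can be sketched as follows. This is a hypothetical illustration (not Lucene code, and `SortableLongBytes` is a made-up name): flip the sign bit of a long and write it big-endian, so that unsigned lexicographic byte order matches numeric order -- the property a terms dict comparing raw byte[] would need.

```java
public class SortableLongBytes {
    // Encode a long into 8 bytes so that unsigned lexicographic byte order
    // matches numeric order, using all 8 bits of every byte.
    public static byte[] encode(long v) {
        long x = v ^ 0x8000000000000000L; // flip sign bit so negatives sort before positives
        byte[] b = new byte[8];
        for (int i = 0; i < 8; i++) {
            b[i] = (byte) (x >>> (56 - 8 * i)); // big-endian: most significant byte first
        }
        return b;
    }

    // Unsigned lexicographic comparison, as a terms dict would do on raw bytes.
    public static int compareBytes(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return 0;
    }

    public static void main(String[] args) {
        long[] vals = {Long.MIN_VALUE, -42, -1, 0, 1, 42, Long.MAX_VALUE};
        for (int i = 1; i < vals.length; i++) {
            if (compareBytes(encode(vals[i - 1]), encode(vals[i])) >= 0) {
                throw new AssertionError("byte order != numeric order");
            }
        }
        System.out.println("byte order matches numeric order");
    }
}
```

Note the resulting bytes are frequently invalid UTF-8 (e.g. encode(0) starts with 0x80), which is exactly why the index and TokenStream APIs would have to treat such terms as opaque byte[].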
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783493#action_12783493 ] Robert Muir commented on LUCENE-1458:

bq. Why should they compare faster when encoded by IndexableBinaryStringTools?

Because it compares from left to right, so even if the terms are 10x as long, if they differ twice as quickly, isn't that better? I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate encode-byte[]-into-char[] models in Lucene, one that NRQ is using and one that collation is using?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783494#action_12783494 ] Michael McCandless commented on LUCENE-1458:

bq. The idea is to create an additional Attribute: BinaryTermAttribute, that holds a byte[]. If some TokenStream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte. The new AttributeSource API was created just for such customizations (not possible with Token).

This sounds like an interesting approach! We'd have to work out some details... e.g. you presumably can't mix char[] terms and byte[] terms in the same field.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783496#action_12783496 ] Uwe Schindler commented on LUCENE-1458:

bq. because it compares from left to right, so even if the terms are 10x as long, if they differ 2x as quick its better?

It would not compare faster, because in UTF-8 encoding only 7 bits per byte are used for encoding the chars; the 8th bit is just a marker (simply spoken). Whether this marker is always 0 or always 1 makes no difference -- only 7 bits/byte carry data, and in the 3rd byte of a multi-byte sequence even more bits are unused!

bq. I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate encode byte[] into char[] models in lucene, one that NRQ is using, and one that collation is using!?

I do not know who made this IndexableBinaryStringTools encoding, but it would not work for NRQ at all with current trunk (too complicated during indexing and decoding, because for NRQ we also need to decode such char[] very fast to populate the FieldCache). But as discussed with Yonik (I do not know the issue), the ASCII-only encoding should always perform better (but needs more memory in trunk, as char[] is used during indexing -- I think that is why it was added). So the difference is not speed, it's memory consumption.
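A quick illustration of the marker-bit point (a sketch added here, not from the issue): in UTF-8, every continuation byte is fixed to the 10xxxxxx pattern and so carries only 6 data bits, which is why arbitrary full-8-bit binary values cannot be stored as valid UTF-8 text.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Markers {
    public static void main(String[] args) {
        // U+00E9 ('e' with acute) has 8 significant codepoint bits, yet needs
        // two UTF-8 bytes: 110xxxxx 10xxxxxx.
        byte[] b = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(b.length);              // prints 2
        // The second byte is a continuation byte: top two bits are the 10 marker.
        System.out.println((b[1] & 0xC0) == 0x80); // prints true
    }
}
```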
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783499#action_12783499 ] Robert Muir commented on LUCENE-1458:

bq. It would not compare faster because in UTF-8 encoding, only 7 bits are used for encoding the chars

Yeah, you are right -- I don't think it will be faster on average (I was just posing the question because I don't really know NRQ), and you will waste 4 bits at minimum by using the first bit. I am just always trying to improve collation too; that's why I am bugging you. I guess hopefully soon we'll have byte[] and can do it properly, and speed up both.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781859#action_12781859 ] Michael McCandless commented on LUCENE-1458:

{quote} in trunk, things sort in UTF-16 binary order. in branch, things sort in UTF-8 binary order. these are different... {quote}

Ugh! In the back of my mind I almost remembered this... I think this was one reason why I didn't do this back in LUCENE-843 (I think we had discussed this already, then... though maybe I'm suffering from déjà vu). I could swear at one point I had that fixup logic implemented in a UTF-8/16 comparison method...

UTF-8 sort order (what the flex branch has switched to) is true Unicode codepoint sort order, while UTF-16 order is not when there are surrogate pairs as well as high (>= U+E000) Unicode chars. Sigh.

So this is definitely a back-compat problem. And, unfortunately, even if we like the true codepoint sort order, it's not easy to switch to in a back-compat manner, because if we write new segments into an old index, SegmentMerger will be in big trouble when it tries to merge two segments that sorted the terms differently. I would also prefer true codepoint sort order... but we can't break back compat. Though it would be nice to let the codec control the sort order -- e.g. then (I think?) the ICU/CollationKeyFilter workaround wouldn't be needed.

Fortunately the problem is isolated to how we sort the buffered postings when it's time to flush a new segment, so I think with the appropriate fixup logic (e.g. your comment at https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12781746page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781746) when comparing terms in oal.index.TermsHashPerField.comparePostings during that sort, we can get back to UTF-16 sort order.
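The UTF-16 vs. UTF-8 discrepancy described above can be demonstrated directly. This sketch (added for illustration; `compareBytes` is a hypothetical helper, not a Lucene method) compares U+E000 against U+10000: in UTF-16 code-unit order the high char sorts after the surrogate pair, while in UTF-8 byte order (= codepoint order) it sorts before.

```java
import java.nio.charset.StandardCharsets;

public class SortOrderDemo {
    // Unsigned lexicographic byte comparison, as done on raw term bytes.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String high = "\uE000";       // U+E000, just above the surrogate range
        String supp = "\uD800\uDC00"; // U+10000, a surrogate pair in UTF-16

        // UTF-16 code-unit order (String.compareTo): 0xE000 > 0xD800
        System.out.println(high.compareTo(supp) > 0); // prints true

        // UTF-8 byte order equals codepoint order: U+E000 < U+10000
        int utf8 = compareBytes(high.getBytes(StandardCharsets.UTF_8),
                                supp.getBytes(StandardCharsets.UTF_8));
        System.out.println(utf8 < 0); // prints true: the two orders disagree
    }
}
```

Both lines print true, i.e. the two sort orders rank the same pair of terms oppositely, which is exactly what would confuse SegmentMerger if segments mixed the two orders.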
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781874#action_12781874 ] Robert Muir commented on LUCENE-1458:

{quote} Though it would be nice to let the codec control the sort order - eg then (I think?) the ICU/CollationKeyFilter workaround wouldn't be needed. {quote}

I like this idea, by the way -- flexible sorting. Although I like codepoint order better than code-unit order, I hate binary order in general, to be honest. It's nice we have 'indexable'/fast collation right now, but it's maybe not what users expect either (binary keys encoded into text).
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781899#action_12781899 ] Michael McCandless commented on LUCENE-1458: bq. i hate binary order in general to be honest. But binary order in this case is code point order.
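The distinction matters for supplementary characters: Java's String.compareTo compares UTF-16 code units, so a supplementary character (above U+FFFF, stored as a surrogate pair) sorts before U+FFFF, while comparison by code point, or equivalently by unsigned UTF-8 bytes, puts it after. A small standalone demonstration:

```java
import java.nio.charset.StandardCharsets;

// Shows that UTF-16 code-unit order (String.compareTo) differs from code point
// order for supplementary characters, and that unsigned UTF-8 byte order
// agrees with code point order.
public class SortOrderDemo {
    // Unsigned lexicographic comparison of the strings' UTF-8 bytes -- how the
    // terms would compare if stored as UTF-8 in the index.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8), y = b.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) return d;
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x10000)); // surrogate pair D800 DC00
        String bmp = "\uFFFF";                                         // largest BMP code point

        System.out.println(supplementary.compareTo(bmp) < 0);          // true: UTF-16 order (D800 < FFFF)
        System.out.println(supplementary.codePointAt(0) > bmp.codePointAt(0)); // true: code point order disagrees
        System.out.println(compareUtf8(supplementary, bmp) > 0);       // true: UTF-8 bytes match code point order
    }
}
```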
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781923#action_12781923 ] Robert Muir commented on LUCENE-1458: - bq. Ahh, gotchya. Well if we make the sort order pluggable, you could do that... yes, then we could consider getting rid of the Collator/Locale-based range queries / sorts and things like that completely... which have performance problems. You would have a better way to do it... but if you change the sort order, any part of Lucene sensitive to it might break... maybe it's dangerous. Maybe if we do it, it needs to be exposed properly so other components can change their behavior.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781927#action_12781927 ] Michael McCandless commented on LUCENE-1458: Yes, this (customizing comparator for termrefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF16 sort order for back compat.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781935#action_12781935 ] Robert Muir commented on LUCENE-1458: - bq. Yes, this (customizing comparator for termrefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF16 sort order for back compat. Agreed, changing the sort order breaks a lot of things (not just some crazy seeking-around code that I write), i.e. if 'ch' is a character in some collator and sorts before 'c' (completely made up example, there are real ones like this though), then even PrefixQuery itself will fail!
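Robert's 'ch' scenario can be reproduced with plain JDK collections: a prefix scan seeks to the first term >= the prefix and stops at the first term that no longer starts with it, so it silently drops matches once the term dictionary is sorted by a collator that treats a digraph as its own unit (here, modeled loosely on Czech, where 'ch' traditionally sorts after 'h'). The comparator below is a hypothetical stand-in, not a real collator:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Demonstrates why PrefixQuery-style enumeration assumes binary term order:
// with a collator-like order where the digraph "ch" sorts after "h" (loosely
// modeling Czech), a seek-and-scan for prefix "c" misses "chata".
public class PrefixScanDemo {
    // Maps a string to sort units: "ch" becomes a single unit ranked just
    // above 'h'; every other char keeps its natural rank (scaled by 2).
    static int[] units(String s) {
        List<Integer> u = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == 'c' && i + 1 < s.length() && s.charAt(i + 1) == 'h') {
                u.add('h' * 2 + 1); // "ch" sorts between h and i
                i++;
            } else {
                u.add(s.charAt(i) * 2);
            }
        }
        return u.stream().mapToInt(Integer::intValue).toArray();
    }

    static final Comparator<String> CZECH_LIKE =
        Comparator.comparing(PrefixScanDemo::units, Arrays::compare);

    /** Seek to the first term >= prefix, collect while the prefix still matches. */
    static List<String> prefixScan(NavigableSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String t : terms.tailSet(prefix, true)) {
            if (!t.startsWith(prefix)) break; // assumes the prefix block is contiguous!
            hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = List.of("cesta", "chata", "dub");

        NavigableSet<String> binary = new TreeSet<>(words);        // binary (code unit) order
        NavigableSet<String> collated = new TreeSet<>(CZECH_LIKE); // "ch" after "h"
        collated.addAll(words);

        System.out.println(prefixScan(binary, "c"));   // [cesta, chata] -- both found
        System.out.println(prefixScan(collated, "c")); // [cesta] -- "chata" sorts after "dub" and is missed
    }
}
```

The early `break` is the hidden assumption: it is only safe when all terms sharing a prefix form one contiguous block, which binary order guarantees and collated order does not.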
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781938#action_12781938 ] Uwe Schindler commented on LUCENE-1458: --- ...not to talk about TermRangeQueries and NumericRangeQueries. They rely on String.compareTo like the current terms dict.
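That reliance is easy to show in miniature: a range scan over a sorted term dictionary seeks to the first term >= the lower bound and collects until a term exceeds the upper bound, all via String.compareTo. Numeric terms therefore need an encoding whose lexicographic order matches numeric order; the zero-padding below is a deliberately simplified stand-in for NumericRangeQuery's real trie encoding:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// A term-dictionary range query is essentially: seek to the first term >=
// lower, then collect until a term exceeds upper -- all via String.compareTo.
// (NumericRangeQuery's actual trie encoding is more sophisticated;
// zero-padding is a simplified stand-in for illustration.)
public class RangeScanDemo {
    static List<String> rangeScan(NavigableSet<String> terms, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String t : terms.tailSet(lower, true)) { // seek
            if (t.compareTo(upper) > 0) break;        // scan until past upper
            hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Intended numeric range: [9, 100].
        NavigableSet<String> raw = new TreeSet<>(List.of("2", "9", "10", "100"));
        NavigableSet<String> padded = new TreeSet<>(List.of("002", "009", "010", "100"));

        System.out.println(rangeScan(raw, "9", "100"));      // [] -- lexicographically "9" > "100"!
        System.out.println(rangeScan(padded, "009", "100")); // [009, 010, 100]
    }
}
```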
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781947#action_12781947 ] DM Smith commented on LUCENE-1458: -- bq. Yes, this (customizing comparator for termrefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF16 sort order for back compat. For those of us working on texts in all different kinds of languages, it should not be very advanced stuff. It should be stock Lucene. A default UCA comparator would be good. And a way to provide a locale-sensitive UCA comparator would also be good. My use case is that each Lucene index typically has a single language or at least a dominant language. bq. ...not to talk about TermRangeQueries and NumericRangeQueries. They rely on String.compareTo like the current terms dict. I think that String.compareTo works correctly on UCA collation keys.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781953#action_12781953 ] Robert Muir commented on LUCENE-1458: - bq. I think that String.compareTo works correctly on UCA collation keys. No, because UCA collation keys are bytes :) You are right that byte comparison on these keys works, though. But if we change the sort order like this, various components are not looking at keys; instead they are looking at the term text itself. I guess what I am saying is that there are a lot of assumptions in Lucene right now (prefixquery was my example) that look at term text and assume it is sorted in binary order. bq. It should be stock Lucene As much as I agree with you that default UCA should be stock Lucene (with the capability to use an alternate locale or even a tailored collator), this creates some practical problems, as mentioned above. Also the practical problem that collation in the JDK is poop and we would want ICU for good performance...
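The keys-are-bytes point can be shown with the JDK's own Collator: CollationKey.toByteArray() is documented to yield bytes whose bitwise (unsigned) comparison reproduces the collator's order, while String.compareTo on the original term text compares UTF-16 code units and need not agree. A small sketch (JDK collation used here for self-containedness; as noted in the thread, ICU would be the practical choice):

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Shows that a collator's order can be captured as byte[] sort keys: unsigned
// byte-by-byte comparison of the keys matches Collator.compare on the original
// text. String.compareTo on the text itself may give a different sign.
public class CollationKeyDemo {
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);
        String a = "côte", b = "coté"; // accented French words

        CollationKey ka = c.getCollationKey(a), kb = c.getCollationKey(b);
        int byKeyBytes = compareUnsigned(ka.toByteArray(), kb.toByteArray());
        int byCollator = c.compare(a, b);

        // Byte order of the keys agrees with the collator's verdict (guaranteed
        // by the CollationKey contract).
        System.out.println(Integer.signum(byKeyBytes) == Integer.signum(byCollator)); // true
        // String.compareTo compares UTF-16 code units and may disagree:
        System.out.println("compareTo sign: " + Integer.signum(a.compareTo(b))
                + ", collator sign: " + Integer.signum(byCollator));
    }
}
```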
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782015#action_12782015 ] Robert Muir commented on LUCENE-1458: - {quote} So this is definitely a back compat problem. And, unfortunately, even if we like the true codepoint sort order, it's not easy to switch to in a back-compat manner, because if we write new segments into an old index, SegmentMerger will be in big trouble when it tries to merge two segments that had sorted the terms differently. {quote} Mike, I think it goes well beyond this. I think sort order is an exceptional low-level case that can trickle all the way up into the application layer (including user perception itself) and create bugs. Does a non-technical user in Hong Kong know how many code points each ideograph they enter takes? Should they care? They will just not understand if things are in a different order. I think we are stuck with UTF-16 without a huge effort, which would not be worth it in any case.
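The point about users and code points is easy to make concrete: a single ideograph from CJK Extension B is one character to the person typing it, but two Java chars (one surrogate pair) and one code point:

```java
// One user-perceived character can be two UTF-16 chars: U+20000, the first
// CJK Extension B ideograph, is stored in Java as a surrogate pair.
public class CodePointDemo {
    public static void main(String[] args) {
        String ideograph = new String(Character.toChars(0x20000));
        System.out.println(ideograph.length());                               // 2 (UTF-16 code units)
        System.out.println(ideograph.codePointCount(0, ideograph.length()));  // 1 (code point)
    }
}
```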
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781353#action_12781353 ] Michael McCandless commented on LUCENE-1458: bq. how do i seek to U+D866 in the term dictionary? I can do this with trunk... But, that's an unpaired surrogate? Ie, not a valid unicode character? It's nice that the current API lets you seek based on an unpaired surrogate, but that's not valid use of the API, right? I guess if we want we can assert that the incoming TermRef is actually valid unicode...
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781401#action_12781401 ] Michael McCandless commented on LUCENE-1458: bq. perhaps it would help convince you if i instead wrote the code as .terms(鬅.charAt(0)); I realize a java String can easily contain an unpaired surrogate (eg, your test case) since it operates in code units not code points, but, that's not valid unicode, right? I mean you can't in general send such a string off to a library that works w/ unicode (like Lucene) and expect the behavior to be well defined. Yes, it's neat that Lucene allows that today, but I don't see that it's supposed to. When we encounter an unpaired surrogate during indexing, we replace it w/ the replacement char. Why shouldn't we do the same when searching/reading the index? What should we do during searching if the unpaired surrogate is inside the string (not at the end)? Why should that be different? bq. Please read Ch2 and 3 of the unicode standard if you want to do this. Doesn't this apply here? In 3.2 Conformance (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) is this first requirement (C1): * A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. bq. I hope you can start to see how many east asian applications will break because of this. But how would a search application based on an east asian language actually create such a term? In what situation would an unpaired surrogate find its way down to TermEnum? Eg when users enter searches, they enter whole unicode chars (code points) at once (not code units / unpaired surrogates)? I realize an app could programmatically construct eg a PrefixQuery that has an unpaired surrogate... but couldn't they just as easily pair it up before sending it to Lucene? bq. i have applications that will break because of this. OK, can you shed some more light on how/when your apps do this? 
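The assertion Mike proposes (reject a TermRef whose String contains unpaired surrogates) amounts to a well-formedness scan over the UTF-16 code units. A sketch of such a check; the helper name is hypothetical, this is not an actual TermRef method:

```java
public class SurrogateCheck {
    // Returns true if s contains no unpaired surrogate code units, i.e. it
    // can be losslessly encoded to well-formed UTF-8.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high (lead) surrogate must be immediately followed by a low one.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the paired low surrogate
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }
}
```

Under this check, "\uD866" (Robert's seek example) and "\uD800".charAt-style fragments fail, while any string built from whole code points passes.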
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781420#action_12781420 ] Robert Muir commented on LUCENE-1458: - {quote} I realize a java String can easily contain an unpaired surrogate (eg, your test case) since it operates in code units not code points, but, that's not valid unicode, right? {quote} It is valid unicode. It is a valid Unicode String. This is different from a Term stored in the index, which will be stored as UTF-8, and thus purports to be in a valid unicode encoding form. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89. D89: Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. • For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, 004D D800 and DF02 004D, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string 004D D800 DF02 004D, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is. {quote} But how would a search application based on an east asian language actually create such a term? In what situation would an unpaired surrogate find its way down to TermEnum? {quote} I gave an example already, where they use FuzzyQuery with, say, a prefix of one. With the current code, even in the flex branch, this will create a lead surrogate prefix.
There is code in the lucene core that does things like this (which I plan to fix, and also try to preserve back compat!). This makes it impossible to preserve back compat. There is also probably a lot of non-lucene east asian code that does similar things. For example, someone with data from Hong Kong almost certainly encounters suppl. characters, because they are part of Big5-HKSCS. They may not be smart enough to know about this situation, i.e. they might take a string, substring(0, 1), and do a prefix query. Right now this will work! This is part of the idea that for most operations (such as prefix), in java, supplementary characters work rather transparently. If we do this, upgrading lucene to support unicode 4.0 will be significantly more difficult. bq. OK, can you shed some more light on how/when your apps do this? Yes, see LUCENE-1606. This library uses UTF-16 intervals for transitions, which works fine because for its matching purposes, this is transparent. So there is no need for it to be aware of suppl. characters. If we make this change, I will need to refactor/rewrite a lot of this code, most likely the underlying DFA library itself. This is working in production for me, on chinese text outside of the BMP, with lucene right now. With this change, it will no longer work, and the enumerator will most likely go into an infinite loop! The main difference here is semantics: before, IndexReader.terms() accepted as input any Unicode String. Now it would tighten that restriction to only any interchangeable UTF-8 string. Yet the input being used will not be stored as UTF-8 anywhere, and most certainly will not be interchanged! The paper i sent on UTF-16 mentions problems like this, because it's very reasonable and handy to use code units for processing, since suppl. characters are so rare.
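The D89 concatenation example Robert quotes can be checked directly in Java: each half is ill-formed UTF-16 on its own, yet the concatenation pairs D800 with DF02 into the single code point U+10302.

```java
public class D89Demo {
    public static void main(String[] args) {
        // Each piece is an ill-formed UTF-16 code unit sequence on its own:
        String a = "\u004D\uD800"; // ends with an unpaired high surrogate
        String b = "\uDF02\u004D"; // starts with an unpaired low surrogate

        // The concatenation is well-formed: D800 DF02 pair up to U+10302.
        String s = a + b;
        System.out.println(Character.isSurrogatePair(s.charAt(1), s.charAt(2))); // true
        System.out.println(Integer.toHexString(s.codePointAt(1)));               // 10302
    }
}
```

This is exactly the substring(0, 1)-then-prefix-query scenario: a perfectly legal Java String operation produces a fragment that is not interchangeable UTF-8 but still combines meaningfully.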
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781582#action_12781582 ] Michael McCandless commented on LUCENE-1458: bq. if the term ends with a lead surrogate, tack on \uDC00 to emulate the old behavior. OK, I think this is a good approach, in the emulate-old-on-flex layer, and then in the docs for TermRef call out that the incoming String cannot contain unpaired surrogates? Can you commit this, along with your test? Thanks!
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781589#action_12781589 ] Robert Muir commented on LUCENE-1458: - bq. OK I think this is a good approach, in the emulate-old-on-flex layer, and then in the docs for TermRef call out that the incoming String cannot contain unpaired surrogates? Just so you know, it's not perfect back compat though. For perfect back compat I would have to iterate through the string looking for unpaired surrogates, at which point you truncate after, and tack on \uDC00 if it's a high surrogate. If it's an unpaired low surrogate, I am not actually sure what the old API would do? My guess would be to replace with U+F000, but it depends how this was being handled before. The joys of UTF-16 vs UTF-8 binary order... I didn't do any of this, because in my opinion fixing just the trailing lead surrogate case is all we should worry about, especially since the lucene core itself does this. I'll commit the patch and test; we can improve it in the future if you are worried about these corner-corner-corner cases, no problem.
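The trailing-lead-surrogate fix-up being committed can be sketched as follows (the helper name is hypothetical; the actual patch lives in the back-compat emulation layer):

```java
public class SeekFixup {
    // If the seek term ends in an unpaired lead (high) surrogate, append
    // \uDC00, the smallest low surrogate. The result is well-formed UTF-16,
    // and it is the smallest term whose UTF-16 prefix matches, so seeking
    // to it emulates where the old code-unit-order API would have landed.
    static String fixTrailingLeadSurrogate(String term) {
        int n = term.length();
        if (n > 0 && Character.isHighSurrogate(term.charAt(n - 1))) {
            return term + '\uDC00';
        }
        return term;
    }
}
```

As Robert notes, this only handles the trailing case; an unpaired surrogate in the middle of the string, or a trailing unpaired low surrogate, is left as-is.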
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781603#action_12781603 ] Robert Muir commented on LUCENE-1458: - The patch and test are in revision 883485. I added some javadocs to TermRef where it takes a String constructor as well.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781617#action_12781617 ] Robert Muir commented on LUCENE-1458: - Mike, what to do about MultiTermQueries now? They still have some problems, especially with regards to doing 'startsWith' on some constant prefix, which might be an unpaired lead surrogate (a lucene problem). I guess we need to specialize this case in their FilteredTermEnum (not TermsEnum), and if they are doing this stupid behavior, return null from getTermsEnum() and force it to the old TermEnum, which has some back-compat shims for this case?
[Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package private APIs on that branch, then fix nightly build to use the tip of that branch?o] There's still plenty to do before this is committable! This is a rather large change: * Switches to a new more efficient terms dict format. This still uses tii/tis files, but the tii only stores term long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas delta. Also, tis/tii are structured by field, so we don't have to record field number in every term. . On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB - 0.64 MB) and tis file is 9% smaller (75.5 MB - 68.5 MB). . RAM usage when loading terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. . This part is basically done. * Introduces modular reader codec that strongly decouples terms dict from docs/positions readers. EG there is no more TermInfo used when reading the new format. . There's nice symmetry now between reading writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done. * Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer - TermsEnum - DocsEnum - PostingsEnum {code} This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. 
* Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute. * Test performance iterate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781624#action_12781624 ] Robert Muir commented on LUCENE-1458: - Also, I am curious in general whether we support any old index formats that might contain unpaired surrogates or \uFFFF in the term text. This will be good to know when trying to fix Unicode 4 issues, especially if we are doing things like compareTo() or startsWith() on the raw bytes.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781636#action_12781636 ] Yonik Seeley commented on LUCENE-1458: -- In general, I think things like unpaired surrogates should be undefined, giving us more room to optimize.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781635#action_12781635 ] Michael McCandless commented on LUCENE-1458: LUCENE-510 (fixed in the 2.4 release) cut new indexes over to UTF-8. Before 2.4, here's what IndexOutput.writeChars looked like: {code} public void writeChars(String s, int start, int length) throws IOException {
  final int end = start + length;
  for (int i = start; i < end; i++) {
    final int code = (int)s.charAt(i);
    if (code >= 0x01 && code <= 0x7F)
      writeByte((byte)code);
    else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
      writeByte((byte)(0xC0 | (code >> 6)));
      writeByte((byte)(0x80 | (code & 0x3F)));
    } else {
      writeByte((byte)(0xE0 | (code >> 12)));
      writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
      writeByte((byte)(0x80 | (code & 0x3F)));
    }
  }
} {code} which I think can represent unpaired surrogates and \uFFFF just fine?
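The question above can be checked with a small standalone sketch of the same branch logic (the class and method names here are illustrative, not Lucene's): an unpaired lead surrogate such as \uD800 falls into the three-byte branch, so the pre-2.4 format could store it, just as CESU-8-style bytes rather than valid standard UTF-8.

```java
// Standalone sketch (not Lucene code) of the pre-2.4 writeChars branches,
// applied to a single UTF-16 code unit.
public class OldWriteCharsSketch {
    public static byte[] encodeChar(char c) {
        int code = c;
        if (code >= 0x01 && code <= 0x7F)
            // one-byte branch: ASCII except NUL
            return new byte[] { (byte) code };
        else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0)
            // two-byte branch: U+0080..U+07FF, plus NUL encoded as C0 80
            return new byte[] {
                (byte) (0xC0 | (code >> 6)),
                (byte) (0x80 | (code & 0x3F)) };
        else
            // three-byte branch: everything else, INCLUDING unpaired surrogates
            return new byte[] {
                (byte) (0xE0 | (code >> 12)),
                (byte) (0x80 | ((code >> 6) & 0x3F)),
                (byte) (0x80 | (code & 0x3F)) };
    }

    public static void main(String[] args) {
        // An unpaired lead surrogate is representable: it takes the 3-byte path.
        byte[] b = encodeChar('\uD800');
        System.out.printf("%02X %02X %02X%n",
            b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF); // prints: ED A0 80
    }
}
```

The resulting ED A0 80 sequence is exactly why such terms are representable pre-2.4 but become ill-formed once the index stores real UTF-8.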
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781638#action_12781638 ] Michael McCandless commented on LUCENE-1458: Also, on the flex branch I believe \uFFFF is no longer reserved by Lucene, but we should not advertise that! Terms data is stored in DocumentsWriter as UTF-8 bytes, and I use the 0xFF byte (an invalid UTF-8 byte) as the end marker.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781677#action_12781677 ] Michael McCandless commented on LUCENE-1458: {quote} the patch and test are in revision 883485. I added some javadocs to TermRef where it takes a String constructor as well. {quote} Thanks Robert! {quote} Mike, what to do about MultiTermQueries now? they still have some problems, especially with regards to doing 'startsWith' some constant prefix, which might be unpaired lead surrogate (lucene problem) {quote} Maybe open a new issue for this? Or don't we already have an issue open to fix how various queries handle surrogates? Or I guess we could fix such queries to pair up the surrogate (add \uDC00)?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781683#action_12781683 ] Robert Muir commented on LUCENE-1458: - bq. In general, I think things like unpaired surrogates should be undefined, giving us more room to optimize. This is not an option, I feel, when Lucene is the one creating the problem (i.e. our MultiTermQueries that are unaware of UTF-32 boundaries).
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781689#action_12781689 ] Robert Muir commented on LUCENE-1458: - bq. Maybe open a new issue for this? Or, don't we already have an issue open to fix how various queries handle surrogates? Or I guess we could fix such queries to pair up the surrogate (add \uDC00)? Mike, I have an issue open for trunk. But it is not such a problem on trunk, because those queries work as expected in UTF-16 space. The move to byte[] is what really creates the problem: the existing bugs on trunk that happened to work start to fail completely in UTF-8 space. And unfortunately, we can't use the \uDC00 trick for startsWith :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
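For context on why the \uDC00 trick helps range queries but not startsWith: appending \uDC00 to a trailing lead surrogate forms a valid surrogate pair (the smallest supplementary codepoint with that lead), which encodes cleanly to UTF-8 and so can serve as a range bound. But it is a different codepoint, so its UTF-8 bytes are not a byte-prefix of the terms it is meant to match. A minimal sketch, assuming U+29B45 as an illustrative supplementary character whose lead surrogate is U+D866:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Dc00Trick {
    public static void main(String[] args) {
        String term = new String(Character.toChars(0x29B45)); // "\uD866\uDD45"
        String prefix = term.substring(0, 1);                 // unpaired "\uD866"

        // Pairing up with \uDC00 yields valid UTF-16 (codepoint U+29800),
        // which encodes to a real 4-byte UTF-8 sequence -- usable as the
        // lower bound of a term range, since U+29800 sorts at or below every
        // codepoint whose lead surrogate is \uD866.
        byte[] pairedBytes = (prefix + "\uDC00").getBytes(StandardCharsets.UTF_8);
        System.out.println(pairedBytes.length);               // 4

        // But for startsWith it is useless: U+29800 is a different codepoint
        // than U+29B45, and their UTF-8 encodings diverge at the third byte.
        byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(
            Arrays.copyOf(termBytes, pairedBytes.length), pairedBytes)); // false
    }
}
```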
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781720#action_12781720 ] Robert Muir commented on LUCENE-1458: - Mike, if it means anything, I prefer the new behavior... real codepoint order :) But I think this is a compat problem.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781287#action_12781287 ] Robert Muir commented on LUCENE-1458: - Mike, this change to byte[] in TermRef will break backwards compatibility without some special attention paid to the UTF-16 to UTF-8 conversion. Imagine a FuzzyQuery on a string starting with 鬅, with a prefix length of 1. This creates a prefix of U+D866, which is an unpaired lead surrogate. That is perfectly OK, though, because we are not going to write it in UTF-8 form; it is just being used as an intermediary during processing. Before, this would work just fine, because everything was an internal Unicode string, so startsWith() worked. Now it will no longer work, because the prefix must be down-converted to a UTF-8 byte[]. Whether you use getBytes() or UnicodeUtil, the unpaired surrogate will be replaced by U+FFFD, and the same code will not work. The standard provides that this kind of processing is OK for internal Unicode strings; see Chapter 3, D89.
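The scenario above can be reproduced in plain Java. A sketch, assuming U+29B45 as an illustrative supplementary character whose lead surrogate is U+D866; the byte-level helper is a hypothetical stand-in for TermRef.startsWith, not Lucene's actual implementation:

```java
import java.nio.charset.StandardCharsets;

public class UnpairedSurrogateDemo {
    // Hypothetical byte-level prefix check, standing in for TermRef.startsWith.
    static boolean bytesStartWith(byte[] term, byte[] prefix) {
        if (prefix.length > term.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (term[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A supplementary character: two UTF-16 code units (\uD866 \uDD45).
        String term = new String(Character.toChars(0x29B45));
        // A prefix of length 1 splits the pair, leaving an unpaired lead surrogate.
        String prefix = term.substring(0, 1);

        // In UTF-16 space the comparison works fine:
        System.out.println(term.startsWith(prefix));   // true

        // Down-converting to UTF-8, the unpaired surrogate cannot be encoded,
        // so the encoder substitutes a replacement character, and the byte
        // comparison fails:
        byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
        byte[] prefixBytes = prefix.getBytes(StandardCharsets.UTF_8);
        System.out.println(bytesStartWith(termBytes, prefixBytes));  // false
    }
}
```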
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781289#action_12781289 ] Robert Muir commented on LUCENE-1458: - Here is a workaround you will not like: in the impl for FuzzyTermsEnum etc., we must not use TermRef.startsWith in its current state if the prefix ends with an unpaired surrogate. In that case the String must be materialized from the TermRef each time for the comparison. This is an example of where using byte[] will start to make things a bit complicated. It is not really a fault in TermRef; it is due to how the enums are currently implemented. They will either need additional checks, or we will need special Unicode conversion, so that we can use things like TermRef.startsWith safely.
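A sketch of the guard described above, with hypothetical names (this is not Lucene's actual API): detect a prefix ending in an unpaired lead surrogate and fall back to materializing the String, keeping the fast byte comparison for the safe case.

```java
import java.nio.charset.StandardCharsets;

public class SafePrefix {
    // Hypothetical stand-in for a TermRef-style byte prefix check.
    static boolean bytesStartWith(byte[] term, byte[] prefix) {
        if (prefix.length > term.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (term[i] != prefix[i]) return false;
        }
        return true;
    }

    // If the prefix ends with an unpaired lead surrogate, UTF-8 conversion
    // would mangle it, so decode the term bytes back into a String and
    // compare in UTF-16 space instead.
    static boolean safeStartsWith(byte[] termBytes, String prefix) {
        char last = prefix.charAt(prefix.length() - 1);
        if (Character.isHighSurrogate(last)) {
            String term = new String(termBytes, StandardCharsets.UTF_8);
            return term.startsWith(prefix);
        }
        return bytesStartWith(termBytes, prefix.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String term = new String(Character.toChars(0x29B45)); // a supplementary char
        byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
        // The unpaired-surrogate prefix matches again via the fallback:
        System.out.println(safeStartsWith(termBytes, term.substring(0, 1))); // true
        // Ordinary prefixes still take the byte path:
        System.out.println(safeStartsWith("abc".getBytes(StandardCharsets.UTF_8), "ab")); // true
    }
}
```

The cost is exactly what the comment predicts: every comparison against such a prefix pays for a full decode, which is why this workaround is unattractive.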
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778364#action_12778364 ] Mark Miller commented on LUCENE-1458: - I've got a big merge coming. After a recent merge I noticed a bunch of things didn't merge at all, and looking back I saw a few things that didn't merge properly previously either. So I'm working on a file-by-file, line-by-line update that should be ready fairly soon.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778374#action_12778374 ] Uwe Schindler commented on LUCENE-1458: --- If you are merging, you should simply replace the old 2.9 BW branch with the new 3.0 one I recently created for trunk.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778393#action_12778393 ] Mark Miller commented on LUCENE-1458: - Simply? :) What about the part where I have to merge the flexible indexing backward-compat changes into the new branch, after first figuring out what those changes are :) Okay, it's not unsimple, but this backward branch stuff is my least favorite part.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778548#action_12778548 ] Mark Miller commented on LUCENE-1458: - Merged up. I've got to say, that was a nasty one. I think things are more in sync now than they were, though.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778586#action_12778586 ] Michael McCandless commented on LUCENE-1458:

Thanks Mark! Hopefully, once 3.0 is out the door, the merging becomes a little less crazy. I was dreading carrying this through 3.0 and I'm very glad you stepped in ;)

Further steps towards flexible indexing --- Key: LUCENE-1458 URL: https://issues.apache.org/jira/browse/LUCENE-1458 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2

I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs, plus certain bugs in tests that happened to work (e.g. calling TermPositions.nextPosition() too many times, which the new API asserts against).

[Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]

There's still plenty to do before this is committable! This is a rather large change:

* Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores the term plus its long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
** On the first 1M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
** RAM usage when loading the terms dict index is significantly lower, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
** This part is basically done.
* Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. E.g., there is no more TermInfo used when reading the new format.
** There's nice symmetry now between reading and writing in the codec chain -- the current docs/prox format is captured in:
{code}
FormatPostingsTermsDictWriter/Reader
FormatPostingsDocsWriter/Reader (.frq file)
FormatPostingsPositionsWriter/Reader (.prx file)
{code}
** This part is basically done.
* Introduces a new flex API for iterating through the fields, terms, docs and positions:
{code}
FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
{code}
This replaces TermEnum/TermDocs/TermPositions. SegmentReader emulates the old API on top of the new API to keep back-compat.

Next steps:

* Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
* Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, and switch all core/contrib users to the new API.
* Maybe switch to AttributeSource as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). E.g., if someone wanted to store a payload at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
* Test performance; iterate.

-- This message is automatically generated by JIRA.
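The flex chain above (FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum) is a set of nested cursors, one per level, much like the old TermEnum/TermDocs nesting. As a rough sketch of that nesting over a plain in-memory map -- the names and signatures here are stand-ins for illustration, not the actual flex API:

```java
import java.util.Map;
import java.util.TreeMap;

public class FlexEnumSketch {
    // Stand-in for the field -> term -> doc cursor nesting; the real
    // FieldProducer/TermsEnum/DocsEnum interfaces differ in detail.
    static int countPostings(Map<String, Map<String, int[]>> index) {
        int postings = 0;
        for (Map.Entry<String, Map<String, int[]>> field : index.entrySet()) {   // "FieldProducer" level
            for (Map.Entry<String, int[]> term : field.getValue().entrySet()) {  // "TermsEnum" level
                for (int doc : term.getValue()) {                                // "DocsEnum" level
                    postings++;
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Map<String, int[]>> index = new TreeMap<>();
        Map<String, int[]> terms = new TreeMap<>();
        terms.put("index", new int[]{1});
        terms.put("lucene", new int[]{0, 3});
        index.put("body", terms);
        System.out.println(countPostings(index)); // prints 3
    }
}
```

The point of the layering is that each level can be swapped independently by a codec -- a terms dict implementation need not know how docs/positions are stored.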
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
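Encoding freq/prox offsets absolutely at seek points (rather than as deltas) means a reader can jump straight to any seek point without replaying the preceding entries. A toy demonstration of the trade-off in plain Java -- `SeekPointDemo` and its arrays are invented for illustration, not Lucene's actual encoder:

```java
public class SeekPointDemo {
    // Delta encoding: each entry stores the difference from the previous
    // offset, so recovering offset k requires summing all earlier deltas.
    static long offsetFromDeltas(long[] deltas, int k) {
        long sum = 0;
        for (int i = 0; i <= k; i++) sum += deltas[i];
        return sum;
    }

    public static void main(String[] args) {
        long[] absolute = {0, 120, 310, 700, 1450};   // .frq offsets at seek points
        long[] deltas = new long[absolute.length];
        for (int i = 0; i < absolute.length; i++)
            deltas[i] = absolute[i] - (i == 0 ? 0 : absolute[i - 1]);

        // Absolute encoding: one array read per seek.
        System.out.println(absolute[3]);                 // prints 700
        // Delta encoding: must replay deltas 0..3 for the same answer.
        System.out.println(offsetFromDeltas(deltas, 3)); // prints 700
    }
}
```

Absolute values compress slightly worse than deltas, which is why the issue confines them to seek points: random access where it matters, deltas everywhere else.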
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774933#action_12774933 ] Michael McCandless commented on LUCENE-1458:

bq. I'll write the flexible indexing stuff, and you start doing the hard tasks

Don't you just have to press one button in your IDE? ;)

bq. I still get OOM's on the reopen test every so often. Many times I don't, then sometimes I do.

Hmm... I'll try to dig. This is with the standard codec, or, e.g., pulsing or intblock?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774957#action_12774957 ] Mark Miller commented on LUCENE-1458:

bq. Don't you just have to press one button in your IDE?

Ouch - that's like claiming all it takes to drive a Porsche Carrera GT is pushing the accelerator :)

bq. Hmm... I'll try to dig. This is with the standard codec, or, eg pulsing or intblock?

I'm talking standard - sep and pulsing def blow up - they still need some work in that regard - but you have gotten standard pretty darn close - it usually doesn't blow - but sometimes it still seems to (I guess depending on random factors in the test). intblock is still cacheless, so I don't think it ever blows.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775017#action_12775017 ] Michael McCandless commented on LUCENE-1458:

I removed all the if (Codec.DEBUG) lines in a local checkout and re-ran sortBench.py -- looks like flex is pretty close to trunk now (on OpenSolaris, Java 1.5, at least):

JAVA: java version 1.5.0_19 Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02) Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)
OS: SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris
Index /x/lucene/wiki.baseline.nd5M already exists...
Index /x/lucene/wiki.flex.nd5M already exists...

||Query||Deletes %||Tot hits||QPS old||QPS new||Pct change||
|body:[tec TO tet]|0.0|1934684|2.95|4.04|{color:green}36.9%{color}|
|body:[tec TO tet]|0.1|1932754|2.86|3.73|{color:green}30.4%{color}|
|body:[tec TO tet]|1.0|1915224|2.88|3.69|{color:green}28.1%{color}|
|body:[tec TO tet]|10|1741255|2.86|3.74|{color:green}30.8%{color}|
|real*|0.0|389378|26.85|28.74|{color:green}7.0%{color}|
|real*|0.1|389005|25.83|26.96|{color:green}4.4%{color}|
|real*|1.0|385434|25.55|27.15|{color:green}6.3%{color}|
|real*|10|350404|25.38|28.10|{color:green}10.7%{color}|
|1|0.0|1170209|21.75|21.80|{color:green}0.2%{color}|
|1|0.1|1169068|20.39|22.02|{color:green}8.0%{color}|
|1|1.0|1158528|20.35|21.88|{color:green}7.5%{color}|
|1|10|1053269|20.48|21.96|{color:green}7.2%{color}|
|2|0.0|1088727|23.37|23.42|{color:green}0.2%{color}|
|2|0.1|1087700|21.61|23.49|{color:green}8.7%{color}|
|2|1.0|1077788|21.85|23.46|{color:green}7.4%{color}|
|2|10|980068|21.93|23.66|{color:green}7.9%{color}|
|+1 +2|0.0|700793|7.29|7.32|{color:green}0.4%{color}|
|+1 +2|0.1|700137|6.58|6.70|{color:green}1.8%{color}|
|+1 +2|1.0|693756|6.60|6.68|{color:green}1.2%{color}|
|+1 +2|10|630953|6.73|6.92|{color:green}2.8%{color}|
|+1 -2|0.0|469416|8.07|7.69|{color:red}-4.7%{color}|
|+1 -2|0.1|468931|7.02|7.46|{color:green}6.3%{color}|
|+1 -2|1.0|464772|7.31|7.12|{color:red}-2.6%{color}|
|+1 -2|10|422316|7.28|7.60|{color:green}4.4%{color}|
|1 2 3 -4|0.0|1104704|4.83|4.52|{color:red}-6.4%{color}|
|1 2 3 -4|0.1|1103583|4.73|4.48|{color:red}-5.3%{color}|
|1 2 3 -4|1.0|1093634|4.75|4.46|{color:red}-6.1%{color}|
|1 2 3 -4|10|994046|4.87|4.65|{color:red}-4.5%{color}|
|world economy|0.0|985|19.50|20.11|{color:green}3.1%{color}|
|world economy|0.1|984|18.65|19.76|{color:green}6.0%{color}|
|world economy|1.0|970|19.56|18.71|{color:red}-4.3%{color}|
|world economy|10|884|19.58|20.19|{color:green}3.1%{color}|
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774767#action_12774767 ] Michael McCandless commented on LUCENE-1458:

Committed fixes addressing silly slowness. You also need the LUCENE-2044 patch, until we sync up with trunk again, to run sortBench.py.

Part of the slowness was from MTQ queries incorrectly running the TermsEnum to exhaustion, instead of stopping when they hit their upperTerm. But another part of the slowness was because sortBench.py was actually incorrectly testing the flex branch against a trunk index. This is definitely something we have to test (it's what people will see when they use flex to search existing indexes -- the flex API emulated on the current index format), so we'll have to address that slowness as well, but for now I want to test pure flex (the flex API on a flex index).
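The MTQ fix described above -- stopping the enum at upperTerm instead of running it to exhaustion -- amounts to a bounds check in the term loop. A sketch over a plain sorted term list, using the `[tec TO tet]` range from the benchmarks (`TermRangeScan` is invented for illustration; it is not the actual MultiTermQuery code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TermRangeScan {
    // Collect terms in [lowerTerm, upperTerm]; the early break is the fix --
    // without it the loop would scan every remaining term in the field.
    static List<String> collect(List<String> sortedTerms, String lowerTerm, String upperTerm) {
        List<String> hits = new ArrayList<>();
        int start = Collections.binarySearch(sortedTerms, lowerTerm);
        if (start < 0) start = -start - 1;            // seek to first term >= lowerTerm
        for (int i = start; i < sortedTerms.size(); i++) {
            String t = sortedTerms.get(i);
            if (t.compareTo(upperTerm) > 0) break;    // past upperTerm: stop, don't exhaust
            hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("tea", "tec", "ted", "ten", "tet", "tex", "the");
        System.out.println(collect(terms, "tec", "tet")); // prints [tec, ted, ten, tet]
    }
}
```

On a terms dict with millions of entries the difference between breaking at the upper bound and scanning to the end of the field dominates range-query cost, which matches the `body:[tec TO tet]` speedups reported after the commit.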
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774768#action_12774768 ] Michael McCandless commented on LUCENE-1458:

OK new numbers after the above commits:

JAVA: java version 1.5.0_19 Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02) Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)
OS: SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris

||Query||Deletes %||Tot hits||QPS old||QPS new||Pct change||
|body:[tec TO tet]|0.0|1934684|3.13|3.96|{color:green}26.5%{color}|
|body:[tec TO tet]|0.1|1932754|2.98|3.62|{color:green}21.5%{color}|
|body:[tec TO tet]|1.0|1915224|2.97|3.62|{color:green}21.9%{color}|
|body:[tec TO tet]|10|1741255|2.96|3.61|{color:green}22.0%{color}|
|real*|0.0|389378|27.80|28.73|{color:green}3.3%{color}|
|real*|0.1|389005|26.74|28.93|{color:green}8.2%{color}|
|real*|1.0|385434|26.61|29.04|{color:green}9.1%{color}|
|real*|10|350404|26.32|29.29|{color:green}11.3%{color}|
|1|0.0|1170209|21.81|22.27|{color:green}2.1%{color}|
|1|0.1|1169068|20.41|21.47|{color:green}5.2%{color}|
|1|1.0|1158528|20.42|21.41|{color:green}4.8%{color}|
|1|10|1053269|20.52|21.39|{color:green}4.2%{color}|
|2|0.0|1088727|23.29|23.86|{color:green}2.4%{color}|
|2|0.1|1087700|21.67|22.92|{color:green}5.8%{color}|
|2|1.0|1077788|21.77|22.80|{color:green}4.7%{color}|
|2|10|980068|21.90|23.04|{color:green}5.2%{color}|
|+1 +2|0.0|700793|7.25|6.65|{color:red}-8.3%{color}|
|+1 +2|0.1|700137|6.58|6.33|{color:red}-3.8%{color}|
|+1 +2|1.0|693756|6.50|6.32|{color:red}-2.8%{color}|
|+1 +2|10|630953|6.73|6.37|{color:red}-5.3%{color}|
|+1 -2|0.0|469416|8.11|7.27|{color:red}-10.4%{color}|
|+1 -2|0.1|468931|7.02|6.61|{color:red}-5.8%{color}|
|+1 -2|1.0|464772|7.27|6.75|{color:red}-7.2%{color}|
|+1 -2|10|422316|7.28|6.99|{color:red}-4.0%{color}|
|1 2 3 -4|0.0|1104704|4.80|4.46|{color:red}-7.1%{color}|
|1 2 3 -4|0.1|1103583|4.74|4.40|{color:red}-7.2%{color}|
|1 2 3 -4|1.0|1093634|4.72|4.45|{color:red}-5.7%{color}|
|1 2 3 -4|10|994046|4.79|4.63|{color:red}-3.3%{color}|
|world economy|0.0|985|19.43|16.79|{color:red}-13.6%{color}|
|world economy|0.1|984|18.71|16.59|{color:red}-11.3%{color}|
|world economy|1.0|970|19.65|16.86|{color:red}-14.2%{color}|
|world economy|10|884|19.69|17.25|{color:red}-12.4%{color}|

The term range and prefix queries are now a bit faster; boolean queries are somewhat slower; the phrase query shows the biggest slowdown...
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774785#action_12774785 ] Mark Miller commented on LUCENE-1458:

I'll merge up when I figure out how - merge does not like the restoration of RussianLowerCaseFilter or the move of PatternAnalyzer. Not really sure why not yet. I'll try and play with it tonight.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774787#action_12774787 ] Michael McCandless commented on LUCENE-1458:

Yikes! That sounds challenging.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774848#action_12774848 ] Mark Miller commented on LUCENE-1458: - Indeed - the merging has been quite challenging - it's a bit unfair really - one of these days we will have to switch - I'll write the flexible indexing stuff, and you start doing the hard tasks ;) I'll commit the merge in a bit when the tests finish - might not get to the back-compat branch, if it's needed, till tomorrow night though.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774849#action_12774849 ] Mark Miller commented on LUCENE-1458: - I still get OOMs on the reopen test every so often. Many times I don't, then sometimes I do.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774666#action_12774666 ] Michael McCandless commented on LUCENE-1458: I just committed contrib/benchmark/sortBench.py on the branch, to run perf tests comparing trunk to flex. You have to apply the patches from LUCENE-2042 and LUCENE-2043 (until we resync the branch). First edit TRUNK_DIR and FLEX_DIR up top, and WIKI_FILE (it requires a Wikipedia export -- all tests run against it), then run with -run XXX to test performance. It first creates the 5M doc index, for trunk and for flex, with multiple commit points holding a higher percentage of deletions (0%, 0.1%, 1%, 10%), and then tests the speed of various queries against it. I also fixed a bug in the standard codec's terms index reader.
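The timing loop a script like sortBench.py performs can be sketched as follows. This is only an illustrative harness, not the script itself -- the `measure_qps` name and the stand-in workload are hypothetical:

```python
import time

def measure_qps(run_query, num_iters=50):
    """Time a no-arg query callable and return queries per second."""
    start = time.perf_counter()
    for _ in range(num_iters):
        run_query()
    elapsed = time.perf_counter() - start
    return num_iters / elapsed

# Stand-in workload; a real run would execute a Lucene search here.
qps = measure_qps(lambda: sum(range(10000)))
assert qps > 0
```

Running the same harness against the trunk and flex indexes, at each deletion percentage, yields the QPS-old/QPS-new pairs reported in the results tables.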
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774692#action_12774692 ] Michael McCandless commented on LUCENE-1458: Initial results. Performance is quite catastrophically bad for the MultiTermQueries! Something silly must be up.
JAVA: java version 1.5.0_19 Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02) Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)
OS: SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris
||Query||Deletes %||Tot hits||QPS old||QPS new||Pct change||
|body:[tec TO tet]|0.0|body:[tec TO tet]|3.06|0.23|{color:red}-92.5%{color}|
|body:[tec TO tet]|0.1|body:[tec TO tet]|2.87|0.22|{color:red}-92.3%{color}|
|body:[tec TO tet]|1.0|body:[tec TO tet]|2.85|0.22|{color:red}-92.3%{color}|
|body:[tec TO tet]|10|body:[tec TO tet]|2.83|0.23|{color:red}-91.9%{color}|
|1|0.0|1|22.15|23.87|{color:green}7.8%{color}|
|1|0.1|1|19.89|21.72|{color:green}9.2%{color}|
|1|1.0|1|19.47|21.55|{color:green}10.7%{color}|
|1|10|1|19.82|21.13|{color:green}6.6%{color}|
|2|0.0|2|23.54|25.97|{color:green}10.3%{color}|
|2|0.1|2|21.12|23.56|{color:green}11.6%{color}|
|2|1.0|2|21.37|23.27|{color:green}8.9%{color}|
|2|10|2|21.55|23.10|{color:green}7.2%{color}|
|+1 +2|0.0|+1 +2|7.13|6.97|{color:red}-2.2%{color}|
|+1 +2|0.1|+1 +2|6.40|6.77|{color:green}5.8%{color}|
|+1 +2|1.0|+1 +2|6.41|6.64|{color:green}3.6%{color}|
|+1 +2|10|+1 +2|6.65|6.98|{color:green}5.0%{color}|
|+1 -2|0.0|+1 -2|7.78|7.95|{color:green}2.2%{color}|
|+1 -2|0.1|+1 -2|7.11|7.31|{color:green}2.8%{color}|
|+1 -2|1.0|+1 -2|7.18|7.27|{color:green}1.3%{color}|
|+1 -2|10|+1 -2|7.11|7.70|{color:green}8.3%{color}|
|1 2 3 -4|0.0|1 2 3 -4|5.03|4.91|{color:red}-2.4%{color}|
|1 2 3 -4|0.1|1 2 3 -4|4.62|4.39|{color:red}-5.0%{color}|
|1 2 3 -4|1.0|1 2 3 -4|4.72|4.67|{color:red}-1.1%{color}|
|1 2 3 -4|10|1 2 3 -4|4.78|4.74|{color:red}-0.8%{color}|
|real*|0.0|real*|28.40|0.19|{color:red}-99.3%{color}|
|real*|0.1|real*|26.23|0.20|{color:red}-99.2%{color}|
|real*|1.0|real*|26.04|0.20|{color:red}-99.2%{color}|
|real*|10|real*|26.83|0.20|{color:red}-99.3%{color}|
|world economy|0.0|world economy|18.82|17.83|{color:red}-5.3%{color}|
|world economy|0.1|world economy|18.64|17.99|{color:red}-3.5%{color}|
|world economy|1.0|world economy|18.97|18.35|{color:red}-3.3%{color}|
|world economy|10|world economy|19.59|18.12|{color:red}-7.5%{color}|
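The Pct change column in the results above is the relative QPS delta between trunk and flex. A minimal sketch of the arithmetic (the `pct_change` helper is hypothetical), checked against the first range-query row:

```python
def pct_change(qps_old, qps_new):
    """Relative speedup (positive) or slowdown (negative), in percent."""
    return (qps_new - qps_old) / qps_old * 100.0

# First row of the table: trunk 3.06 QPS vs flex 0.23 QPS
print(round(pct_change(3.06, 0.23), 1))  # → -92.5
```

So the -92.5% entries mean the flex branch answered that range query at roughly 1/13 of trunk's rate, which is why something "silly" was suspected rather than an inherent cost of the new format.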
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12767068#action_12767068 ] Mark Miller commented on LUCENE-1458: - Nice! Sep and Pulsing still need to be trimmed down though - or we decide their bloat is acceptable (they still don't pass). Sep especially should be pretty trimmable, I think. Pulsing is more of an issue because of the Document caching...