[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-20 Thread Uwe Schindler (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219458#comment-14219458 ]

Uwe Schindler commented on LUCENE-6065:
---

Hi,
I like the whole idea. One problem I see on my first review is 
FilterLeafReader2:

Currently all methods in LeafReader2 are final, and the ones to actually 
implement are protected - which is fine, I like this very much!!! The 
filter/delegator pattern used by FilterLeafReader2 can only filter on the 
protected methods, because all others are final. This may make it impossible 
to have a subclass of FilterLeafReader2 outside the oal.index package, because 
it may not be able to delegate to the protected methods... I am not sure if 
this is really a problem here, but we had similar issues around MTQ and its 
rewrite methods in the past. I think the filtering still works, because we 
never delegate to other classes, only to other instances... We should 
investigate with a simple test, because these are the kinds of issues that 
prevent users from doing the right thing, simply because we never test with 
foreign packages.

But in fact: I would really like to have the LeafReader2 impl methods 
protected!!!
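
To make the visibility concern concrete, here is a minimal sketch; the class, 
package and method names are hypothetical stand-ins, not the actual patch. It 
shows why a filter subclass in a foreign package cannot delegate to a protected 
method through a reference of the base type:

{code}
// File 1: the base class, in the Lucene index package (stand-in names).
package org.apache.lucene.index;

public abstract class LeafReader2Sketch {
  // protected "impl" method, as in the proposal
  protected abstract String getNormsInfo();
  // public search-oriented method: final, defers to the protected one
  public final String normsInfo() { return getNormsInfo(); }
}

// File 2: a filter subclass in a foreign (user) package.
package com.example.filters;

import org.apache.lucene.index.LeafReader2Sketch;

public class MyFilterReader extends LeafReader2Sketch {
  private final LeafReader2Sketch in;  // the wrapped reader

  public MyFilterReader(LeafReader2Sketch in) { this.in = in; }

  @Override
  protected String getNormsInfo() {
    // Does NOT compile: outside the declaring package, a protected member can
    // only be accessed through a reference whose type is the accessing class
    // (or a subclass of it), so javac rejects 'in.getNormsInfo()' here.
    return in.getNormsInfo();
  }
}
{code}

If FilterLeafReader2 itself lives in oal.index and performs this delegation, 
foreign subclasses never have to call the protected accessors on the wrapped 
instance themselves, which matches the observation above that the filtering 
still works.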

 remove foreign readers from merge, fix LeafReader instead.
 

 Key: LUCENE-6065
 URL: https://issues.apache.org/jira/browse/LUCENE-6065
 Project: Lucene - Core
  Issue Type: Task
Reporter: Robert Muir
 Attachments: LUCENE-6065.patch


 Currently, SegmentMerger supports two classes of citizens being merged:
 # SegmentReader
 # foreign reader (e.g. some FilterReader)
 It does an instanceof check and executes the merge differently. In the 
 SegmentReader case, stored fields and term vectors are bulk-merged, norms and 
 docvalues are transferred directly without piling up on the heap, CRC32 
 verification runs with IO locality of the data being merged, etc. Otherwise, 
 we treat it as a foreign reader and it's slow.
 This is just the low level; it gets worse as you wrap with more stuff. A 
 great example is SortingMergePolicy: not only does it have the low-level 
 slowdowns listed above, it will e.g. cache/pile up OrdinalMaps for all string 
 docvalues fields being merged, and other silliness that just makes matters 
 worse.
 Another use case is 5.0 users wishing to upgrade from fieldcache to 
 docvalues. This should be possible to implement with a simple incremental 
 transition based on a mergepolicy that uses UninvertingReader. But we 
 shouldn't populate internal fieldcache entries unnecessarily on merge and 
 spike RAM until all those segment cores are released, and other things like 
 bulk merge of stored fields and not piling up norms should still work: it's 
 completely unrelated.
 There are more problems we can fix if we clean this up: checkindex/checkreader 
 can run efficiently without needing to spike RAM the way merging does, we can 
 remove the checkIntegrity() method completely from LeafReader since it can 
 always be accomplished on the producers, etc. In general it would be nice to 
 have just one codepath for merging that is as efficient as we can make it, and 
 to support things like index modifications during merge.
 I spent a few weeks writing 3 different implementations to fix this 
 (interface, optional abstract class, fix LeafReader), and the latter is the 
 only one I don't completely hate: I think our APIs should be efficient for 
 indexing as well as search.
 So the proposal is simple: refactor LeafReader to just require the producer 
 APIs as abstract methods (and FilterReaders should work on that). The 
 search-oriented APIs can just be final methods that defer to those.
 So we would add 5 abstract methods, but implement 10 current methods as final 
 based on those, and then merging would always be efficient.
 {code}
   // new abstract codec-based apis

   /**
    * Expert: retrieve thread-private TermVectorsReader
    * @throws AlreadyClosedException if this reader is closed
    * @lucene.internal
    */
   protected abstract TermVectorsReader getTermVectorsReader();

   /**
    * Expert: retrieve thread-private StoredFieldsReader
    * @throws AlreadyClosedException if this reader is closed
    * @lucene.internal
    */
   protected abstract StoredFieldsReader getFieldsReader();

   /**
    * Expert: retrieve underlying NormsProducer
    * @throws AlreadyClosedException if this reader is closed
    * @lucene.internal
    */
   protected abstract NormsProducer getNormsReader();

   /**
    * Expert: retrieve underlying DocValuesProducer
    * @throws AlreadyClosedException if this reader is closed
    * @lucene.internal
    */
   protected abstract DocValuesProducer getDocValuesReader();

   /**
    * Expert: retrieve underlying FieldsProducer
    * @throws AlreadyClosedException if this reader is closed
    * @lucene.internal
    */
   protected abstract FieldsProducer getPostingsReader();

   // user/search oriented public apis based on the above
   public final Fields fields();
   public final void document(int, StoredFieldVisitor);
   public final Fields getTermVectors(int);
   public final NumericDocValues getNumericDocValues(String);
   public final Bits getDocsWithField(String);
   public final BinaryDocValues getBinaryDocValues(String);
   public final SortedDocValues getSortedDocValues(String);
   public final SortedNumericDocValues getSortedNumericDocValues(String);
   public final SortedSetDocValues getSortedSetDocValues(String);
   public final NumericDocValues getNormValues(String);
 {code}
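
To make the shape of the proposal more concrete, here is a hedged sketch of how 
one of the ten final search-oriented methods could defer to its producer. The 
body is an assumption for illustration, not the attached patch, and simplifies 
the field/null checks:

{code}
// Sketch only: getNumericDocValues() implemented on top of the protected
// getDocValuesReader() accessor listed above.
public final NumericDocValues getNumericDocValues(String field) throws IOException {
  ensureOpen();  // throws AlreadyClosedException if the reader is closed
  FieldInfo fi = getFieldInfos().fieldInfo(field);
  if (fi == null || fi.getDocValuesType() != DocValuesType.NUMERIC) {
    return null;  // no such field, or it carries no numeric docvalues
  }
  return getDocValuesReader().getNumeric(fi);  // delegate to the codec producer
}
{code}

The other final methods would follow the same pattern against their respective 
producers (stored fields, term vectors, norms, postings).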

[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-20 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219473#comment-14219473 ]

Robert Muir commented on LUCENE-6065:
-

My idea for that was: FilterLeafReader2 would actually implement those as final 
too, and expose explicit methods to wrap, say, the NormsReader. This is so that 
getMergeInstance() will automatically work (fast API by default), and 
subclasses won't have to wrap in two places.

Given that filter readers would use this API, I'm not sure there is a 
visibility problem with 'protected', since they wouldn't mess with those 
methods. I am still investigating; first I am trying to clean up the producer 
APIs themselves (e.g. trying to remove nullness and other things).
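
A hedged sketch of that idea follows; the wrap-hook name and the exact shape 
are assumptions, not the actual patch. FilterLeafReader2 would live in 
oal.index (so it may call the wrapped instance's protected accessors), keep the 
producer accessors final, and route them through overridable wrap methods:

{code}
// Sketch only: FilterLeafReader2 keeps the producer accessors final and
// exposes an explicit wrap hook instead; wrapNormsReader() is an assumed name.
public abstract class FilterLeafReader2 extends LeafReader2 {
  protected final LeafReader2 in;

  protected FilterLeafReader2(LeafReader2 in) {
    this.in = in;
  }

  /** Subclasses override this to filter norms; the default is a pass-through. */
  protected NormsProducer wrapNormsReader(NormsProducer norms) {
    return norms;
  }

  @Override
  protected final NormsProducer getNormsReader() {
    // Both getMergeInstance() and the final search methods see the wrapped
    // producer, so subclasses only have to wrap in one place.
    return wrapNormsReader(in.getNormsReader());
  }

  // getFieldsReader(), getTermVectorsReader(), getDocValuesReader() and
  // getPostingsReader() would get analogous final implementations with their
  // own wrap hooks.
}
{code}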

[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-20 Thread Uwe Schindler (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219513#comment-14219513 ]

Uwe Schindler commented on LUCENE-6065:
---

I agree. Actually you wrap something different than those readers. So maybe 
have some other class that sits at the lower level during merging: one class 
that holds all those FooReader implementations (the index view). On the 
search side, LeafReader is a basic interface without any implementation, so 
maybe let it be a real Java interface implemented by the codec (SegmentReader), 
but never pass LeafReader to the merging API. Making everything that is the 
real LeafReader interface be a final implementation detail is just wrong.

So just have a different type of API behind the scenes when merging that you 
can wrap, and keep LeafReader completely out of merging; just wrap something 
different.
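
Purely as an illustration of this alternative (none of these names exist in the 
patch or in Lucene), a sketch of a separate merge-side holder that bundles the 
codec producers and gets wrapped instead of LeafReader:

{code}
// Hypothetical sketch: a merge-only view, kept out of the public LeafReader
// API; merge-time filtering would wrap this object rather than a reader.
public final class MergeTimeView {
  private final StoredFieldsReader fields;
  private final TermVectorsReader vectors;
  private final NormsProducer norms;
  private final DocValuesProducer docValues;
  private final FieldsProducer postings;

  public MergeTimeView(StoredFieldsReader fields, TermVectorsReader vectors,
                       NormsProducer norms, DocValuesProducer docValues,
                       FieldsProducer postings) {
    this.fields = fields;
    this.vectors = vectors;
    this.norms = norms;
    this.docValues = docValues;
    this.postings = postings;
  }

  /** Returns a copy of this view with the docvalues producer wrapped. */
  public MergeTimeView wrapDocValues(DocValuesProducer wrapped) {
    return new MergeTimeView(fields, vectors, norms, wrapped, postings);
  }

  public StoredFieldsReader fields()    { return fields; }
  public TermVectorsReader vectors()    { return vectors; }
  public NormsProducer norms()          { return norms; }
  public DocValuesProducer docValues()  { return docValues; }
  public FieldsProducer postings()      { return postings; }
}
{code}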

[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-20 Thread Uwe Schindler (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219519#comment-14219519 ]

Uwe Schindler commented on LUCENE-6065:
---

In addition, in Java 8 we have interfaces that can be extended with default 
methods. That's more or less exactly what we have here...
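
As a rough illustration only (not something proposed in this issue), the same 
split expressed as a Java 8 interface with default methods might look like 
this; the names mirror the listing quoted earlier in the thread, but the method 
body is an assumption:

{code}
// Illustration only: the producer accessors stay abstract, while a
// search-oriented method becomes a default method implemented on top of them.
public interface LeafReaderApi {
  DocValuesProducer getDocValuesReader();
  FieldInfos getFieldInfos();

  default NumericDocValues getNumericDocValues(String field) throws IOException {
    FieldInfo fi = getFieldInfos().fieldInfo(field);
    if (fi == null || fi.getDocValuesType() != DocValuesType.NUMERIC) {
      return null;
    }
    return getDocValuesReader().getNumeric(fi);
  }
}
{code}

One caveat: default methods cannot be final and interface members cannot be 
protected, so the "final public API over a protected implementation" property 
discussed above would be lost with this shape.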

[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-20 Thread Uwe Schindler (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219528#comment-14219528 ]

Uwe Schindler commented on LUCENE-6065:
---

Maybe my explanation was a little too complicated, sorry. The main problem I 
have is: _a public search API where all public methods are final and the whole 
implementation is protected_.

[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-19 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218334#comment-14218334 ]

Michael McCandless commented on LUCENE-6065:


+1

[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-19 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218992#comment-14218992 ]

ASF subversion and git services commented on LUCENE-6065:
-

Commit 1640670 from [~rcmuir] in branch 'dev/branches/lucene6065'
[ https://svn.apache.org/r1640670 ]

LUCENE-6065: current state

[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.

2014-11-19 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219071#comment-14219071 ]

ASF subversion and git services commented on LUCENE-6065:
-

Commit 1640691 from [~rcmuir] in branch 'dev/branches/lucene6065'
[ https://svn.apache.org/r1640691 ]

LUCENE-6065: reader.fields() never returns null
