[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219458#comment-14219458 ] Uwe Schindler commented on LUCENE-6065:
---
Hi, I like the whole idea. One problem I see on my first review is FilterLeafReader2: currently all methods in LeafReader2 are final, and the ones to actually implement are protected - which is fine, I like this very much! The filter/delegator pattern used by FilterLeafReader2 can only filter on the protected methods, because all others are final. This may make it impossible to have a subclass of FilterLeafReader2 outside the oal.index package, because it may not be able to delegate to protected methods. I am not sure if this is really a problem here, but we had similar issues around MTQ and its rewrite methods in the past. I think the filtering works, because we never delegate to other classes, only to other instances - but we should investigate with a simple test, because these are the kinds of issues that prevent users from doing the right thing, simply because we never test from foreign packages. But in fact, I would really like to keep the LeafReader2 impl methods protected!

remove foreign readers from merge, fix LeafReader instead.

Key: LUCENE-6065
URL: https://issues.apache.org/jira/browse/LUCENE-6065
Project: Lucene - Core
Issue Type: Task
Reporter: Robert Muir
Attachments: LUCENE-6065.patch

Currently, SegmentMerger has supported two classes of citizens being merged:
# SegmentReader
# foreign reader (e.g. some FilterReader)

It does an instanceof check and executes the merge differently. In the SegmentReader case, stored fields and term vectors are bulk-merged, norms and docvalues are transferred directly without piling up on the heap, CRC32 verification runs with IO locality of the data being merged, and so on. Otherwise, we treat it as a foreign reader and it's slow. And this is just the low level: it gets worse as you wrap with more stuff.
A great example is SortingMergePolicy: not only does it have the low-level slowdowns listed above, it will e.g. cache/pile up OrdinalMaps for all string docvalues fields being merged, and other silliness that just makes matters worse.

Another use case is 5.0 users wishing to upgrade from fieldcache to docvalues. This should be possible to implement as a simple incremental transition based on a merge policy that uses UninvertingReader. But we shouldn't populate internal fieldcache entries unnecessarily on merge and spike RAM until all those segment cores are released; and other things like bulk merge of stored fields and not piling up norms on the heap should still work: it's completely unrelated.

There are more problems we can fix if we clean this up: checkindex/checkreader can run efficiently, where it doesn't need to RAM-spike like merging; we can remove the checkIntegrity() method completely from LeafReader, since it can always be accomplished on the producers; etc. In general it would be nice to have just one codepath for merging that is as efficient as we can make it, and to support things like index modifications during merge.

I spent a few weeks writing 3 different implementations to fix this (interface, optional abstract class, fix LeafReader), and the latter is the only one I don't completely hate: I think our APIs should be efficient for indexing as well as search. So the proposal is simple: refactor LeafReader to require just the producer APIs as abstract methods (and FilterReaders should work on that). The search-oriented APIs can be final methods that defer to those. So we would add 5 abstract methods, implement 10 current methods as final based on those, and then merging would always be efficient.
{code}
// new abstract codec-based apis

/**
 * Expert: retrieve thread-private TermVectorsReader
 * @throws AlreadyClosedException if this reader is closed
 * @lucene.internal
 */
protected abstract TermVectorsReader getTermVectorsReader();

/**
 * Expert: retrieve thread-private StoredFieldsReader
 * @throws AlreadyClosedException if this reader is closed
 * @lucene.internal
 */
protected abstract StoredFieldsReader getFieldsReader();

/**
 * Expert: retrieve underlying NormsProducer
 * @throws AlreadyClosedException if this reader is closed
 * @lucene.internal
 */
protected abstract NormsProducer getNormsReader();

/**
 * Expert: retrieve underlying DocValuesProducer
 * @throws AlreadyClosedException if this reader is closed
 * @lucene.internal
 */
protected abstract DocValuesProducer getDocValuesReader();

/**
 * Expert: retrieve underlying FieldsProducer
 * @throws AlreadyClosedException if this reader is closed
 * @lucene.internal
 */
protected abstract FieldsProducer getPostingsReader();

// user/search oriented public apis based on the above
public final Fields fields();
public final void document(int, StoredFieldVisitor);
public final Fields getTermVectors(int);
public final NumericDocValues getNumericDocValues(String);
public final Bits getDocsWithField(String);
public final BinaryDocValues getBinaryDocValues(String);
public final SortedDocValues getSortedDocValues(String);
public final SortedNumericDocValues getSortedNumericDocValues(String);
public final SortedSetDocValues getSortedSetDocValues(String);
public final NumericDocValues getNormValues(String);
{code}
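Uwe's visibility concern can be sketched in plain Java. The names below (LeafReader2, MyFilterReader) are hypothetical stand-ins modeled on the discussion, with Strings standing in for the producer types; this is not the actual patch. Java only allows a subclass in a foreign package to access a protected inherited member through a reference of its own type, so a filter holding a LeafReader2-typed delegate cannot call its protected impl methods from outside oal.index:

```java
// Hypothetical sketch of the visibility problem. In real code LeafReader2 would
// live in org.apache.lucene.index and MyFilterReader in a user package; here
// everything is one file, which is exactly why the problem can go unnoticed.
abstract class LeafReader2 {
    // codec-level impl method, protected as in the proposal
    protected abstract String getNormsReader();

    // search-oriented API: final, defers to the impl method
    public final String getNormValues() {
        return "norms-from:" + getNormsReader();
    }
}

class MyFilterReader extends LeafReader2 {
    private final LeafReader2 in;

    MyFilterReader(LeafReader2 in) { this.in = in; }

    @Override
    protected String getNormsReader() {
        // return in.getNormsReader();  // would NOT compile from a foreign package:
        // a protected inherited member is only accessible through a reference of
        // the subclass's own type (JLS 6.6.2), not an arbitrary LeafReader2.
        return in.getNormValues();      // only the public final API is reachable,
                                        // which re-applies the "norms-from:" prefix
    }
}

class SegmentReaderStub extends LeafReader2 {
    @Override
    protected String getNormsReader() { return "segment-norms"; }
}
```

Calling `new MyFilterReader(new SegmentReaderStub()).getNormValues()` yields `norms-from:norms-from:segment-norms`: the foreign-package filter is forced through the public API and distorts the result instead of delegating at the codec level.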
[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219473#comment-14219473 ] Robert Muir commented on LUCENE-6065:
---
My idea for that was: FilterLeafReader2 would actually implement those as final too, and expose explicit methods to wrap, say, the NormsReader. This is so that getMergeInstance() will automatically work (fast api by default), and subclasses won't have to wrap in two places. Given that filter readers would use this api, I'm not sure there is a visibility problem with 'protected', since they wouldn't mess with those methods. I am still investigating; first I am trying to clean up the producer apis themselves (e.g. trying to remove nullness and other things).
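Robert's suggestion can be sketched as follows (hypothetical names, Strings standing in for NormsProducer etc.; a sketch of the idea, not the actual patch). FilterLeafReader2 conceptually lives in the same package as LeafReader2, keeps the codec-level getter final, and routes all wrapping through one overridable hook, so any fast path built on these getters picks up the wrapping automatically and subclasses wrap in exactly one place:

```java
// Hypothetical sketch: all classes conceptually in org.apache.lucene.index,
// so cross-instance access to the protected methods is legal.
abstract class LeafReader2 {
    protected abstract String getNormsReader();
    public final String getNormValues() { return getNormsReader(); }
}

abstract class FilterLeafReader2 extends LeafReader2 {
    protected final LeafReader2 in;

    protected FilterLeafReader2(LeafReader2 in) { this.in = in; }

    @Override
    protected final String getNormsReader() {
        // final: the search API and any merge fast-path funnel through the same
        // hook, so subclasses never have to wrap in two places
        return wrapNorms(in.getNormsReader());
    }

    // the single explicit extension point for wrapping norms
    protected String wrapNorms(String norms) { return norms; }
}

class SegmentStub extends LeafReader2 {
    @Override
    protected String getNormsReader() { return "norms"; }
}

class SortingFilter extends FilterLeafReader2 {
    SortingFilter(LeafReader2 in) { super(in); }

    @Override
    protected String wrapNorms(String norms) { return "sorted(" + norms + ")"; }
}
```

With this shape, `new SortingFilter(new SegmentStub()).getNormValues()` returns `sorted(norms)`: the subclass overrode only the hook, yet both the public API and the codec-level getter see the wrapped value.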
[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219513#comment-14219513 ] Uwe Schindler commented on LUCENE-6065:
---
I agree. Actually you wrap something different than those readers. So maybe have some other class that you use on the lower level during merging: one class that holds all those FooReader implementations (the index view). On the searching side, LeafReader is a basic interface without any implementation, so maybe let it be a real Java interface implemented by the codec (SegmentReader) - but you never pass LeafReader to the merging api. Making everything in the real LeafReader interface a final implementation detail is just wrong. So just have a different type of API behind the scenes when merging that you can wrap, and keep LeafReader completely out of merging; wrap something different instead.
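Uwe's alternative could be sketched like this (hypothetical types, Strings standing in for the producers; a sketch of the design direction, not a proposed patch): the search side stays a pure interface, while the merge code receives a separate codec-level bundle that never exposes the search API at all:

```java
import java.util.Iterator;

// Search-facing view: a plain interface, no implementation to fight with.
interface SearchView {
    String getNormValues(String field);
}

// Merge-facing bundle of producers, handed only to the merge code.
final class MergeView {
    final String norms;      // stand-in for NormsProducer
    final String docValues;  // stand-in for DocValuesProducer

    MergeView(String norms, String docValues) {
        this.norms = norms;
        this.docValues = docValues;
    }
}

// The codec implements the search interface and exposes the merge bundle
// separately; merging never sees the SearchView type.
final class SegmentReaderSketch implements SearchView {
    private final MergeView producers;

    SegmentReaderSketch(MergeView producers) { this.producers = producers; }

    @Override
    public String getNormValues(String field) {
        return producers.norms + ":" + field;
    }

    MergeView mergeView() { return producers; }
}
```

The design trade-off versus the final-methods proposal: searching and merging get cleanly separated types, at the cost of a second API surface that filter readers would have to wrap.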
[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219519#comment-14219519 ] Uwe Schindler commented on LUCENE-6065:
---
In addition, in Java 8 we have interfaces that can be extended with default methods. That's somehow exactly what we have here...
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
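The Java 8 default-method shape Uwe mentions can be sketched in a few lines (hypothetical name, a String standing in for the producer; not the actual patch): the search API is written once as default methods deferring to the abstract producer getter, so implementations supply only the getter. Unlike final methods, defaults can still be overridden, which is the trade-off of this variant:

```java
// Hypothetical sketch of the default-method variant of the proposal.
interface LeafReaderApi {
    String getNormsReader(); // sole abstract "producer" method

    // search API as a default method deferring to the producer getter
    default String getNormValues(String field) {
        return getNormsReader() + ":" + field;
    }
}
```

Because there is a single abstract method, a stub implementation is one lambda: `LeafReaderApi r = () -> "norms";` makes `r.getNormValues("f")` yield `norms:f`.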
[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219528#comment-14219528 ] Uwe Schindler commented on LUCENE-6065:
---
Maybe I was a little too complicated in my explanation, sorry. The main problem I have is: _a public search API where all public methods are final and the whole implementation is protected_.
[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218334#comment-14218334 ] Michael McCandless commented on LUCENE-6065:
---
+1
[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218992#comment-14218992 ] ASF subversion and git services commented on LUCENE-6065:
---
Commit 1640670 from [~rcmuir] in branch 'dev/branches/lucene6065' [ https://svn.apache.org/r1640670 ]
LUCENE-6065: current state
[jira] [Commented] (LUCENE-6065) remove foreign readers from merge, fix LeafReader instead.
[ https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219071#comment-14219071 ] ASF subversion and git services commented on LUCENE-6065: --- Commit 1640691 from [~rcmuir] in branch 'dev/branches/lucene6065' [ https://svn.apache.org/r1640691 ] LUCENE-6065: reader.fields() never returns null

A great example there is SortingMergePolicy: not only does it have the low-level slowdowns listed above, it will e.g. cache/pile up OrdinalMaps for all string docvalues fields being merged, and other silliness that just makes matters worse.

Another use case is 5.0 users wishing to upgrade from FieldCache to docvalues. This should be possible to implement with a simple incremental transition based on a MergePolicy that uses UninvertingReader. But we shouldn't populate internal FieldCache entries unnecessarily on merge and spike RAM until all those segment cores are released, and other behavior like bulk merge of stored fields and not piling up norms on the heap should still work: it's completely unrelated.
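The two-citizen merge described in the issue, and the single code path the producer accessors enable, can be contrasted in a tiny self-contained sketch (hypothetical names; the real SegmentMerger is far more involved):

```java
// Hypothetical sketch of SegmentMerger's instanceof split. Once every reader,
// filtered or not, exposes its raw data, the slow generic branch and the
// instanceof check can both go away.

abstract class SketchReader {
    // proposed: every reader can hand out its underlying values
    abstract int[] rawValues();
}

class SketchSegmentReader extends SketchReader {
    final int[] values = {1, 2, 3};

    @Override
    int[] rawValues() {
        return values;
    }
}

class SketchFilterReader extends SketchReader {
    final SketchReader in;

    SketchFilterReader(SketchReader in) {
        this.in = in;
    }

    @Override
    int[] rawValues() {
        return in.rawValues();
    }
}

class SketchMerger {
    // today's style: a fast bulk branch only for the concrete class,
    // and a slower generic branch for every "foreign" reader
    static int mergeWithInstanceof(SketchReader r) {
        int sum = 0;
        if (r instanceof SketchSegmentReader) {
            for (int v : ((SketchSegmentReader) r).values) {
                sum += v; // "bulk" branch
            }
        } else {
            for (int v : r.rawValues()) {
                sum += v; // "foreign reader" branch
            }
        }
        return sum;
    }

    // proposed style: one efficient code path for every reader
    static int mergeUnified(SketchReader r) {
        int sum = 0;
        for (int v : r.rawValues()) {
            sum += v;
        }
        return sum;
    }
}
```

Both paths produce the same result here; the point of the refactoring is that the unified path works identically for a wrapped reader (e.g. a sorting or uninverting filter) without a second, slower branch to maintain.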