[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764205#comment-15764205 ] Ferenczi Jim commented on LUCENE-7579: -- {quote} Maybe you can work on the 6.x back port in the meantime {quote} I am on it! > Sorting on flushed segment > -- > > Key: LUCENE-7579 > URL: https://issues.apache.org/jira/browse/LUCENE-7579 > Project: Lucene - Core > Issue Type: Bug > Reporter: Ferenczi Jim > > Today flushed segments built by an index writer with an index sort specified > are not sorted. The merge is responsible for sorting these segments, > potentially with others that are already sorted (resulting from another > merge). > I'd like to investigate the cost of sorting the segment directly during the > flush. This could make the merge faster, since there are some cheap > optimizations that can be done only if all segments to be merged are sorted. > For instance the merge of the points could use the bulk merge instead of > rebuilding the points from scratch. > I made a small prototype which sorts the segment on flush here: > https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort > The idea is simple: for points, norms, docvalues and terms I use the > SortingLeafReader implementation to translate the values that we have in RAM > into a sorted enumeration for the writers. > For stored fields I use a two-pass scheme where the documents are first > written to disk unsorted and then copied to another file with the correct > sorting. I use the same stored field format for the two steps and just remove > the file produced by the first pass at the end of the process. > This prototype has no implementation for index sorting that uses term vectors > yet. I'll add this later if the tests are good enough. > Speaking of testing, I tried this branch on [~mikemccand]'s benchmark scripts > and compared master with index sorting against my branch with index sorting > on flush. 
I tried with sparsetaxis and wikipedia and the first results are > weird. When I use the SerialScheduler and only one thread to write the docs, > index sorting on flush is slower. But when I use two threads the sorting on > flush is much faster, even with the SerialScheduler. I'll continue to run the > tests in order to be able to share something more meaningful. > The tests are passing except one about concurrent DV updates. I don't know > this part at all so I did not fix the test yet. I don't even know if we can > make it work with index sorting ;). > [~mikemccand] I would love to have your feedback about the prototype. Could > you please take a look? I am sure there are plenty of bugs... but I think > it's a good start to evaluate the feasibility of this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
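The two-pass stored-fields scheme described in the issue can be sketched as follows. This is an illustrative model, not Lucene's actual StoredFieldsWriter API: the class and method names (TwoPassStoredFields, finish) are invented, and an in-memory List stands in for the on-disk file produced by the first pass.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two-pass scheme: pass 1 appends documents in
// arrival order; pass 2 copies them out in index-sort order, after which the
// unsorted pass-1 "file" (modeled here as a List) is discarded.
class TwoPassStoredFields {
    private final List<String> unsortedFile = new ArrayList<>();

    // Pass 1: write each document as it arrives, unsorted.
    void writeDoc(String storedFields) {
        unsortedFile.add(storedFields);
    }

    // Pass 2: newToOld[newDocId] gives the pre-sort doc id to copy from.
    List<String> finish(int[] newToOld) {
        List<String> sortedFile = new ArrayList<>(unsortedFile.size());
        for (int oldDoc : newToOld) {
            sortedFile.add(unsortedFile.get(oldDoc));
        }
        unsortedFile.clear(); // remove the first-pass file at the end
        return sortedFile;
    }

    public static void main(String[] args) {
        TwoPassStoredFields writer = new TwoPassStoredFields();
        writer.writeDoc("doc:b");
        writer.writeDoc("doc:a");
        writer.writeDoc("doc:c");
        // Index sort puts "a" first: new doc 0 reads from old doc 1, etc.
        System.out.println(writer.finish(new int[] {1, 0, 2}));
    }
}
```

Because both passes use the same stored-fields format, the second pass is a straight copy driven by the doc-id mapping, with no re-encoding of field values.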
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764001#comment-15764001 ] Ferenczi Jim commented on LUCENE-7579: -- Thanks [~jpountz] and [~mikemccand]!
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760882#comment-15760882 ] Ferenczi Jim commented on LUCENE-7579: -- I pushed another commit that removes the specialized API for sorting a StoredFieldsWriter. This is now done directly in the StoredFieldsConsumer with a custom CopyVisitor (copied from MergeVisitor). I've also added some asserts that check whether unsorted segments were built by a version prior to Lucene 7.0. We'll need to change the assert when this gets backported to 6.x. I could not add the assert in maybeSortReaders because IndexWriter.addIndexes uses the merge to add indices that could be unsorted. I don't know if this should be allowed or not, but we can revisit this later. Other than that I think it's ready!
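The CopyVisitor idea above can be modeled roughly like this. The interfaces here are invented for illustration; Lucene's real StoredFieldVisitor has typed callbacks (stringField, intField, and so on) rather than a single generic one, and the real writer persists to a Directory rather than a List.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Loose model of the visitor-style copy: the reader "visits" each document's
// stored fields and the visitor forwards every field it sees to the writer.
interface FieldConsumer {
    void field(String name, String value);
}

class CopyVisitor implements FieldConsumer {
    final List<String> out = new ArrayList<>(); // stands in for the sorted writer

    @Override
    public void field(String name, String value) {
        out.add(name + "=" + value); // forward each visited field unchanged
    }
}

class UnsortedReader {
    final Map<String, String>[] docs;

    @SafeVarargs
    UnsortedReader(Map<String, String>... docs) { this.docs = docs; }

    // Replay one document's stored fields through the visitor.
    void visitDocument(int docId, FieldConsumer visitor) {
        docs[docId].forEach(visitor::field);
    }

    public static void main(String[] args) {
        UnsortedReader reader = new UnsortedReader(Map.of("title", "b"), Map.of("title", "a"));
        CopyVisitor visitor = new CopyVisitor();
        for (int oldDoc : new int[] {1, 0}) { // copy in index-sort order
            reader.visitDocument(oldDoc, visitor);
        }
        System.out.println(visitor.out);
    }
}
```

Driving the copy through a visitor keeps the sort-order logic in the consumer, so no public sorting API is needed on the writer itself.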
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753993#comment-15753993 ] Ferenczi Jim commented on LUCENE-7579: -- This new API is maybe a premature optimization that should not be part of this change. What about removing the API and rolling back to a non-optimized copy that "visits" each doc and copies it the way the StoredFieldsReader does? This way the function would be private to the StoredFieldsConsumer. We can still add the optimization you're describing later, but it could be confusing if the writes of the index writer are not compressed the same way as the other stored-fields writes?
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753913#comment-15753913 ] Ferenczi Jim commented on LUCENE-7579: -- {quote} CompressingStoredFieldsWriter.sort should always have a CompressingStoredFieldsReader as an input, since the codec cannot change in the middle of the flush, so I think we should be able to skip the instanceof check? {quote} That's true for the only call we make to this new API, but since it's public it could be called with a different fields reader in another use case? I am not happy that I had to add this new public API in the StoredFieldsReader, but it's the only way to make this optimized for the compressing case. {quote} It would personally help me to have comments eg. in MergeState.maybeSortReaders that the indexSort==null case may only happen for bwc reasons. Maybe we should also assert that if index sorting is configured, then the non-sorted segments can only have 6.2 or 6.3 as a version {quote} Agreed, I'll add an assert for the non-sorted case. I'll also add a comment to make it clear that indexSort==null is handled for BWC reasons in maybeSortReaders. Thanks for having a look [~jpountz]
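The backward-compatibility assert discussed here amounts to a version gate: with an index sort configured, an unsorted segment is only legal if it was written before sorting-on-flush landed. A minimal sketch, with versions modeled as plain (major, minor) pairs; Lucene's real check would go through org.apache.lucene.util.Version, and the class name here is invented.

```java
// Sketch of the BWC gate: if the writer has an index sort and the incoming
// segment is not sorted, it must come from the 6.2/6.3 era (before sorting
// on flush), otherwise something is wrong.
class SortedSegmentCheck {
    static boolean isLegal(boolean indexSortConfigured, boolean segmentSorted,
                           int major, int minor) {
        if (!indexSortConfigured || segmentSorted) {
            return true; // nothing to verify
        }
        // Unsorted segment under an index sort: only 6.2/6.3 segments qualify.
        return major == 6 && (minor == 2 || minor == 3);
    }

    public static void main(String[] args) {
        // A 6.2-era unsorted segment is tolerated; a 7.0 one is not.
        System.out.println(isLegal(true, false, 6, 2) + " " + isLegal(true, false, 7, 0));
    }
}
```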
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750937#comment-15750937 ] Ferenczi Jim commented on LUCENE-7579: -- {quote} We still need to wrap unsorted segments during the merge for BWC so SortingLeafReader should remain. {quote} We can still rewrite it to a SortingCodecReader and remove the SlowCodecReaderWrapper, but that's another issue ;) {quote} I think we should push first to master, and let that bake some, and in the mean time work out the challenging 6.x back port? {quote} Agreed. I'll create a branch for the back port in my repo. {quote} I'll wait a day or so before committing to give others a chance to review; it's a large change. {quote} That's awesome [~mikemccand]! Thanks for the review and testing.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744895#comment-15744895 ] Ferenczi Jim commented on LUCENE-7579: -- I pushed another iteration to https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort I cleaned up the nocommits and added the implementation for sorting term vectors. {quote} Do any of the exceptions tests for IndexWriter get angry? Seems like if we hit an IOException e.g. during the renaming that SortingStoredFieldsConsumer.flush does we may leave undeleted files? Hmm or perhaps IW takes care of that by wrapping the directory itself... {quote} I added an abort method on the StoredFieldsWriter which deletes the remaining temporary files, and did the same for the SortingTermVectorsConsumer. [~mikemccand] can you take a look?
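The abort-on-failure cleanup described above can be sketched with plain java.nio file operations. The class name is invented; Lucene routes deletion through its Directory abstraction and helpers like IOUtils rather than touching the filesystem directly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of the abort() idea: the writer tracks every temporary file it
// created so that, if the flush fails partway (e.g. an IOException during the
// rename), abort() can delete them instead of leaving them behind.
class TempFileTrackingWriter {
    private final List<Path> tempFiles = new ArrayList<>();

    Path createTempFile(String prefix) throws IOException {
        Path p = Files.createTempFile(prefix, ".tmp");
        tempFiles.add(p);
        return p;
    }

    // Called on failure: best-effort deletion of everything we created.
    void abort() {
        for (Path p : tempFiles) {
            try {
                Files.deleteIfExists(p);
            } catch (IOException e) {
                // best effort; swallow, since we are already failing
            }
        }
        tempFiles.clear();
    }

    public static void main(String[] args) throws IOException {
        TempFileTrackingWriter writer = new TempFileTrackingWriter();
        Path tmp = writer.createTempFile("sorted-fields");
        writer.abort(); // simulate a failed flush
        System.out.println(Files.exists(tmp)); // the temp file is gone
    }
}
```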
[jira] [Updated] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7581: - Attachment: LUCENE-7581.patch Here is a patch that fails DV updates on a field involved in the index sort. I also modified TestIndexSorting#testConcurrentDVUpdates, which now tests DV updates that are not involved in the index sort. > IndexWriter#updateDocValues can break index sorting > --- > > Key: LUCENE-7581 > URL: https://issues.apache.org/jira/browse/LUCENE-7581 > Project: Lucene - Core > Issue Type: Bug > Reporter: Ferenczi Jim > Attachments: LUCENE-7581.patch, LUCENE-7581.patch > > > IndexWriter#updateDocValues can break index sorting if it is called on a > field that is used in the index sorting specification. > TestIndexSorting has a test for this case: #testConcurrentDVUpdates, > but only L1 merges are checked. Any LN merge would fail the test because the > inner sort of the segment is not re-computed during/after DV updates.
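The guard the patch adds boils down to rejecting a doc-values update on any field that participates in the index sort, since rewriting those values would silently invalidate the segment's internal order. A minimal sketch; the class and method names are illustrative, not IndexWriter's real signature.

```java
import java.util.Set;

// Sketch of the update guard: updates on fields outside the index sort pass
// through, while an update on a sort field fails fast.
class DocValuesUpdateGuard {
    private final Set<String> sortFields;

    DocValuesUpdateGuard(Set<String> sortFields) { this.sortFields = sortFields; }

    void checkUpdate(String field) {
        if (sortFields.contains(field)) {
            throw new IllegalArgumentException(
                "cannot update doc values on field [" + field + "]: it is used in the index sort");
        }
    }

    public static void main(String[] args) {
        DocValuesUpdateGuard guard = new DocValuesUpdateGuard(Set.of("timestamp"));
        guard.checkUpdate("popularity"); // fine: not part of the sort
        try {
            guard.checkUpdate("timestamp"); // rejected: would break the sort
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```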
[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support
[ https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723586#comment-15723586 ] Ferenczi Jim commented on LUCENE-7575: -- {quote} I was thinking a bit more about the wastefulness of re-creating SpanQueries with a different field that are otherwise identical. Some day we could refactor out from WSTE a Query -> SpanQuery conversion utility that furthermore allows you to re-target the field. With that in place, we could avoid the waste for PhraseQuery and MultiPhraseQuery – the most typical position-sensitive queries. {quote} I agree, I'll work on this shortly. Thanks for the hint ;) > UnifiedHighlighter: add requireFieldMatch=false support > --- > > Key: LUCENE-7575 > URL: https://issues.apache.org/jira/browse/LUCENE-7575 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter > Reporter: David Smiley > Assignee: David Smiley > Fix For: 6.4 > > Attachments: LUCENE-7575.patch, LUCENE-7575.patch, LUCENE-7575.patch > > > The UnifiedHighlighter (like the PostingsHighlighter) only supports > highlighting queries for the same fields that are being highlighted. The > original Highlighter and FVH support loosening this, AKA > requireFieldMatch=false.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723545#comment-15723545 ] Ferenczi Jim commented on LUCENE-7579: -- Thanks Mike, {quote} Can we rename freezed to frozen in BinaryDocValuesWriter? But: why would freezed ever be true when we call flush? Shouldn't it only be called once, even in the sorting case? {quote} This is a leftover that is not needed. The naming was wrong ;) and it's useless, so I removed it. {quote} I also like how you were able to re-use the SortingXXX from SortingLeafReader. Later on we can maybe optimize some of these; e.g. SortingFields and CachedXXXDVs should be able to take advantage of the fact that the things they are sorting are all already in heap (the indexing buffer), the way you did with MutableSortingPointValues (cool). {quote} Totally agree, we can revisit later and see if we can optimize memory. I think it's already an optimization vs master in terms of memory usage, since we only "sort" the segment to be flushed instead of all "unsorted" segments during the merge. {quote} Can we block creating a SortingLeafReader now (make its constructor private)? We only ever use its inner classes now, I think? And it is a dangerous class in the first place... if we can do that, maybe we rename it SortingCodecUtils or something, just for its inner classes. {quote} We still need to wrap unsorted segments during the merge for BWC, so SortingLeafReader should remain. I have no idea when we can remove it, since indices on older versions should still be compatible with this new one? {quote} Do any of the exceptions tests for IndexWriter get angry? Seems like if we hit an IOException e.g. during the renaming that SortingStoredFieldsConsumer.flush does we may leave undeleted files? Hmm or perhaps IW takes care of that by wrapping the directory itself... {quote} Honestly I have no idea. I will dig. 
{quote} Can't you just pass sortMap::newToOld directly (method reference) instead of making the lambda here?: {quote} Indeed, thanks. {quote} I think the 6.x back port here is going to be especially tricky {quote} I bet, but as it is, the main part is done by reusing the SortingLeafReader inner classes that exist in 6.x. I've also removed a nocommit in the AssertingLiveDocsFormat that now checks live docs even when they are sorted.
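The sortMap::newToOld suggestion quoted above is plain Java shorthand: an instance method reference can replace a lambda with the same shape. A toy illustration, where SortMap stands in for the real doc-id mapping (Lucene's Sorter.DocMap exposes a comparable newToOld method):

```java
import java.util.function.IntUnaryOperator;

// Toy stand-in for the doc-id mapping: newToOld(newDoc) returns the pre-sort
// doc id that ends up at position newDoc after sorting.
class SortMap {
    private final int[] newToOld;

    SortMap(int[] newToOld) { this.newToOld = newToOld; }

    int newToOld(int newDoc) { return newToOld[newDoc]; }

    public static void main(String[] args) {
        SortMap sortMap = new SortMap(new int[] {2, 0, 1});
        IntUnaryOperator viaLambda = newDoc -> sortMap.newToOld(newDoc);
        IntUnaryOperator viaRef = sortMap::newToOld; // equivalent, less noise
        System.out.println(viaLambda.applyAsInt(0) == viaRef.applyAsInt(0));
    }
}
```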
[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support
[ https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722580#comment-15722580 ] Ferenczi Jim commented on LUCENE-7575: -- Thanks David!
[jira] [Updated] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support
[ https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7575: - Attachment: LUCENE-7575.patch Thanks David ! Here is a new patch to address your last comments. Now we have a FieldFilteringTermSet and extractTerms uses a simple HashSet. {quote} couldn't defaultFieldMatcher be initialized to non-null to match the same field? Then getFieldMatcher() would simply return it. {quote} Not as a Predicate since the predicate is only on the candidate field name. We could use a BiPredicate to always provide the current field name to the predicate but I find it simpler this way. > UnifiedHighlighter: add requireFieldMatch=false support > --- > > Key: LUCENE-7575 > URL: https://issues.apache.org/jira/browse/LUCENE-7575 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley >Assignee: David Smiley > Attachments: LUCENE-7575.patch, LUCENE-7575.patch, LUCENE-7575.patch > > > The UnifiedHighlighter (like the PostingsHighlighter) only supports > highlighting queries for the same fields that are being highlighted. The > original Highlighter and FVH support loosening this, AKA > requireFieldMatch=false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
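The Predicate-versus-BiPredicate point above can be sketched in plain Java. This is a hypothetical illustration (class and method names are invented, not the actual patch API): because the matcher is built per highlighted field, the "current" field can be captured by a closure, so a `Predicate` over the candidate field name alone is enough.

```java
import java.util.function.Predicate;

// Hypothetical sketch of a field-matching predicate for highlighting.
// The highlighted field is captured at construction time, which is why
// a Predicate<String> over the candidate field name suffices and a
// BiPredicate<String, String> is not needed.
public class FieldMatcherSketch {
    // Default behaviour: extract queries only for the field being highlighted.
    static Predicate<String> defaultFieldMatcher(String fieldToHighlight) {
        return candidateField -> candidateField.equals(fieldToHighlight);
    }

    // requireFieldMatch=false: accept terms from any field.
    static Predicate<String> requireFieldMatchFalse() {
        return candidateField -> true;
    }

    public static void main(String[] args) {
        Predicate<String> strict = defaultFieldMatcher("body");
        System.out.println(strict.test("body"));                     // true
        System.out.println(strict.test("title"));                    // false
        System.out.println(requireFieldMatchFalse().test("title"));  // true
    }
}
```

A `BiPredicate` would only be needed if the same matcher instance had to serve several highlighted fields at once; with one matcher per field, the closure keeps the API simpler.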
[jira] [Commented] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722476#comment-15722476 ] Ferenczi Jim commented on LUCENE-7581: -- [~mikemccand] I think so too. I'll work on a patch. > IndexWriter#updateDocValues can break index sorting > --- > > Key: LUCENE-7581 > URL: https://issues.apache.org/jira/browse/LUCENE-7581 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ferenczi Jim > Attachments: LUCENE-7581.patch > > > IndexWriter#updateDocValues can break index sorting if it is called on a > field that is used in the index sorting specification. > TestIndexSorting has a test for this case: #testConcurrentDVUpdates > but only L1 merge are checked. Any LN merge would fail the test because the > inner sort of the segment is not re-compute during/after DV updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7581: - Attachment: LUCENE-7581.patch I attached a patch that fails the test if a second round of DV updates are run. > IndexWriter#updateDocValues can break index sorting > --- > > Key: LUCENE-7581 > URL: https://issues.apache.org/jira/browse/LUCENE-7581 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ferenczi Jim > Attachments: LUCENE-7581.patch > > > IndexWriter#updateDocValues can break index sorting if it is called on a > field that is used in the index sorting specification. > TestIndexSorting has a test for this case: #testConcurrentDVUpdates > but only L1 merge are checked. Any LN merge would fail the test because the > inner sort of the segment is not re-compute during/after DV updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting
Ferenczi Jim created LUCENE-7581: Summary: IndexWriter#updateDocValues can break index sorting Key: LUCENE-7581 URL: https://issues.apache.org/jira/browse/LUCENE-7581 Project: Lucene - Core Issue Type: Bug Reporter: Ferenczi Jim IndexWriter#updateDocValues can break index sorting if it is called on a field that is used in the index sorting specification. TestIndexSorting has a test for this case: #testConcurrentDVUpdates but only L1 merges are checked. Any LN merge would fail the test because the inner sort of the segment is not re-computed during/after DV updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
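The failure mode described above can be illustrated with plain Java (this is a minimal sketch of the invariant, not Lucene code): a segment's doc order is fixed at flush/merge time, so rewriting a value of the sort field in place leaves the documents out of order.

```java
// Minimal illustration of why in-place doc-values updates break index
// sorting: the document order is frozen when the segment is written, so
// updating a value of the sort field afterwards silently violates the
// "segment is sorted" invariant that later merges rely on.
public class DvUpdateBreaksSort {
    // Checks the invariant that sorted segments must satisfy.
    static boolean isSorted(long[] values) {
        for (int i = 1; i < values.length; i++) {
            if (values[i - 1] > values[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Doc values of the sort field, in the segment's (sorted) doc order.
        long[] sortField = {10, 20, 30, 40};
        System.out.println(isSorted(sortField)); // true

        // An updateDocValues-style rewrite changes the value but cannot
        // move the document, breaking the invariant.
        sortField[1] = 99;
        System.out.println(isSorted(sortField)); // false
    }
}
```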
[jira] [Updated] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support
[ https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7575: - Attachment: LUCENE-7575.patch Thanks [~dsmiley] and [~Timothy055]! I pushed a new patch to address your comments. {quote} it'd be interesting if instead of a simple boolean toggle, if it were a Predicate fieldMatchPredicate so that only some fields could be collected in the query but not all. Just an idea. {quote} I agree, and this is why I changed the patch to include your idea. By default nothing changes: queries are extracted based on the field name to highlight. Though with this change the user can now define which query (based on the field name) should be highlighted. I think it's better like this but I can revert if you think this should not be implemented in the first iteration. I fixed the bugs that David spotted (terms from different fields not sorted after filteredExtractTerms and redundant initialization of the filter leaf reader for the span queries) and split the tests based on the type of query that is tested. > UnifiedHighlighter: add requireFieldMatch=false support > --- > > Key: LUCENE-7575 > URL: https://issues.apache.org/jira/browse/LUCENE-7575 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley >Assignee: David Smiley > Attachments: LUCENE-7575.patch, LUCENE-7575.patch > > > The UnifiedHighlighter (like the PostingsHighlighter) only supports > highlighting queries for the same fields that are being highlighted. The > original Highlighter and FVH support loosening this, AKA > requireFieldMatch=false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712572#comment-15712572 ] Ferenczi Jim commented on LUCENE-7579: -- I ran the test from a clean state and I can see a nice improvement with the sparsetaxis use case. I use https://github.com/mikemccand/luceneutil/blob/master/src/python/sparsetaxis/runBenchmark.py and compare two checkouts of Lucene, one with my branch and the other with master. For the master branch I have: {noformat} 838.0 sec: 20.0 M docs; 23.9 K docs/sec {noformat} ... vs the branch with the flush sort: {noformat} 612.2 sec: 20.0 M docs; 32.7 K docs/sec {noformat} I reproduce the same diff on each run :) > Sorting on flushed segment > -- > > Key: LUCENE-7579 > URL: https://issues.apache.org/jira/browse/LUCENE-7579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Ferenczi Jim > > Today flushed segments built by an index writer with an index sort specified > are not sorted. The merge is responsible of sorting these segments > potentially with others that are already sorted (resulted from another > merge). > I'd like to investigate the cost of sorting the segment directly during the > flush. This could make the merge faster since they are some cheap > optimizations that can be done only if all segments to be merged are sorted. > For instance the merge of the points could use the bulk merge instead of > rebuilding the points from scratch. > I made a small prototype which sort the segment on flush here: > https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort > The idea is simple, for points, norms, docvalues and terms I use the > SortingLeafReader implementation to translate the values that we have in RAM > in a sorted enumeration for the writers. > For stored fields I use a two pass scheme where the documents are first > written to disk unsorted and then copied to another file with the correct > sorting. 
I use the same stored field format for the two steps and just remove > the file produced by the first pass at the end of the process. > This prototype has no implementation for index sorting that use term vectors > yet. I'll add this later if the tests are good enough. > Speaking of testing, I tried this branch on [~mikemccand] benchmark scripts > and compared master with index sorting against my branch with index sorting > on flush. I tried with sparsetaxis and wikipedia and the first results are > weird. When I use the SerialScheduler and only one thread to write the docs, > index sorting on flush is slower. But when I use two threads the sorting on > flush is much faster even with the SerialScheduler. I'll continue to run the > tests in order to be able to share something more meaningful. > The tests are passing except one about concurrent DV updates. I don't know > this part at all so I did not fix the test yet. I don't even know if we can > make it work with index sorting ;). > [~mikemccand] I would love to have your feedback about the prototype. Could > you please take a look ? I am sure there are plenty of bugs, ... but I think > it's a good start to evaluate the feasibility of this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
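The two-pass stored-fields scheme described above can be sketched in plain Java (this is an illustrative model, not the actual Lucene writer code): pass 1 writes documents in arrival order, and pass 2 copies them into a second file following a new-to-old doc map derived from the index sort, after which the pass-1 file is removed.

```java
import java.util.Arrays;

// Sketch of the two-pass stored-fields copy: the first pass writes docs
// in flush order; the second pass replays them in sorted order using a
// doc map, so the stored-fields format itself never needs to sort.
public class TwoPassStoredFields {
    // newToOld[newDoc] = the first-pass (old) doc id that ends up at newDoc.
    static String[] secondPass(String[] firstPassDocs, int[] newToOld) {
        String[] sorted = new String[firstPassDocs.length];
        for (int newDoc = 0; newDoc < newToOld.length; newDoc++) {
            sorted[newDoc] = firstPassDocs[newToOld[newDoc]];
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] unsorted = {"docC", "docA", "docB"}; // flush (arrival) order
        int[] newToOld = {1, 2, 0};                   // index sort: A, B, C
        System.out.println(Arrays.toString(secondPass(unsorted, newToOld)));
        // [docA, docB, docC]
    }
}
```

Using the same stored-fields format for both passes, as the prototype does, means the second pass is a plain sequential copy driven by the doc map.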
[jira] [Created] (LUCENE-7579) Sorting on flushed segment
Ferenczi Jim created LUCENE-7579: Summary: Sorting on flushed segment Key: LUCENE-7579 URL: https://issues.apache.org/jira/browse/LUCENE-7579 Project: Lucene - Core Issue Type: Bug Reporter: Ferenczi Jim Today flushed segments built by an index writer with an index sort specified are not sorted. The merge is responsible for sorting these segments, potentially with others that are already sorted (resulting from another merge). I'd like to investigate the cost of sorting the segment directly during the flush. This could make the merge faster since there are some cheap optimizations that can be done only if all segments to be merged are sorted. For instance the merge of the points could use the bulk merge instead of rebuilding the points from scratch. I made a small prototype which sorts the segment on flush here: https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort The idea is simple: for points, norms, docvalues and terms I use the SortingLeafReader implementation to translate the values that we have in RAM into a sorted enumeration for the writers. For stored fields I use a two pass scheme where the documents are first written to disk unsorted and then copied to another file with the correct sorting. I use the same stored field format for the two steps and just remove the file produced by the first pass at the end of the process. This prototype has no implementation for index sorting that uses term vectors yet. I'll add this later if the tests are good enough. Speaking of testing, I tried this branch on [~mikemccand]'s benchmark scripts and compared master with index sorting against my branch with index sorting on flush. I tried with sparsetaxis and wikipedia and the first results are weird. When I use the SerialScheduler and only one thread to write the docs, index sorting on flush is slower. But when I use two threads the sorting on flush is much faster even with the SerialScheduler. 
I'll continue to run the tests in order to be able to share something more meaningful. The tests are passing except one about concurrent DV updates. I don't know this part at all so I did not fix the test yet. I don't even know if we can make it work with index sorting ;). [~mikemccand] I would love to have your feedback about the prototype. Could you please take a look? I am sure there are plenty of bugs, ... but I think it's a good start to evaluate the feasibility of this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support
[ https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706216#comment-15706216 ] Ferenczi Jim commented on LUCENE-7575: -- Hi [~dsmiley], I've attached a patch based on the comment above. I did not find a clean way to detect duplicates in the span queries extracted by the PhraseHelper when requireFieldMatch=false. I agree that it's not essential so I pushed the patch as is. Could you please take a look ? > UnifiedHighlighter: add requireFieldMatch=false support > --- > > Key: LUCENE-7575 > URL: https://issues.apache.org/jira/browse/LUCENE-7575 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley > Attachments: LUCENE-7575.patch > > > The UnifiedHighlighter (like the PostingsHighlighter) only supports > highlighting queries for the same fields that are being highlighted. The > original Highlighter and FVH support loosening this, AKA > requireFieldMatch=false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support
[ https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7575: - Attachment: LUCENE-7575.patch Patch for requireFieldMatch > UnifiedHighlighter: add requireFieldMatch=false support > --- > > Key: LUCENE-7575 > URL: https://issues.apache.org/jira/browse/LUCENE-7575 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley > Attachments: LUCENE-7575.patch > > > The UnifiedHighlighter (like the PostingsHighlighter) only supports > highlighting queries for the same fields that are being highlighted. The > original Highlighter and FVH support loosening this, AKA > requireFieldMatch=false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7574) Another TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695686#comment-15695686 ] Ferenczi Jim commented on LUCENE-7574: -- {quote}Can you explain the first patch a bit more? It felt correct to me to take whether the sort is reversed into account in order to compute the missing ordinal?{quote} Yes, but we do it twice, and that reverts the sort on missing ordinals, so my patch just removes one of the two. At line 176 in MultiSorter.CrossReaderComparator, reverseMul is applied to the result of the ordinals comparison. Applying it to missing ordinals as well would be fine, but we already applied reverseMul to the missing ordinal at line 160, so the result is reversed twice. I hope it makes sense. > Another TestIndexSorting failures > - > > Key: LUCENE-7574 > URL: https://issues.apache.org/jira/browse/LUCENE-7574 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 6.x, master (7.0) >Reporter: Ferenczi Jim > Attachments: LUCENE-7574-1.patch, LUCENE-7574-2.patch > > > TestIndexSorting still fails with some seeds: > {noformat} >[junit4] Suite: org.apache.lucene.index.TestIndexSorting >[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexSorting > -Dtests.method=testRandom3 -Dtests.seed=6E45BA611FCD7241 -Dtests.slow=true > -Dtests.locale=ko-KR -Dtests.timezone=America/Cordoba -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1 >[junit4] FAILURE 0.64s J1 | TestIndexSorting.testRandom3 <<< >[junit4]> Throwable #1: org.junit.ComparisonFailure: expected:<[449]> > but was:<[650]> >[junit4]> at > __randomizedtesting.SeedInfo.seed([6E45BA611FCD7241:CC9DF4BB7B3F5B47]:0) >[junit4]> at > org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2264) >[junit4]> at java.lang.Thread.run(Thread.java:745) >[junit4] 2> NOTE: leaving temporary files on disk at: > 
/var/lib/jenkins/workspace/apache+lucene-solr+branch_6x/lucene/build/core/test/J1/temp/lucene.index.TestIndexSorting_6E45BA611FCD7241-001 >[junit4] 2> NOTE: test params are: codec=Asserting(Lucene62): > {docs=PostingsFormat(name=MockRandom), norms=PostingsFormat(name=MockRandom), > positions=PostingsFormat(name=Memory doPackFST= true), > id=PostingsFormat(name=MockRandom), term_vectors=FSTOrd50}, > docValues:{multi_valued_long=DocValuesFormat(name=Direct), > double=DocValuesFormat(name=Lucene54), foo=DocValuesFormat(name=Direct), > numeric=DocValuesFormat(name=Lucene54), > positions=DocValuesFormat(name=Direct), > multi_valued_numeric=DocValuesFormat(name=Memory), > float=DocValuesFormat(name=Lucene54), int=DocValuesFormat(name=Memory), > long=DocValuesFormat(name=Lucene54), points=DocValuesFormat(name=Memory), > sorted=DocValuesFormat(name=Lucene54), > multi_valued_double=DocValuesFormat(name=Memory), > docs=DocValuesFormat(name=Memory), > multi_valued_string=DocValuesFormat(name=Memory), > norms=DocValuesFormat(name=Memory), bytes=DocValuesFormat(name=Memory), > binary=DocValuesFormat(name=Lucene54), id=DocValuesFormat(name=Memory), > multi_valued_int=DocValuesFormat(name=Lucene54), > multi_valued_bytes=DocValuesFormat(name=Lucene54), > multi_valued_float=DocValuesFormat(name=Lucene54), > term_vectors=DocValuesFormat(name=Lucene54)}, maxPointsInLeafNode=419, > maxMBSortInHeap=7.394324294878203, > sim=RandomSimilarity(queryNorm=true,coord=yes): {positions=DFR I(n)Z(0.3), > id=IB SPL-L1, term_vectors=DFR I(ne)B3(800.0)}, locale=ko-KR, > timezone=America/Cordoba >[junit4] 2> NOTE: Linux 3.12.60-52.54-default amd64/Oracle Corporation > 1.8.0_111 (64-bit)/cpus=4,threads=1,free=105782608,total=240648192 >[junit4] 2> NOTE: All tests run in this JVM: [TestPayloads, > TestSnapshotDeletionPolicy, TestDocValues, FuzzyTermOnShortTermsTest, > TestNoMergeScheduler, TestPointValues, TestSegmentInfos, TestStressIndexing2, > TestSimpleExplanationsOfNonMatches, 
TestPrefixInBooleanQuery, TestTermQuery, > TestSegmentMerger, TestByteArrayDataInput, TestTransactions, TestMultiFields, > TestNRTReaderCleanup, TestPackedInts, TestIndexWriterExceptions, > TestSleepingLockWrapper, TestBlockPostingsFormat2, TestLSBRadixSorter, > TestSwappedIndexFiles, TestIndexWriterCommit, TestPrefixRandom, > Test4GBStoredFields, TestFuzzyQuery, TestCodecUtil, > TestSimpleSearchEquivalence, TestWeakIdentityMap, TestIndexWriterOnDiskFull, > TestTopDocsMerge, TestOmitTf, TestDuelingCodecs, TestRAMDirectory, > TestFlushByRamOrCountsPolicy, TestDemo, TestSimpleExplanationsWithFillerDocs, > TestByteSlices, TestParallelLeafReader, TestSortedSetSelector, > TestBagOfPostings, TestDemoParallelLeafReader, TestTopFieldCollector, > TestSearchForDuplicates, TestStringHelper, TestTragicIndexWriter
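The double-reverse bug discussed in the comment above can be modelled in plain Java (this is an illustrative sketch, not the actual MultiSorter code): if reverseMul is already folded into the missing-value ordinal upstream and then applied again to the comparison result, the two multiplications cancel for missing values.

```java
// Plain-Java model of applying reverseMul once vs. twice. With a
// descending sort (reverseMul = -1), applying it a second time cancels
// the first application and silently restores ascending order.
public class DoubleReverseSketch {
    // Correct: reverseMul applied exactly once, to the comparison result.
    static int compareOnce(int ordA, int ordB, int reverseMul) {
        return reverseMul * Integer.compare(ordA, ordB);
    }

    // Buggy: the ordinals were already multiplied by reverseMul upstream,
    // and the result is multiplied again -- the reversals cancel out.
    static int compareTwice(int ordA, int ordB, int reverseMul) {
        return reverseMul * Integer.compare(reverseMul * ordA, reverseMul * ordB);
    }

    public static void main(String[] args) {
        int reverseMul = -1; // descending sort
        // Correct: with a reversed sort the larger ordinal sorts first.
        System.out.println(compareOnce(1, 2, reverseMul));  // 1
        // Buggy: order is ascending again despite reverseMul = -1.
        System.out.println(compareTwice(1, 2, reverseMul)); // -1
    }
}
```

Removing one of the two applications, as the patch does, restores a single reversal.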
[jira] [Commented] (LUCENE-7569) TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695345#comment-15695345 ] Ferenczi Jim commented on LUCENE-7569: -- I opened https://issues.apache.org/jira/browse/LUCENE-7574 for the recent failures. Sorry for the noise. > TestIndexSorting failures > - > > Key: LUCENE-7569 > URL: https://issues.apache.org/jira/browse/LUCENE-7569 > Project: Lucene - Core > Issue Type: Bug >Reporter: Steve Rowe >Assignee: Michael McCandless >Priority: Blocker > Fix For: master (7.0), 6.4 > > Attachments: LUCENE-7569.patch > > > My Jenkins found two reproducing seeds on branch_6x - these look different, > but the failures happened on consecutive nightly runs: > {noformat} > Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 > (refs/remotes/origin/branch_6x) > [...] > [junit4] Suite: org.apache.lucene.index.TestIndexSorting >[junit4] 2> ??? 18, 2016 9:50:39 AM > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException >[junit4] 2> WARNING: Uncaught exception in thread: Thread[Lucene Merge > Thread #0,5,TGRP-TestIndexSorting] >[junit4] 2> org.apache.lucene.index.MergePolicy$MergeException: > java.lang.AssertionError: nextValue=4594289799775307848 vs > previous=4606302611760746829 >[junit4] 2>at > __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648) >[junit4] 2> Caused by: java.lang.AssertionError: > nextValue=4594289799775307848 vs previous=4606302611760746829 >[junit4] 2>at > org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152) >[junit4] 2>at > 
org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470) >[junit4] 2>at > org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243) >[junit4] 2>at > org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153) >[junit4] 2>at > org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167) >[junit4] 2>at > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111) >[junit4] 2>at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320) >[junit4] 2>at > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) >[junit4] 2> >[junit4] 2> NOTE: download the large Jenkins line-docs file by running > 'ant get-jenkins-line-docs' in the lucene directory. 
>[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexSorting > -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 > -Dtests.nightly=true -Dtests.slow=true > -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt > -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1 >[junit4] ERROR 14.9s J5 | TestIndexSorting.testRandom3 <<< >[junit4]> Throwable #1: > org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed >[junit4]>at > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748) >[junit4]>at > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762) >[junit4]>at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566) >[junit4]>at > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315) >[junit4]>at > org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019) >[junit4]>at java.lang.Thread.run(Thread.java:745) >[junit4]> Caused by: java.lang.AssertionError: > nextValue=4594289799775307848 vs previous=4606302611760746829 >[junit4]>at > org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152) >[junit4]>at > org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470) >[junit4]>at >
[jira] [Updated] (LUCENE-7574) Another TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7574: - Attachment: LUCENE-7574-2.patch This second patch is for the testNumericAlreadySorted failure and should be applied on master and branch_6x. This test didn't expect that a merge with a single segment was possible. > Another TestIndexSorting failures > - > > Key: LUCENE-7574 > URL: https://issues.apache.org/jira/browse/LUCENE-7574 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 6.x, master (7.0) >Reporter: Ferenczi Jim > Attachments: LUCENE-7574-1.patch, LUCENE-7574-2.patch > > > TestIndexSorting still fails with some seeds: > {noformat} >[junit4] Suite: org.apache.lucene.index.TestIndexSorting >[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexSorting > -Dtests.method=testRandom3 -Dtests.seed=6E45BA611FCD7241 -Dtests.slow=true > -Dtests.locale=ko-KR -Dtests.timezone=America/Cordoba -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1 >[junit4] FAILURE 0.64s J1 | TestIndexSorting.testRandom3 <<< >[junit4]> Throwable #1: org.junit.ComparisonFailure: expected:<[449]> > but was:<[650]> >[junit4]> at > __randomizedtesting.SeedInfo.seed([6E45BA611FCD7241:CC9DF4BB7B3F5B47]:0) >[junit4]> at > org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2264) >[junit4]> at java.lang.Thread.run(Thread.java:745) >[junit4] 2> NOTE: leaving temporary files on disk at: > /var/lib/jenkins/workspace/apache+lucene-solr+branch_6x/lucene/build/core/test/J1/temp/lucene.index.TestIndexSorting_6E45BA611FCD7241-001 >[junit4] 2> NOTE: test params are: codec=Asserting(Lucene62): > {docs=PostingsFormat(name=MockRandom), norms=PostingsFormat(name=MockRandom), > positions=PostingsFormat(name=Memory doPackFST= true), > id=PostingsFormat(name=MockRandom), term_vectors=FSTOrd50}, > docValues:{multi_valued_long=DocValuesFormat(name=Direct), > 
double=DocValuesFormat(name=Lucene54), foo=DocValuesFormat(name=Direct), > numeric=DocValuesFormat(name=Lucene54), > positions=DocValuesFormat(name=Direct), > multi_valued_numeric=DocValuesFormat(name=Memory), > float=DocValuesFormat(name=Lucene54), int=DocValuesFormat(name=Memory), > long=DocValuesFormat(name=Lucene54), points=DocValuesFormat(name=Memory), > sorted=DocValuesFormat(name=Lucene54), > multi_valued_double=DocValuesFormat(name=Memory), > docs=DocValuesFormat(name=Memory), > multi_valued_string=DocValuesFormat(name=Memory), > norms=DocValuesFormat(name=Memory), bytes=DocValuesFormat(name=Memory), > binary=DocValuesFormat(name=Lucene54), id=DocValuesFormat(name=Memory), > multi_valued_int=DocValuesFormat(name=Lucene54), > multi_valued_bytes=DocValuesFormat(name=Lucene54), > multi_valued_float=DocValuesFormat(name=Lucene54), > term_vectors=DocValuesFormat(name=Lucene54)}, maxPointsInLeafNode=419, > maxMBSortInHeap=7.394324294878203, > sim=RandomSimilarity(queryNorm=true,coord=yes): {positions=DFR I(n)Z(0.3), > id=IB SPL-L1, term_vectors=DFR I(ne)B3(800.0)}, locale=ko-KR, > timezone=America/Cordoba >[junit4] 2> NOTE: Linux 3.12.60-52.54-default amd64/Oracle Corporation > 1.8.0_111 (64-bit)/cpus=4,threads=1,free=105782608,total=240648192 >[junit4] 2> NOTE: All tests run in this JVM: [TestPayloads, > TestSnapshotDeletionPolicy, TestDocValues, FuzzyTermOnShortTermsTest, > TestNoMergeScheduler, TestPointValues, TestSegmentInfos, TestStressIndexing2, > TestSimpleExplanationsOfNonMatches, TestPrefixInBooleanQuery, TestTermQuery, > TestSegmentMerger, TestByteArrayDataInput, TestTransactions, TestMultiFields, > TestNRTReaderCleanup, TestPackedInts, TestIndexWriterExceptions, > TestSleepingLockWrapper, TestBlockPostingsFormat2, TestLSBRadixSorter, > TestSwappedIndexFiles, TestIndexWriterCommit, TestPrefixRandom, > Test4GBStoredFields, TestFuzzyQuery, TestCodecUtil, > TestSimpleSearchEquivalence, TestWeakIdentityMap, TestIndexWriterOnDiskFull, > TestTopDocsMerge, 
TestOmitTf, TestDuelingCodecs, TestRAMDirectory, > TestFlushByRamOrCountsPolicy, TestDemo, TestSimpleExplanationsWithFillerDocs, > TestByteSlices, TestParallelLeafReader, TestSortedSetSelector, > TestBagOfPostings, TestDemoParallelLeafReader, TestTopFieldCollector, > TestSearchForDuplicates, TestStringHelper, TestTragicIndexWriterDeadlock, > TestDirectPacked, TestSloppyMath, TestPrefixQuery, TestSimpleFSDirectory, > TestFixedLengthBytesRefArray, TestIndexingSequenceNumbers, TestCharArraySet, > TestRollingBuffer, TestPagedBytes, TestFixedBitSet, TestAutomaton, > TestPhrasePrefixQuery, TestMultiPhraseEnum, TestBytesRefAttImpl, > TestDocsAndPositions, TestCharsRefBuilder, TestDeterminizeLexicon, > TestNIOFSDirectory, TestConju
[jira] [Updated] (LUCENE-7569) TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7569: - Attachment: (was: LUCENE-7574-1.patch) > TestIndexSorting failures > - > > Key: LUCENE-7569 > URL: https://issues.apache.org/jira/browse/LUCENE-7569 > Project: Lucene - Core > Issue Type: Bug >Reporter: Steve Rowe >Assignee: Michael McCandless >Priority: Blocker > Fix For: master (7.0), 6.4 > > Attachments: LUCENE-7569.patch > > > My Jenkins found two reproducing seeds on branch_6x - these look different, > but the failures happened on consecutive nightly runs: > {noformat} > Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 > (refs/remotes/origin/branch_6x) > [...] > [junit4] Suite: org.apache.lucene.index.TestIndexSorting >[junit4] 2> ??? 18, 2016 9:50:39 AM > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException >[junit4] 2> WARNING: Uncaught exception in thread: Thread[Lucene Merge > Thread #0,5,TGRP-TestIndexSorting] >[junit4] 2> org.apache.lucene.index.MergePolicy$MergeException: > java.lang.AssertionError: nextValue=4594289799775307848 vs > previous=4606302611760746829 >[junit4] 2>at > __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648) >[junit4] 2> Caused by: java.lang.AssertionError: > nextValue=4594289799775307848 vs previous=4606302611760746829 >[junit4] 2>at > org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152) >[junit4] 2>at > org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470) >[junit4] 2>at > 
org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243) >[junit4] 2>at > org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153) >[junit4] 2>at > org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167) >[junit4] 2>at > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111) >[junit4] 2>at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320) >[junit4] 2>at > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) >[junit4] 2> >[junit4] 2> NOTE: download the large Jenkins line-docs file by running > 'ant get-jenkins-line-docs' in the lucene directory. >[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexSorting > -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 > -Dtests.nightly=true -Dtests.slow=true > -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt > -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1 >[junit4] ERROR 14.9s J5 | TestIndexSorting.testRandom3 <<< >[junit4]> Throwable #1: > org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed >[junit4]>at > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748) >[junit4]>at > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762) >[junit4]>at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566) >[junit4]>at > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315) >[junit4]>at > org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019) >[junit4]>at java.lang.Thread.run(Thread.java:745) >[junit4]> Caused by: java.lang.AssertionError: > 
nextValue=4594289799775307848 vs previous=4606302611760746829 >[junit4]>at > org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152) >[junit4]>at > org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470) >[junit4]>at > org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243) >[junit4]>at > org.apac
[jira] [Issue Comment Deleted] (LUCENE-7569) TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7569: - Comment: was deleted (was: This is a patch for branch_6x only. It fixes the test failure on TestIndexSorting#testRandom3. master is not affected since the bug was introduced while porting a master patch to the 6x branch: we reverse the sort twice when comparing two multi-valued fields with no values.)
[jira] [Updated] (LUCENE-7574) Another TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7574: - Attachment: LUCENE-7574-1.patch This is a patch for branch_6x only. It fixes the test failure on TestIndexSorting#testRandom3. master is not affected since the bug was introduced while porting a master patch to the 6x branch: we reverse the sort twice when comparing two multi-valued fields with no values.
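The double-reversal bug described in the comment above can be illustrated with a small, self-contained sketch. This is hypothetical code, not the actual Lucene comparator (the class and method names are made up): when both the value comparator and its caller apply the reverse flag, the two negations cancel and a descending sort silently degrades to ascending, which is why the symptom is out-of-order values rather than an exception.

```java
// Hypothetical illustration, not Lucene code: the 6.x port applied the
// "reverse" flag in two places, so the two negations cancelled out.
public class DoubleReverseDemo {
    static int compareMissingAware(long a, long b, boolean reverse) {
        int cmp = Long.compare(a, b);
        return reverse ? -cmp : cmp;  // first application of the reverse flag
    }

    // Buggy caller: negates again for a reversed sort.
    static int buggyCompare(long a, long b, boolean reverse) {
        int cmp = compareMissingAware(a, b, reverse);
        return reverse ? -cmp : cmp;  // second application: cancels the first
    }

    public static void main(String[] args) {
        // With reverse=true, 3 should sort AFTER 5, i.e. compare(3, 5) > 0.
        System.out.println(compareMissingAware(3, 5, true) > 0); // true (correct)
        System.out.println(buggyCompare(3, 5, true) > 0);        // false (bug)
    }
}
```

The fix is presumably just to apply the reversal in a single place, as the master branch already does.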
[jira] [Updated] (LUCENE-7569) TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7569: - Attachment: LUCENE-7574-1.patch This is a patch for branch_6x only. It fixes the test failure on TestIndexSorting#testRandom3. master is not affected since the bug was introduced while porting a master patch to the 6x branch: we reverse the sort twice when comparing two multi-valued fields with no values.
[jira] [Created] (LUCENE-7574) Another TestIndexSorting failures
Ferenczi Jim created LUCENE-7574: Summary: Another TestIndexSorting failures Key: LUCENE-7574 URL: https://issues.apache.org/jira/browse/LUCENE-7574 Project: Lucene - Core Issue Type: Bug Components: core/index Affects Versions: 6.x, master (7.0) Reporter: Ferenczi Jim TestIndexSorting still fails with some seeds: {noformat} [junit4] Suite: org.apache.lucene.index.TestIndexSorting [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexSorting -Dtests.method=testRandom3 -Dtests.seed=6E45BA611FCD7241 -Dtests.slow=true -Dtests.locale=ko-KR -Dtests.timezone=America/Cordoba -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] FAILURE 0.64s J1 | TestIndexSorting.testRandom3 <<< [junit4]> Throwable #1: org.junit.ComparisonFailure: expected:<[449]> but was:<[650]> [junit4]>at __randomizedtesting.SeedInfo.seed([6E45BA611FCD7241:CC9DF4BB7B3F5B47]:0) [junit4]>at org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2264) [junit4]>at java.lang.Thread.run(Thread.java:745) [junit4] 2> NOTE: leaving temporary files on disk at: /var/lib/jenkins/workspace/apache+lucene-solr+branch_6x/lucene/build/core/test/J1/temp/lucene.index.TestIndexSorting_6E45BA611FCD7241-001 [junit4] 2> NOTE: test params are: codec=Asserting(Lucene62): {docs=PostingsFormat(name=MockRandom), norms=PostingsFormat(name=MockRandom), positions=PostingsFormat(name=Memory doPackFST= true), id=PostingsFormat(name=MockRandom), term_vectors=FSTOrd50}, docValues:{multi_valued_long=DocValuesFormat(name=Direct), double=DocValuesFormat(name=Lucene54), foo=DocValuesFormat(name=Direct), numeric=DocValuesFormat(name=Lucene54), positions=DocValuesFormat(name=Direct), multi_valued_numeric=DocValuesFormat(name=Memory), float=DocValuesFormat(name=Lucene54), int=DocValuesFormat(name=Memory), long=DocValuesFormat(name=Lucene54), points=DocValuesFormat(name=Memory), sorted=DocValuesFormat(name=Lucene54), multi_valued_double=DocValuesFormat(name=Memory), docs=DocValuesFormat(name=Memory), 
multi_valued_string=DocValuesFormat(name=Memory), norms=DocValuesFormat(name=Memory), bytes=DocValuesFormat(name=Memory), binary=DocValuesFormat(name=Lucene54), id=DocValuesFormat(name=Memory), multi_valued_int=DocValuesFormat(name=Lucene54), multi_valued_bytes=DocValuesFormat(name=Lucene54), multi_valued_float=DocValuesFormat(name=Lucene54), term_vectors=DocValuesFormat(name=Lucene54)}, maxPointsInLeafNode=419, maxMBSortInHeap=7.394324294878203, sim=RandomSimilarity(queryNorm=true,coord=yes): {positions=DFR I(n)Z(0.3), id=IB SPL-L1, term_vectors=DFR I(ne)B3(800.0)}, locale=ko-KR, timezone=America/Cordoba [junit4] 2> NOTE: Linux 3.12.60-52.54-default amd64/Oracle Corporation 1.8.0_111 (64-bit)/cpus=4,threads=1,free=105782608,total=240648192 [junit4] 2> NOTE: All tests run in this JVM: [TestPayloads, TestSnapshotDeletionPolicy, TestDocValues, FuzzyTermOnShortTermsTest, TestNoMergeScheduler, TestPointValues, TestSegmentInfos, TestStressIndexing2, TestSimpleExplanationsOfNonMatches, TestPrefixInBooleanQuery, TestTermQuery, TestSegmentMerger, TestByteArrayDataInput, TestTransactions, TestMultiFields, TestNRTReaderCleanup, TestPackedInts, TestIndexWriterExceptions, TestSleepingLockWrapper, TestBlockPostingsFormat2, TestLSBRadixSorter, TestSwappedIndexFiles, TestIndexWriterCommit, TestPrefixRandom, Test4GBStoredFields, TestFuzzyQuery, TestCodecUtil, TestSimpleSearchEquivalence, TestWeakIdentityMap, TestIndexWriterOnDiskFull, TestTopDocsMerge, TestOmitTf, TestDuelingCodecs, TestRAMDirectory, TestFlushByRamOrCountsPolicy, TestDemo, TestSimpleExplanationsWithFillerDocs, TestByteSlices, TestParallelLeafReader, TestSortedSetSelector, TestBagOfPostings, TestDemoParallelLeafReader, TestTopFieldCollector, TestSearchForDuplicates, TestStringHelper, TestTragicIndexWriterDeadlock, TestDirectPacked, TestSloppyMath, TestPrefixQuery, TestSimpleFSDirectory, TestFixedLengthBytesRefArray, TestIndexingSequenceNumbers, TestCharArraySet, TestRollingBuffer, TestPagedBytes, TestFixedBitSet, 
TestAutomaton, TestPhrasePrefixQuery, TestMultiPhraseEnum, TestBytesRefAttImpl, TestDocsAndPositions, TestCharsRefBuilder, TestDeterminizeLexicon, TestNIOFSDirectory, TestConjunctionDISI, TestLiveFieldValues, TestBoolean2, TestHighCompressionMode, TestIndexWriterUnicode, TestCachingCollector, TestMultiDocValues, TestFilterWeight, TestPerFieldPostingsFormat2, TestBytesRefHash, TestBooleanQueryVisitSubscorers, TestMatchAllDocsQuery, TestBinaryTerms, TestPositionIncrement, TestNumericTokenStream, TestDateTools, Test2BPostings, TestBinaryDocument, TestBooleanScorer, TestNot, TestReaderClosed, TestNGramPhraseQuery, TestSimpleAttributeImpl, Test2BPostingsBytes, Test2BTerms, TestReusableStringReader, TestLucene50StoredFieldsF
[jira] [Commented] (LUCENE-7569) TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693248#comment-15693248 ] Ferenczi Jim commented on LUCENE-7569: -- {quote}I'm not sure why I didn't realize we must do so as well for the sorted set case.{quote} I am not sure it is needed. The iterations are stateless in 6.x: it is just an iteration over the doc ids and the norm ids, and we call setDocument on the underlying doc values reader each time we need to access it?
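The stateless-access argument can be made concrete with a toy model (hypothetical code, not the 6.x API itself): because every per-document read starts with its own setDocument call, two consumers can interleave reads on a shared instance without corrupting each other's results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of the 6.x random-access style described above: every read
// re-positions with setDocument, so interleaved per-document reads on a
// shared instance stay correct.
public class StatelessAccessDemo {
    static class RandomAccessValues {
        private final Map<Integer, long[]> perDoc;
        private long[] current = new long[0];
        RandomAccessValues(Map<Integer, long[]> perDoc) { this.perDoc = perDoc; }
        void setDocument(int doc) { current = perDoc.getOrDefault(doc, new long[0]); }
        int count() { return current.length; }
        long valueAt(int i) { return current[i]; }
    }

    static List<Long> readDoc(RandomAccessValues v, int doc) {
        v.setDocument(doc);  // always re-position before reading
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < v.count(); i++) out.add(v.valueAt(i));
        return out;
    }

    public static void main(String[] args) {
        RandomAccessValues shared = new RandomAccessValues(
            Map.of(0, new long[] {1, 2}, 1, new long[] {7}));
        // Two consumers can alternate between documents on the SAME instance;
        // each read is still correct because it begins with setDocument.
        System.out.println(readDoc(shared, 0)); // [1, 2]
        System.out.println(readDoc(shared, 1)); // [7]
        System.out.println(readDoc(shared, 0)); // [1, 2]
    }
}
```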
[jira] [Updated] (LUCENE-7569) TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7569: - Attachment: LUCENE-7569.patch It turns out that the problem is just a test bug related to MockRandomMergePolicy. This merge policy is only used in tests and randomly wraps the readers to be merged in a SlowCodecReaderWrapper in order to deactivate bulk merging. The test passes if I add a MergeReaderWrapper around the original reader; this makes index sorting happy since the doc values instances are now re-created each time. The bottom line is that merging of multi-valued doc values works fine even when index sorting is on, except in these tests where the MockRandomMergePolicy disables the bulk merge. I've attached a patch that fixes this bug and the other test bug on testNumericAlreadySorted. The patch is for the master branch (since testNumericAlreadySorted should also fail on this branch) but the backport should be straightforward.
[jira] [Commented] (LUCENE-7569) TestIndexSorting failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691060#comment-15691060 ] Ferenczi Jim commented on LUCENE-7569: -- Thanks [~sar...@syr.edu]. We've investigated this with [~jpountz] and found the issue. Index sorting during a merge uses a SortingLeafReader, which keeps a per-thread cache of the doc values readers. This breaks the merging of SortedSetDocValues and SortedNumericDocValues since we need to iterate over two instances of these doc values in parallel during the merge (see DocValuesConsumer.mergeSortedNumericField). [~jpountz]'s idea to fix this bug is to rewrite SortingLeafReader as a SortingCodecReader; the SortingCodecReader would never reuse the same instance when creating a doc values reader. The bug only exists in 6.x, since in master the doc values readers are now iterators that are re-created each time getDocValues is called. The second issue, about already-sorted indices, is just a test bug that appears when the random merge policy picks a merge factor of 2. I'll send a patch for these two issues shortly. The first issue is problematic though, because it means that 6.x indices that use index sorting (on any field, multi-valued or not) cannot merge multi-valued doc values properly.
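A toy model of the failure mode diagnosed above (hypothetical code, not Lucene's) shows why a per-thread cache breaks a merge that needs two parallel passes: both passes receive the same stateful iterator and steal elements from each other, so each one sees only an interleaved subset of the stream. This is the kind of inconsistency the nextValue-vs-previous assertion in AssertingDocValuesFormat ends up catching.

```java
import java.util.Iterator;
import java.util.List;

// Toy model (not Lucene code) of SortingLeafReader's per-thread cache
// handing the SAME stateful iterator to both merge passes.
public class SharedIteratorDemo {
    static final List<Long> VALUES = List.of(1L, 2L, 3L, 4L);

    static Iterator<Long> cached;
    static Iterator<Long> getDocValues() {
        if (cached == null) cached = VALUES.iterator();
        return cached;  // bug: every caller gets the same stateful instance
    }

    public static void main(String[] args) {
        Iterator<Long> counts = getDocValues();  // merge pass 1
        Iterator<Long> values = getDocValues();  // merge pass 2: SAME object!
        // Each pass expects to see 1,2,3,4, but they consume from the same
        // cursor, so the values appear to skip around.
        System.out.println(counts.next()); // 1
        System.out.println(values.next()); // 2 (expected 1)
    }
}
```

A reader that returns a fresh iterator per getDocValues call (as master does, and as the proposed SortingCodecReader would) avoids this by construction.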
[jira] [Commented] (LUCENE-7569) TestIndexSorting.testRandom3() failures
[ https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690219#comment-15690219 ] Ferenczi Jim commented on LUCENE-7569: -- I am looking at it. Seems like a bug when sorting multi_valued doc values. > TestIndexSorting.testRandom3() failures > --- > > Key: LUCENE-7569 > URL: https://issues.apache.org/jira/browse/LUCENE-7569 > Project: Lucene - Core > Issue Type: Bug >Reporter: Steve Rowe >Assignee: Michael McCandless >Priority: Blocker > Fix For: master (7.0), 6.4 > > > My Jenkins found two reproducing seeds on branch_6x - these look different, > but the failures happened on consecutive nightly runs: > {noformat} > Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 > (refs/remotes/origin/branch_6x) > [...] > [junit4] Suite: org.apache.lucene.index.TestIndexSorting >[junit4] 2> ??? 18, 2016 9:50:39 AM > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException >[junit4] 2> WARNING: Uncaught exception in thread: Thread[Lucene Merge > Thread #0,5,TGRP-TestIndexSorting] >[junit4] 2> org.apache.lucene.index.MergePolicy$MergeException: > java.lang.AssertionError: nextValue=4594289799775307848 vs > previous=4606302611760746829 >[junit4] 2>at > __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648) >[junit4] 2> Caused by: java.lang.AssertionError: > nextValue=4594289799775307848 vs previous=4606302611760746829 >[junit4] 2>at > org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152) >[junit4] 2>at > org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470) >[junit4] 2>at > 
org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243) >[junit4] 2>at > org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153) >[junit4] 2>at > org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167) >[junit4] 2>at > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111) >[junit4] 2>at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320) >[junit4] 2>at > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) >[junit4] 2>at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) >[junit4] 2> >[junit4] 2> NOTE: download the large Jenkins line-docs file by running > 'ant get-jenkins-line-docs' in the lucene directory. >[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexSorting > -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 > -Dtests.nightly=true -Dtests.slow=true > -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt > -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1 >[junit4] ERROR 14.9s J5 | TestIndexSorting.testRandom3 <<< >[junit4]> Throwable #1: > org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed >[junit4]>at > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748) >[junit4]>at > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762) >[junit4]>at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566) >[junit4]>at > org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315) >[junit4]>at > org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019) >[junit4]>at java.lang.Thread.run(Thread.java:745) >[junit4]> Caused by: java.lang.AssertionError: > 
nextValue=4594289799775307848 vs previous=4606302611760746829 >[junit4]>at > org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152) >[junit4]>at > org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470) >[junit4]>at > org.apache.lucene.codecs.DocValuesConsumer.mer
[jira] [Updated] (LUCENE-7568) Optimize merge when index sorting is used but the index is already sorted
[ https://issues.apache.org/jira/browse/LUCENE-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7568: - Attachment: LUCENE-7568.patch Thanks for the review [~mikemccand]. I've modified the test with your suggestions. I am not sure I use the FilterCodec appropriately though (especially how I choose the delegating codec), can you take a look? > Optimize merge when index sorting is used but the index is already sorted > - > > Key: LUCENE-7568 > URL: https://issues.apache.org/jira/browse/LUCENE-7568 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Attachments: LUCENE-7568.patch, LUCENE-7568.patch > > > When the index sorting is defined a lot of optimizations are disabled during > the merge. For instance the bulk merge of the compressing stored fields is > disabled since documents are not merged sequentially. Though it can happen > that index sorting is enabled but the index is already in sorted order (the > sort field is not filled or filled with the same value for all documents). In > such case we can detect that the sort is not needed and activate the merge > optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7568) Optimize merge when index sorting is used but the index is already sorted
[ https://issues.apache.org/jira/browse/LUCENE-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7568: - Attachment: LUCENE-7568.patch Here is a first patch that detects if an index is already sorted and makes this information available through MergeState. This information is then used by all the merge strategies to activate (or not) some optimizations. > Optimize merge when index sorting is used but the index is already sorted > - > > Key: LUCENE-7568 > URL: https://issues.apache.org/jira/browse/LUCENE-7568 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Attachments: LUCENE-7568.patch > > > When the index sorting is defined a lot of optimizations are disabled during > the merge. For instance the bulk merge of the compressing stored fields is > disabled since documents are not merged sequentially. Though it can happen > that index sorting is enabled but the index is already in sorted order (the > sort field is not filled or filled with the same value for all documents). In > such case we can detect that the sort is not needed and activate the merge > optimization.
[jira] [Created] (LUCENE-7568) Optimize merge when index sorting is used but the index is already sorted
Ferenczi Jim created LUCENE-7568: Summary: Optimize merge when index sorting is used but the index is already sorted Key: LUCENE-7568 URL: https://issues.apache.org/jira/browse/LUCENE-7568 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Ferenczi Jim When index sorting is defined, many optimizations are disabled during the merge. For instance, the bulk merge of the compressing stored fields is disabled since documents are not merged sequentially. It can happen, though, that index sorting is enabled but the index is already in sorted order (the sort field is not filled, or is filled with the same value for all documents). In that case we can detect that the sort is not needed and activate the merge optimizations.
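The optimization described above boils down to a cheap monotonicity check: if the sort keys are already non-decreasing in docID order, reordering the documents is a no-op and the merge can keep its sequential bulk-copy paths. A minimal pure-Java sketch of that check (class and method names are hypothetical illustrations, not Lucene's actual MergeState API):

```java
// Hypothetical sketch of the "already sorted" detection from LUCENE-7568.
// Not Lucene's real API: in the actual patch this information is computed
// per segment and exposed through MergeState.
final class SortCheck {

    /**
     * Returns true when the per-document sort keys are already in
     * ascending docID order, i.e. sorting would not move any document
     * and the merge can keep its bulk-copy optimizations.
     */
    static boolean alreadySorted(long[] sortKeys) {
        for (int doc = 1; doc < sortKeys.length; doc++) {
            if (sortKeys[doc] < sortKeys[doc - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that the common cases called out in the issue (sort field never filled, or filled with one constant value) both pass this check, since a constant sequence is trivially non-decreasing.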
[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7537: - Attachment: LUCENE-7537.patch Oh right "sorted_string" is ambiguous. Here is another patch with the renaming to "multi_valued" for string and numerics. Thanks [~mikemccand] > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Fix For: master (7.0), 6.4 > > Attachments: LUCENE-7537.patch, LUCENE-7537.patch, LUCENE-7537.patch, > LUCENE-7537.patch, LUCENE-7537.patch > > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7537: - Attachment: LUCENE-7537.patch Thanks [~mikemccand], I attached a new patch that addresses your comments. I can also make another patch for 6.4 if needed. > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Fix For: master (7.0), 6.4 > > Attachments: LUCENE-7537.patch, LUCENE-7537.patch, LUCENE-7537.patch, > LUCENE-7537.patch > > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7537: - Attachment: LUCENE-7537.patch Thanks [~mikemccand]. Sorry, I didn't run the full tests for the last patch. I've attached a new one which passes all the tests. I fixed the exceptions and the SimpleText codec. Could you take another look? > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Attachments: LUCENE-7537.patch, LUCENE-7537.patch, LUCENE-7537.patch > > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Closed] (LUCENE-7552) FastVectorHighlighter ignores position in PhraseQuery
[ https://issues.apache.org/jira/browse/LUCENE-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim closed LUCENE-7552. Resolution: Duplicate > FastVectorHighlighter ignores position in PhraseQuery > - > > Key: LUCENE-7552 > URL: https://issues.apache.org/jira/browse/LUCENE-7552 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Ferenczi Jim >Priority: Minor > > The PhraseQuery contains a list of terms and the positions for each term. The > FVH ignores the term position and assumes that a phrase query is always > dense. As a result phrase query with gaps are not highlighted at all. This is > problematic for text fields that use a FilteringTokenFilter. This token > filter removes tokens but preserves the position increment of each removal. > Bottom line is that using this token filter breaks the highlighting of phrase > query that contains filtered tokens. >
[jira] [Created] (LUCENE-7552) FastVectorHighlighter ignores position in PhraseQuery
Ferenczi Jim created LUCENE-7552: Summary: FastVectorHighlighter ignores position in PhraseQuery Key: LUCENE-7552 URL: https://issues.apache.org/jira/browse/LUCENE-7552 Project: Lucene - Core Issue Type: Bug Components: core/search Reporter: Ferenczi Jim Priority: Minor The PhraseQuery contains a list of terms and the positions for each term. The FVH ignores the term positions and assumes that a phrase query is always dense. As a result, phrase queries with gaps are not highlighted at all. This is problematic for text fields that use a FilteringTokenFilter. This token filter removes tokens but preserves the position increment of each removal. Bottom line is that using this token filter breaks the highlighting of phrase queries that contain filtered tokens.
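The behavior the FVH is missing can be illustrated without Lucene: each phrase term carries an explicit position, and a match must respect the gaps left by removed tokens (which keep their position increment) rather than assuming the phrase is dense. A minimal pure-Java sketch under that assumption (the index map and class names are hypothetical, not Lucene's PhraseQuery/FVH API):

```java
import java.util.List;
import java.util.Map;

// Hypothetical position-aware phrase matcher. index maps each term to the
// positions where it occurs; terms[i] must occur at start + positions[i]
// for some common start offset, so gaps (e.g. a removed stopword) are kept.
final class PhraseMatch {

    static boolean matches(Map<String, List<Integer>> index,
                           String[] terms, int[] positions) {
        List<Integer> anchors = index.get(terms[0]);
        if (anchors == null) return false;
        outer:
        for (int anchor : anchors) {
            int start = anchor - positions[0];
            for (int i = 1; i < terms.length; i++) {
                List<Integer> occ = index.get(terms[i]);
                if (occ == null || !occ.contains(start + positions[i])) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }
}
```

For a document "quick [stopword] fox" where the stopword was filtered out but its position increment preserved, the phrase ("quick", "fox") with positions (0, 2) matches, while the dense assumption (0, 1) does not — which is exactly the highlighting the issue reports as broken.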
[jira] [Created] (LUCENE-7551) FastVectorHighlighter ignores position in PhraseQuery
Ferenczi Jim created LUCENE-7551: Summary: FastVectorHighlighter ignores position in PhraseQuery Key: LUCENE-7551 URL: https://issues.apache.org/jira/browse/LUCENE-7551 Project: Lucene - Core Issue Type: Bug Components: core/search Reporter: Ferenczi Jim Priority: Minor The PhraseQuery contains a list of terms and the positions for each term. The FVH ignores the term position and assumes that a phrase query is always dense. As a result phrase query with gaps are not highlighted at all. This is problematic for text fields that use a FilteringTokenFilter. This token filter removes tokens but preserves the position increment of each removal. Bottom line is that using this token filter breaks the highlighting of phrase query that contains filtered tokens.
[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7537: - Attachment: LUCENE-7537.patch I published a new patch which adds index sort support for SortedSetSortField and SortedNumericSortField. [~mikemccand], can you take a look? > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Attachments: LUCENE-7537.patch, LUCENE-7537.patch > > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Commented] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647978#comment-15647978 ] Ferenczi Jim commented on LUCENE-7537: -- Thanks [~mikemccand]. I tried this approach and then added the types to clean up the serialization and the index sorting check ;) I can totally revert to the first version which does what you say. > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Attachments: LUCENE-7537.patch > > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Commented] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647858#comment-15647858 ] Ferenczi Jim commented on LUCENE-7537: -- > The new types do not look useful to me? It's to differentiate the underlying DVs and also because I didn't want to change the expectation of the native sort. Though I am totally for a single type that accepts both DVs if changing the SortField native types is ok. > For instance, DocValues.getSortedSet falls back to > LeafReader.getSortedDocValues if the reader does not have SORTED_SET doc > values, so all the code that you protected under eg. if (sortField.getType() > == SortField.Type.SORTED_STRING) would also work with single-valued (SORTED) > doc values (same for SORTED_NUMERIC and NUMERIC doc values). The leniency is here to catch SortedSetDocValues that end up with a single value per field. But yes, it's another point for the merged type. > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Attachments: LUCENE-7537.patch > > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7537: - Attachment: LUCENE-7537.patch Here is a simple patch that adds support for multi-valued sort directly in SortField. It defines five new sort types (sorted_string, sorted_long, sorted_double, sorted_float, sorted_int) and uses the Sorted{Set|Numeric}Selector for sorting. The natural order picks the minimum value of the list for each document and the reverse order picks the maximum. This patch also fixes a small bug which showed up in unit tests when using index sorting with a reverse sort and a missing value. > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim > Attachments: LUCENE-7537.patch > > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Commented] (LUCENE-7537) Add multi valued field support to index sorting
[ https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636366#comment-15636366 ] Ferenczi Jim commented on LUCENE-7537: -- Oh I've already started to work on a patch with the logic described above ;) I'll post it shortly. Thanks [~mikemccand]. > Add multi valued field support to index sorting > --- > > Key: LUCENE-7537 > URL: https://issues.apache.org/jira/browse/LUCENE-7537 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Ferenczi Jim >Assignee: Michael McCandless > > Today index sorting can be done on single valued field through the > NumericDocValues (for numerics) and SortedDocValues (for strings). > I'd like to add the ability to sort on multi valued fields. Since index > sorting does not accept custom comparator we could just take the minimum > value of each document for an ascending sort and the maximum value for a > descending sort. > This way we could handle all cases instead of throwing an exception during a > merge when we encounter a multi valued DVs.
[jira] [Created] (LUCENE-7537) Add multi valued field support to index sorting
Ferenczi Jim created LUCENE-7537: Summary: Add multi valued field support to index sorting Key: LUCENE-7537 URL: https://issues.apache.org/jira/browse/LUCENE-7537 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Ferenczi Jim Today index sorting can be done on single-valued fields through the NumericDocValues (for numerics) and SortedDocValues (for strings). I'd like to add the ability to sort on multi-valued fields. Since index sorting does not accept a custom comparator, we could just take the minimum value of each document for an ascending sort and the maximum value for a descending sort. This way we could handle all cases instead of throwing an exception during a merge when we encounter multi-valued DVs.
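The selection rule proposed above (minimum value for an ascending sort, maximum for a descending one) reduces each multi-valued document to a single comparable key; in Lucene this is what the Sorted{Set|Numeric}Selector classes mentioned later in the thread provide. A minimal pure-Java sketch of the rule itself (class and method names are hypothetical, not Lucene's API):

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration of min/max selection for multi-valued index
// sorting: each document is reduced to one key (min for ascending,
// max for descending) and documents are ordered by that key.
final class MultiValuedSort {

    /** Returns docIDs in sorted order; each doc has >= 1 value. */
    static Integer[] sortDocs(long[][] valuesPerDoc, boolean reverse) {
        Integer[] docs = new Integer[valuesPerDoc.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Comparator<Integer> byKey = Comparator.comparingLong(
            d -> reverse
                // descending sort: compare on the maximum value (negated
                // so the natural ascending comparator yields max-first)
                ? -Arrays.stream(valuesPerDoc[d]).max().getAsLong()
                // ascending sort: compare on the minimum value
                : Arrays.stream(valuesPerDoc[d]).min().getAsLong());
        Arrays.sort(docs, byKey);
        return docs;
    }
}
```

With docs {5, 9} and {2, 7}, the ascending order is doc 1 then doc 0 (min keys 2 < 5), while the descending order is doc 0 then doc 1 (max keys 9 > 7) — no custom comparator needed, matching the proposal.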
[jira] [Commented] (LUCENE-7484) FastVectorHighlighter fails to highlight SynonymQuery
[ https://issues.apache.org/jira/browse/LUCENE-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562279#comment-15562279 ] Ferenczi Jim commented on LUCENE-7484: -- Thanks [~mikemccand]. That was fast! > FastVectorHighlighter fails to highlight SynonymQuery > - > > Key: LUCENE-7484 > URL: https://issues.apache.org/jira/browse/LUCENE-7484 > Project: Lucene - Core > Issue Type: Bug > Components: core/termvectors >Affects Versions: 6.x, master (7.0) >Reporter: Ferenczi Jim > Fix For: master (7.0), 6.3 > > Attachments: LUCENE-7484.patch > > > SynonymQuery are ignored by the FastVectorHighlighter.
[jira] [Updated] (LUCENE-7484) FastVectorHighlighter fails to highlight SynonymQuery
[ https://issues.apache.org/jira/browse/LUCENE-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7484: - Attachment: LUCENE-7484.patch > FastVectorHighlighter fails to highlight SynonymQuery > - > > Key: LUCENE-7484 > URL: https://issues.apache.org/jira/browse/LUCENE-7484 > Project: Lucene - Core > Issue Type: Bug > Components: core/termvectors >Affects Versions: 6.x, master (7.0) >Reporter: Ferenczi Jim > Attachments: LUCENE-7484.patch > > > SynonymQuery are ignored by the FastVectorHighlighter.
[jira] [Created] (LUCENE-7484) FastVectorHighlighter fails to highlight SynonymQuery
Ferenczi Jim created LUCENE-7484: Summary: FastVectorHighlighter fails to highlight SynonymQuery Key: LUCENE-7484 URL: https://issues.apache.org/jira/browse/LUCENE-7484 Project: Lucene - Core Issue Type: Bug Components: core/termvectors Affects Versions: 6.x, master (7.0) Reporter: Ferenczi Jim SynonymQuery is ignored by the FastVectorHighlighter.
[jira] [Closed] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim closed LUCENE-7423. Resolution: Not A Problem > AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on > text fields. > --- > > Key: LUCENE-7423 > URL: https://issues.apache.org/jira/browse/LUCENE-7423 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/sandbox >Reporter: Ferenczi Jim >Priority: Minor > Attachments: LUCENE-7423.patch > > > The autoprefix terms dict added in > https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with > https://issues.apache.org/jira/browse/LUCENE-7317. > The new points API is now used to do efficient range queries but the > replacement for prefix string queries is unclear. The edge ngrams could be > used instead but they have a lot of drawbacks and are hard to configure > correctly. The completion postings format is also a good replacement but it > requires a big FST in RAM and it cannot be intersected with other > fields. > This patch is a proposal for a new PostingsFormat optimized for prefix queries > on string fields. It detects prefixes that match "enough" terms and writes > auto-prefix terms into their own virtual field. > At search time the virtual field is used to speed up prefix queries that > match "enough" terms. > The auto-prefix terms are built in two passes: > * The first pass builds a compact prefix tree. Since the terms enum is sorted, > the prefixes are flushed on the fly depending on the input. For each prefix > we build its corresponding inverted list using a DocIdSetBuilder. The first > pass visits each term of the field's TermsEnum only once. When a prefix is > flushed from the prefix tree its inverted list is dumped into a temporary > file for further use. This is necessary since the prefixes are not sorted > when they are removed from the tree. The selected auto-prefixes are sorted at > the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is > used to read the corresponding inverted lists. > The patch is just a POC and there is room for optimization, but the first > results are promising: > I tested the patch with the geonames dataset. I indexed all the titles with > the KeywordAnalyzer and compared the index/merge time and the size of the > indices. > The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes > 572M on disk and it took 130s to index and optimize the 11M titles. > The auto prefix index takes 287M on disk and took 70s to index and optimize > the same 11M titles. Among the 287M, only 170M are used for the auto prefix > fields and the rest is for the regular keyword field. All the auto prefixes > were generated for this test (at least 2 terms per auto-prefix). > The queries have similar performance since we are sure on both sides that one > inverted list can answer any prefix query.
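The key observation in the first pass is that a sorted terms enum keeps all terms sharing a prefix adjacent, so one linear scan can count how many terms each prefix covers and keep only prefixes that match "enough" terms. A minimal pure-Java sketch of that selection for fixed-length prefixes (hypothetical names; the real patch uses a compact prefix tree over variable-length prefixes and builds the inverted lists alongside):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of auto-prefix selection over a sorted terms list:
// since terms are sorted, terms with a common prefix form a contiguous
// run, so a single pass suffices to count each prefix's term count.
final class PrefixSelect {

    /** Keeps prefixes of length prefixLen covering >= minTerms terms. */
    static List<String> selectPrefixes(String[] sortedTerms,
                                       int prefixLen, int minTerms) {
        List<String> selected = new ArrayList<>();
        int i = 0;
        while (i < sortedTerms.length) {
            if (sortedTerms[i].length() < prefixLen) { i++; continue; }
            String prefix = sortedTerms[i].substring(0, prefixLen);
            int j = i;
            // advance j to the end of the contiguous run for this prefix
            while (j < sortedTerms.length
                   && sortedTerms[j].startsWith(prefix)) j++;
            if (j - i >= minTerms) selected.add(prefix);
            i = j;
        }
        return selected;
    }
}
```

For the sorted terms {"ant", "apple", "apply", "apt", "bee"} with a minimum of 2 terms per prefix, only "ap" qualifies (3 terms), mirroring the issue's "at least 2 terms per auto-prefix" setting.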
[jira] [Comment Edited] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434935#comment-15434935 ] Ferenczi Jim edited comment on LUCENE-7423 at 8/25/16 9:05 AM: --- (edited since the results of the autoprefix were wrong due to a bug in the code to generate the prefixes) I've added a small benchmark AutoPrefixPerf.java (modified from [~mikemccand] utils). For the benchmark I used the english wikipedia title and a standard analyzer: {panel:title=Standard analyzer|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} A single field in this test: * "field": standard analyzer {noformat} Indexed 1260: 33.756 sec Final Indexed 12696047: 33.9 sec Optimize... After force merge: 37.794 sec Close... After close: 37.798 sec Done CheckIndex: Segments file=segments_1 numSegments=1 version=7.0.0 id=ex11gzoft89z21le5c93bpett 1 of 1: name=_j maxDoc=12696047 version=7.0.0 id=ex11gzoft89z21le5c93bpets codec=Lucene62 compound=false numFiles=7 size (MB)=78.562 diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472043738648} no deletions test: open reader.OK [took 0.002 sec] test: check integrity.OK [took 0.046 sec] test: check live docs.OK [took 0.000 sec] test: field infos.OK [1 fields] [took 0.000 sec] test: field norms.OK [0 fields] [took 0.000 sec] test: terms, freq, prox...OK [2513966 terms; 34713220 terms/docs pairs; 0 tokens] [took 2.321 sec] field "field": index FST: 699982 bytes terms: 2513966 terms 20843092 bytes (8.3 bytes/term) blocks: 80953 blocks 59384 terms-only blocks 10 sub-block-only blocks 21559 mixed blocks 18273 floor blocks 25611 non-floor blocks 55342 floor sub-blocks 13294379 term suffix bytes (164.2 suffix-bytes/block) 2538232 term stats bytes (31.4 
stats-bytes/block) 8829391 other bytes (109.1 other-bytes/block) by prefix length: 0: 5 1: 421 2: 5620 3: 18794 4: 31598 5: 16630 6: 5322 7: 1709 8: 443 9: 138 10: 249 11: 14 12: 2 13: 6 14: 2 test: stored fields...OK [0 total field count; avg 0.0 fields per doc] [took 0.257 sec] test: term vectorsOK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec] test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec] test: points..OK [0 fields, 0 points] [took 0.000 sec] detailed segment RAM usage: _j(7.0.0):C12696047: 741.9 KB |-- postings [PerFieldPostings(segment=_j formats=1)]: 683.8 KB |-- format 'Lucene50_0' [BlockTreeTermsReader(fields=1,delegate=Lucene50PostingsReader(positions=false,payloads=false))]: 683.8 KB |-- field 'field' [BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 683.7 KB |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 32 bytes |-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 58.1 KB |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 58.1 KB |-- doc base deltas: 29.1 KB |-- start pointer deltas: 26.6 KB No problems were detected with this index. {noformat} {panel} -{panel:title=EdgeNgram analyzer min=2 max=5 |borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} Two fields for this test: * "field": standard analyzer * field-edge: edge ngram analyzer (min=2, max=5) on top of a standard analyzer. {noformat} Indexed 1260: 70.831 sec Final Indexed 12696047: 71.484 sec Optimize... After force merge: 80.344 sec Close... 
After close: 80.347 sec Done CheckIndex: Segments file=segments_1 numSegments=1 version=7.0.0 id=8bm8xy2peb5wo3td0ptgwv036 1 of 1: name=_19 maxDoc=12696047 version=7.0.0 id=8bm8xy2peb5wo3td0ptgwv035 codec=Lucene62 compound=false numFiles=7 size (MB)=224.803 diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, source=merge, mergeFactor=15, os.version=10.11.4, timestamp=1472044255056} no deletions
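For reference, the "field-edge" field in the run above expands every token into its leading substrings. The sketch below is plain Java, not Lucene's actual EdgeNGramTokenFilter API; it only illustrates the expansion that lets a prefix query whose prefix length falls in [min, max] become a single exact term lookup on the ngram field, at the cost of the larger index measured above:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNgramSketch {
    // Expand one token into its edge n-grams of length min..max, the same
    // expansion an edge ngram analyzer (min=2, max=5) performs per token.
    // A prefix query such as par* then turns into an exact lookup of the
    // term "par" on the ngram field instead of a multi-term scan.
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "paris" -> [pa, par, pari, paris]
        System.out.println(edgeNgrams("paris", 2, 5));
    }
}
```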
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423: - Attachment: LUCENE-7423.patch I fixed another bug in the prefixes creation. Most of the prefixes were missing in my last patch so the latest result shows a completely different trend. Sorry for the noise [~rcmuir] , you were right, the 2-5 edge ngram is competitive and it beats the autoprefix by a large margin. Here are the raw results actualized with the latest patch: {panel:title=AutoPrefix minPrefixTerms=2|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE} {noformat} Indexed 1260: 60.369 sec Final Indexed 12696047: 60.589 sec Optimize... After force merge: 87.115 sec Close... After close: 87.121 sec Done CheckIndex: Segments file=segments_1 numSegments=1 version=7.0.0 id=2jb0oyddk8jizhc2jin5e5vwf 1 of 1: name=_j maxDoc=12696047 version=7.0.0 id=2jb0oyddk8jizhc2jin5e5vwe codec=Lucene62 compound=false numFiles=7 size (MB)=300.525 diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472114132102} no deletions test: open reader.OK [took 0.002 sec] test: check integrity.OK [took 0.172 sec] test: check live docs.OK [took 0.002 sec] test: field infos.OK [2 fields] [took 0.001 sec] test: field norms.OK [0 fields] [took 0.001 sec] test: terms, freq, prox...OK [3928482 terms; 31646 terms/docs pairs; 0 tokens] [took 5.478 sec] field "field-autoprefix": index FST: 401257 bytes terms: 1414516 terms 10333460 bytes (7.3 bytes/term) blocks: 45642 blocks 33484 terms-only blocks 4 sub-block-only blocks 12154 mixed blocks 10382 floor blocks 14302 non-floor blocks 31340 floor sub-blocks 6305776 term suffix bytes (138.2 suffix-bytes/block) 1484556 term stats bytes (32.5 stats-bytes/block) 3661060 other 
bytes (80.2 other-bytes/block) by prefix length: 0: 6 1: 366 2: 3205 3: 13054 4: 17849 5: 7613 6: 2466 7: 752 8: 202 9: 62 10: 58 11: 7 13: 1 14: 1 field "field": index FST: 699971 bytes terms: 2513966 terms 20843092 bytes (8.3 bytes/term) blocks: 80953 blocks 59384 terms-only blocks 10 sub-block-only blocks 21559 mixed blocks 18273 floor blocks 25611 non-floor blocks 55342 floor sub-blocks 13294381 term suffix bytes (164.2 suffix-bytes/block) 2538232 term stats bytes (31.4 stats-bytes/block) 8839046 other bytes (109.2 other-bytes/block) by prefix length: 0: 5 1: 421 2: 5620 3: 18794 4: 31598 5: 16630 6: 5322 7: 1709 8: 443 9: 138 10: 249 11: 14 12: 2 13: 6 14: 2 test: stored fields...OK [0 total field count; avg 0.0 fields per doc] [took 0.304 sec] test: term vectorsOK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.002 sec] test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.001 sec] test: points..OK [0 fields, 0 points] [took 0.000 sec] detailed segment RAM usage: _j(7.0.0):C12696047: 1.1 MB |-- postings [PerFieldPostings(segment=_j formats=1)]: 1.1 MB |-- format 'AutoPrefix_0' [BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]: 1.1 MB |-- field 'field' [BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 683.7 KB |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB |-- field 'field-autoprefix' [BlockTreeTerms(terms=1414516,postings=187518426,positions=-1,docs=12682564)]: 392 KB |-- term index [FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 391.9 KB |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 32 bytes |-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 61.9 KB |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 61.9 KB |-- doc base deltas: 30.5 KB |-- start pointer deltas: 29
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423:
- Attachment: (was: LUCENE-7423.patch)

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/sandbox
> Reporter: Ferenczi Jim
> Priority: Minor
>
> The autoprefix terms dict added in https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries, but the replacement for prefix queries on strings is unclear. Edge ngrams could be used instead, but they have many drawbacks and are hard to configure correctly. The completion postings format is also a possible replacement, but it requires a big FST in RAM and cannot be intersected with other fields.
> This patch is a proposal for a new PostingsFormat optimized for prefix queries on string fields. It detects prefixes that match "enough" terms and writes auto-prefix terms into their own virtual field.
> At search time the virtual field is used to speed up prefix queries that match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted, the prefixes are flushed on the fly as the input is consumed. For each prefix we build its corresponding inverted list using a DocIdSetBuilder. The first pass visits each term of the field's TermsEnum only once. When a prefix is flushed from the prefix tree, its inverted list is dumped into a temporary file for later use. This is necessary since the prefixes are not sorted when they are removed from the tree. The selected auto-prefixes are sorted at the end of the first pass.
> * The second pass is a sorted scan of the prefixes; the temporary file is used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization, but the first results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with the KeywordAnalyzer and compared the index/merge time and the size of the indices.
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 572M on disk, and it took 130s to index and optimize the 11M titles.
> The auto-prefix index takes 287M on disk and took 70s to index and optimize the same 11M titles. Of the 287M, only 170M are used for the auto-prefix field; the rest is for the regular keyword field. All the auto-prefixes were generated for this test (at least 2 terms per auto-prefix).
> The queries have similar performance since we are sure on both sides that one inverted list can answer any prefix query.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
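The first pass described above can be sketched in a few lines. This is an illustrative simplification, not the patch's code: the names (selectPrefixes, minTermsInPrefix) are hypothetical, terms are counted only toward their proper prefixes, and the real first pass also buffers each flushed prefix's inverted list (built with a DocIdSetBuilder) in a temporary file, which is omitted here:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class AutoPrefixSketch {
    // First-pass sketch of auto-prefix selection: walk the (sorted) terms of
    // a field once, count how many terms each proper prefix covers, and keep
    // the prefixes covering at least minTermsInPrefix terms. The second pass
    // of the patch is then a sorted scan over the selected prefixes.
    static SortedMap<String, Integer> selectPrefixes(List<String> sortedTerms, int minTermsInPrefix) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : sortedTerms) {
            // Each term is visited exactly once; it increments every proper
            // prefix of itself.
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        SortedMap<String, Integer> selected = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTermsInPrefix) {
                selected.put(e.getKey(), e.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aa", "ab", "abc", "b");
        // Under this proper-prefix counting, "a" covers 3 terms while "ab"
        // covers only "abc", so with minTermsInPrefix=2 only "a" is selected.
        System.out.println(selectPrefixes(terms, 2)); // {a=3}
    }
}
```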
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423:
- Attachment: LUCENE-7423.patch
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423:
- Attachment: (was: LUCENE-7423.patch)
[jira] [Comment Edited] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434935#comment-15434935 ] Ferenczi Jim edited comment on LUCENE-7423 at 8/24/16 1:32 PM:
---
Another iteration. I fixed the prefix selection (the term "aa" should not increment the number of terms counted for the term "a"). This reduces the index size greatly. I've added a small benchmark, AutoPrefixPerf.java (modified from [~mikemccand]'s utils). For the benchmark I used the English Wikipedia titles and a standard analyzer.
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423:
- Attachment: LUCENE-7423.patch
[jira] [Commented] (LUCENE-7317) Remove auto prefix terms
[ https://issues.apache.org/jira/browse/LUCENE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433832#comment-15433832 ] Ferenczi Jim commented on LUCENE-7317:
--
Sorry for the late reply. Yep, min=1/max=2B is not a reasonable setting, but I have similar results with min=1/max=20, so I think it is worth investigating. I opened https://issues.apache.org/jira/browse/LUCENE-7423, which re-implements the auto-prefix in a new PostingsFormat that builds the prefixes in two passes like the previous implementation. The nice thing is that it avoids the combinatorial explosion that affected the previous implementation, where we needed to visit all the matching terms for each prefix.

> Remove auto prefix terms
> ---
>
> Key: LUCENE-7317
> URL: https://issues.apache.org/jira/browse/LUCENE-7317
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Priority: Minor
> Fix For: master (7.0), 6.2
> Attachments: LUCENE-7317.patch
>
> This was mostly superseded by the new points API so should we remove auto-prefix terms?
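The cost difference described above can be made concrete with a toy sketch (not Lucene code): with the old approach each term is visited once per prefix that matches it, while a single pass over the sorted terms enum visits each term exactly once.

```java
import java.util.Arrays;
import java.util.List;

public class PrefixVisitCost {
    // Old approach: for each selected prefix, visit every term matching it.
    // The returned count is the total number of matching-term visits.
    static long perPrefixScanVisits(List<String> terms, List<String> prefixes) {
        long visits = 0;
        for (String p : prefixes) {
            for (String t : terms) {
                if (t.startsWith(p)) visits++;
            }
        }
        return visits;
    }

    // New approach: one pass over the sorted terms enum, each term visited once
    // regardless of how many prefixes it contributes to.
    static long singlePassVisits(List<String> terms) {
        return terms.size();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("aa", "aab", "ab", "abc", "abd", "b");
        List<String> prefixes = Arrays.asList("a", "aa", "ab");
        System.out.println(perPrefixScanVisits(terms, prefixes)); // 5 + 2 + 3 = 10
        System.out.println(singlePassVisits(terms));              // 6
    }
}
```

With many overlapping prefixes the per-prefix visit count grows with the sum of matching terms over all prefixes, while the single pass stays linear in the number of terms.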
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423:
- Attachment: LUCENE-7423.patch
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423: - Attachment: (was: LUCENE-7423.patch)
[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
[ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7423: - Attachment: LUCENE-7423.patch
[jira] [Created] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
Ferenczi Jim created LUCENE-7423: Summary: AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields. Key: LUCENE-7423 URL: https://issues.apache.org/jira/browse/LUCENE-7423 Project: Lucene - Core Issue Type: New Feature Components: modules/sandbox Reporter: Ferenczi Jim Priority: Minor The autoprefix terms dict added in https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with https://issues.apache.org/jira/browse/LUCENE-7317. The new points API is now used to do efficient range queries but the replacement for prefix string queries is unclear. The edge ngrams could be used instead but they have a lot of drawbacks and are hard to configure correctly. The completion postings format is also a good replacement but it requires a big FST in RAM and it cannot be intersected with other fields. This patch is a proposal for a new PostingsFormat optimized for prefix queries on string fields. It detects prefixes that match "enough" terms and writes auto-prefix terms into their own virtual field. At search time the virtual field is used to speed up prefix queries that match "enough" terms. The auto-prefix terms are built in two passes: * The first pass builds a compact prefix tree. Since the terms enum is sorted, the prefixes are flushed on the fly depending on the input. For each prefix we build its corresponding inverted list using a DocIdSetBuilder. The first pass visits each term of the field TermsEnum only once. When a prefix is flushed from the prefix tree its inverted list is dumped into a temporary file for further use. This is necessary since the prefixes are not sorted when they are removed from the tree. The selected auto-prefixes are sorted at the end of the first pass. * The second pass is a sorted scan of the prefixes, and the temporary file is used to read the corresponding inverted lists. 
The patch is just a POC and there is room for optimization, but the first results are promising: I tested the patch with the geonames dataset. I indexed all the titles with the KeywordAnalyzer and compared the index/merge time and the size of the indices. The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 572M on disk and it took 130s to index and optimize the 11M titles. The auto-prefix index takes 287M on disk and took 70s to index and optimize the same 11M titles. Among the 287M, only 170M are used for the auto-prefix fields and the rest is for the regular keyword field. All the auto-prefixes were generated for this test (at least 2 terms per auto-prefix). The queries have similar performance since we are sure on both sides that one inverted list can answer any prefix query.
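The two-pass construction described in this issue can be illustrated with a small self-contained sketch. This is a hypothetical simplification, not the actual patch: the real code consumes Lucene's TermsEnum and builds postings with a DocIdSetBuilder, flushing prefixes incrementally from a compact tree, while this version simply counts prefixes over an already-sorted term list. `MIN_TERMS` stands in for the "enough terms" threshold and `AutoPrefixSketch` is an invented name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: select every prefix shared by at least MIN_TERMS terms
// of a sorted term list, and return them in sorted order (mirroring the sort
// at the end of the first pass described in the issue).
public class AutoPrefixSketch {
    static final int MIN_TERMS = 2; // minimum terms an auto-prefix must cover

    static Map<String, Integer> buildAutoPrefixes(List<String> sortedTerms) {
        // Count how many terms each prefix covers.
        Map<String, Integer> counts = new HashMap<>();
        for (String term : sortedTerms) {
            for (int len = 1; len <= term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        // Keep only prefixes that match "enough" terms; TreeMap sorts them,
        // like the sorted scan of the second pass.
        Map<String, Integer> selected = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= MIN_TERMS) {
                selected.put(e.getKey(), e.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("bar", "barn", "bars", "foo");
        System.out.println(buildAutoPrefixes(terms)); // prints {b=3, ba=3, bar=3}
    }
}
```

With the threshold at 2, only the prefixes "b", "ba", and "bar" cover enough terms; "foo" produces no auto-prefix, so a prefix query on it would fall back to the regular terms.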
[jira] [Commented] (LUCENE-7317) Remove auto prefix terms
[ https://issues.apache.org/jira/browse/LUCENE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426601#comment-15426601 ] Ferenczi Jim commented on LUCENE-7317: -- I wanted to see what we're losing with the removal of the AutoPrefix so I ran a small test with English Wikipedia titles. I indexed the 12M titles in three indices: * *default*: keyword analyzer and the default postings format * *auto_prefix*: keyword analyzer and the AutoPrefixPostings format with minAutoPrefix=24, maxAutoPrefix=Integer.MAX * *edge*: edge ngram analyzer with minGram=1,maxGram=Integer.MAX and the default postings format. ||index||default||auto_prefix||edge|| ||size in MB||231MB||274MB||1600MB|| This table shows the size that each index takes on disk. As you can see the auto_prefix index is very close in size to the default one even though we compute all the prefixes with more than 24 terms. Compared to the edge ngram index, which multiplies the index size by a factor of 7, the auto prefix seems to be a good trade-off for fields where prefix queries are the norm. I didn't compare the query time but any prefix with more than 24 terms can be resolved by one inverted list in the auto_prefix index so it is equivalent to the edge ngram index. The downside of the auto_prefix index seems to be the merge: it takes more than 1 minute to optimize, which is 10 times slower than the default index. Though this is expected since the default index uses a keyword analyzer. I understand that the new points API is better for numeric prefix/range queries but the auto prefix seems to be a good fit for prefix string queries. It saves a lot of space compared to edge ngrams and indexing is faster. I am not saying we should restore the functionality inside the default BlockTreeTerms but maybe we could create a separate postings format that exposes this feature? 
> Remove auto prefix terms > > > Key: LUCENE-7317 > URL: https://issues.apache.org/jira/browse/LUCENE-7317 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Fix For: master (7.0), 6.2 > > Attachments: LUCENE-7317.patch > > > This was mostly superseded by the new points API so should we remove > auto-prefix terms?
[jira] [Comment Edited] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query
[ https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339090#comment-15339090 ] Ferenczi Jim edited comment on LUCENE-7337 at 6/20/16 7:50 AM: --- Wooo thanks [~mikemccand] ??I think getting proper distributed queries working is really out of scope here: that would really require a distributed rewrite to work correctly.?? Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway and I think it's more important to make the empty-clause boolean query behave exactly the same as MatchNoDocsQuery. was (Author: jim.ferenczi): Wooo thanks [~mikemccand] ??I think getting proper distributed queries working is really out of scope here: that would really require a distributed rewrite to work correctly.?? Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway and I think it's more important to make the empty-clause boolean query behave exactly the same as MatchNoDocsQuery. > MultiTermQuery are sometimes rewritten into an empty boolean query > -- > > Key: LUCENE-7337 > URL: https://issues.apache.org/jira/browse/LUCENE-7337 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Reporter: Ferenczi Jim >Priority: Minor > Attachments: LUCENE-7337.patch > > > MultiTermQuery is sometimes rewritten to an empty boolean query (depending > on the rewrite method); it can happen when no expansions are found for a fuzzy > query, for instance. > It can be problematic when the multi term query is boosted. > For instance consider the following query: > `((title:bar~1)^100 text:bar)` > This is a boolean query with two optional clauses. The first one is a fuzzy > query on the field title with a boost of 100. > If there is no expansion for "title:bar~1" the query is rewritten into: > `(()^100 text:bar)` > ... 
and when expansions are found: > `((title:bars | title:bar)^100 text:bar)` > The scoring of those two queries will differ because the normalization factor > and the norm for the first query will be equal to 1 (the boost is ignored > because the empty boolean query is not taken into account for the computation > of the normalization factor) whereas the second query will have a > normalization factor of 10,000 (100*100) and a norm equal to 0.01. > This kind of discrepancy can happen in a single index because the expansions > for the fuzzy query are done at the segment level. It can also happen when > multiple indices are requested (the Solr/Elasticsearch case). > A simple fix would be to replace the empty boolean query produced by the > multi term query with a MatchNoDocsQuery but I am not sure that it's the best > way to fix it. WDYT?
[jira] [Comment Edited] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query
[ https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339090#comment-15339090 ] Ferenczi Jim edited comment on LUCENE-7337 at 6/20/16 7:47 AM: --- Wooo thanks [~mikemccand] ??I think getting proper distributed queries working is really out of scope here: that would really require a distributed rewrite to work correctly.?? Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway and I think it's more important to make the empty-clause boolean query behave exactly the same as MatchNoDocsQuery. was (Author: jim.ferenczi): Wooo thanks [~mikemccand] ?I think getting proper distributed queries working is really out of scope here: that would really require a distributed rewrite to work correctly.? Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway and I think it's more important to make the empty-clause boolean query behave exactly the same as MatchNoDocsQuery.
[jira] [Commented] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query
[ https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339090#comment-15339090 ] Ferenczi Jim commented on LUCENE-7337: -- Wooo thanks [~mikemccand] ?I think getting proper distributed queries working is really out of scope here: that would really require a distributed rewrite to work correctly.? Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway and I think it's more important to make the empty-clause boolean query behave exactly the same as MatchNoDocsQuery.
[jira] [Commented] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query
[ https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15329457#comment-15329457 ] Ferenczi Jim commented on LUCENE-7337: -- ??A simple fix would be to replace the empty boolean query produced by the multi term query with a MatchNoDocsQuery but I am not sure that it's the best way to fix.?? I am not sure of this statement anymore. Conceptually a MatchNoDocsQuery and a BooleanQuery with no clauses are similar. Though what I proposed assumed that the value for normalization of the MatchNoDocsQuery is 1. I think that doing this would bring confusion since this value is supposed to reflect the max score that the query can get (which is 0 in this case). Currently a boolean query or a disjunction query with no clauses returns 0 for the normalization. I think that is the expected behavior even though it breaks the distributed case as explained in my previous comment. For empty queries that are the result of an expansion (multi term query) maybe we could add yet another special query, something like MatchNoExpansionQuery, that would use a ConstantScoreWeight? I am proposing this because it would make the distinction between a query that matches no documents no matter what the context is and a query that matches no documents because of the context (useful for the distributed case).
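The normalization discrepancy debated in this thread comes down to simple arithmetic. A minimal sketch, assuming the classic Lucene TF-IDF scheme where each boosted clause contributes boost^2 to the value for normalization and queryNorm = 1 / sqrt(sum); this is an illustration, not the actual BooleanWeight or ConstantScoreWeight code, and `QueryNormSketch` is an invented name:

```java
// Arithmetic sketch of the queryNorm discrepancy: with an expanded fuzzy
// clause the boost of 100 contributes 100*100 = 10,000 to the normalization,
// giving a norm of 0.01; with no expansion the empty boolean query is ignored
// and the boost drops out entirely.
public class QueryNormSketch {
    // A zero sum means no scoring clause contributed; fall back to 1.
    static double queryNorm(double sumOfSquaredWeights) {
        return sumOfSquaredWeights == 0 ? 1.0 : 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        // Expansions found: ((title:bars | title:bar)^100 text:bar)
        System.out.println(queryNorm(100.0 * 100.0)); // prints 0.01

        // No expansions: (()^100 text:bar) -- the boosted clause vanishes
        System.out.println(queryNorm(0.0));           // prints 1.0
    }
}
```

The two results, 0.01 versus 1.0, are exactly the norms quoted in the issue description, which is why the same query can score differently across shards that expand the fuzzy term differently.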
[jira] [Comment Edited] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery
[ https://issues.apache.org/jira/browse/LUCENE-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327946#comment-15327946 ] Ferenczi Jim edited comment on LUCENE-7276 at 6/13/16 7:02 PM: --- ??Somehow the test is angry that the rewritten query scores differently from the original ... so somehow the fact that we no longer rewrite to an empty BQ is changing something ... I'll dig.?? I tried to find a reason and I think I found something interesting. The change is related to the normalization factor and the fact that those queries are boosted. When you use a boolean query with no clause the normalization factor is 0, when the matchnodocs query is used the normalization factor is 1 (BooleanWeight.getValueForNormalization and ConstantScoreWeight.getValueForNormalization). This part of the query is supposed to return no documents so it should be ok to ignore it when the query norm is computed. Though for the distributed case where results are merged from different shards there is no guarantee that the rewrite will be the same among the shards. I think we can get rid of the matchnodocsquery vs empty boolean query difference if we change the return value of BooleanWeight.getValueForNormalization to be 1 (instead of 0) when there is no clause. https://issues.apache.org/jira/browse/LUCENE-7337 was (Author: jim.ferenczi): ?? Somehow the test is angry that the rewritten query scores differently from the original ... so somehow the fact that we no longer rewrite to an empty BQ is changing something ... I'll dig. ?? I tried to find a reason and I think I found something interesting. The change is related to the normalization factor and the fact that those queries are boosted. When you use a boolean query with no clause the normalization factor is 0, when the matchnodocs query is used the normalization factor is 1 (BooleanWeight.getValueForNormalization and ConstantScoreWeight.getValueForNormalization). 
This part of the query is supposed to return no documents so it should be ok to ignore it when the query norm is computed. Though for the distributed case where results are merged from different shards there is no guarantee that the rewrite will be the same among the shards. I think we can get rid of the matchnodocsquery vs empty boolean query difference if we change the return value of BooleanWeight.getValueForNormalization to be 1 (instead of 0) when there is no clause. https://issues.apache.org/jira/browse/LUCENE-7337 > Add an optional reason to the MatchNoDocsQuery > -- > > Key: LUCENE-7276 > URL: https://issues.apache.org/jira/browse/LUCENE-7276 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Ferenczi Jim >Priority: Minor > Labels: patch > Attachments: LUCENE-7276.patch, LUCENE-7276.patch, LUCENE-7276.patch, > LUCENE-7276.patch, LUCENE-7276.patch > > > It's sometimes difficult to debug a query that results in a MatchNoDocsQuery. > The MatchNoDocsQuery is always rewritten into an empty boolean query. > This patch adds an optional reason and implements a weight in order to keep > track of the reason why the query did not match any document. The reason is > printed on toString and when an explanation for no match is asked. > For instance the query: > new MatchNoDocsQuery("field 'title' not found").toString() > => 'MatchNoDocsQuery["field 'title' not found"]'
[jira] [Commented] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery
[ https://issues.apache.org/jira/browse/LUCENE-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327946#comment-15327946 ] Ferenczi Jim commented on LUCENE-7276: -- ?? Somehow the test is angry that the rewritten query scores differently from the original ... so somehow the fact that we no longer rewrite to an empty BQ is changing something ... I'll dig. ?? I tried to find a reason and I think I found something interesting. The change is related to the normalization factor and the fact that those queries are boosted. When you use a boolean query with no clause the normalization factor is 0, when the matchnodocs query is used the normalization factor is 1 (BooleanWeight.getValueForNormalization and ConstantScoreWeight.getValueForNormalization). This part of the query is supposed to return no documents so it should be ok to ignore it when the query norm is computed. Though for the distributed case where results are merged from different shards there is no guarantee that the rewrite will be the same among the shards. I think we can get rid of the matchnodocsquery vs empty boolean query difference if we change the return value of BooleanWeight.getValueForNormalization to be 1 (instead of 0) when there is no clause. https://issues.apache.org/jira/browse/LUCENE-7337
[jira] [Created] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query
Ferenczi Jim created LUCENE-7337: Summary: MultiTermQuery are sometimes rewritten into an empty boolean query Key: LUCENE-7337 URL: https://issues.apache.org/jira/browse/LUCENE-7337 Project: Lucene - Core Issue Type: Bug Components: core/search Reporter: Ferenczi Jim Priority: Minor MultiTermQuery is sometimes rewritten to an empty boolean query (depending on the rewrite method); it can happen when no expansions are found for a fuzzy query, for instance. It can be problematic when the multi term query is boosted. For instance consider the following query: `((title:bar~1)^100 text:bar)` This is a boolean query with two optional clauses. The first one is a fuzzy query on the field title with a boost of 100. If there is no expansion for "title:bar~1" the query is rewritten into: `(()^100 text:bar)` ... and when expansions are found: `((title:bars | title:bar)^100 text:bar)` The scoring of those two queries will differ because the normalization factor and the norm for the first query will be equal to 1 (the boost is ignored because the empty boolean query is not taken into account for the computation of the normalization factor) whereas the second query will have a normalization factor of 10,000 (100*100) and a norm equal to 0.01. This kind of discrepancy can happen in a single index because the expansions for the fuzzy query are done at the segment level. It can also happen when multiple indices are requested (the Solr/Elasticsearch case). A simple fix would be to replace the empty boolean query produced by the multi term query with a MatchNoDocsQuery but I am not sure that it's the best way to fix it. WDYT?
[jira] [Updated] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery
[ https://issues.apache.org/jira/browse/LUCENE-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-7276: - Attachment: LUCENE-7276.patch Patch available.
[jira] [Created] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery
Ferenczi Jim created LUCENE-7276: Summary: Add an optional reason to the MatchNoDocsQuery Key: LUCENE-7276 URL: https://issues.apache.org/jira/browse/LUCENE-7276 Project: Lucene - Core Issue Type: Improvement Components: core/search Reporter: Ferenczi Jim Priority: Minor It's sometimes difficult to debug a query that results in a MatchNoDocsQuery. The MatchNoDocsQuery is always rewritten into an empty boolean query. This patch adds an optional reason and implements a weight in order to keep track of the reason why the query did not match any document. The reason is printed by toString and when an explanation for a non-match is requested. For instance the query: new MatchNoDocsQuery("Field not found").toString() => 'MatchNoDocsQuery["Field not found"]' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
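[Editor's note] The idea can be sketched with a tiny hypothetical stand-in class (plain Java, not the actual LUCENE-7276 patch): the query object simply carries the optional reason and surfaces it from toString():

```java
// Hypothetical stand-in for the patched MatchNoDocsQuery described above:
// it stores an optional reason and reports it in toString() for debugging.
public class MatchNoDocsSketch {
    private final String reason;

    public MatchNoDocsSketch(String reason) {
        // A null reason degrades to an empty string, keeping toString() safe.
        this.reason = reason == null ? "" : reason;
    }

    @Override
    public String toString() {
        return "MatchNoDocsQuery[\"" + reason + "\"]";
    }
}
```

In the real patch the same reason would also flow into the Weight's explanation, so a non-matching document explains *why* nothing matched.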
[jira] [Comment Edited] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.
[ https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105867#comment-15105867 ] Ferenczi Jim edited comment on LUCENE-6972 at 1/18/16 9:58 PM: --- Rewrote the patch after [~rcmuir] suggestion about checking coordination factor instead. [~jpountz] can you check ? was (Author: jim.ferenczi): Rewrote the patch after @rcmuir suggestion about checking coordination factor instead. [~jpountz] can you check ? > QueryBuilder should not differentiate single position and multiple positions > queries when the analyzer produces synonyms. > --- > > Key: LUCENE-6972 > URL: https://issues.apache.org/jira/browse/LUCENE-6972 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.4, 5.5 >Reporter: Ferenczi Jim >Assignee: Adrien Grand > Fix For: 5.5 > > Attachments: LUCENE-6972.patch, LUCENE-6972.patch > > > When synonyms are involved the querybuilder differentiate two cases. When > there is only one position the query is composed of one BooleanQuery which > contains multiple should clauses. This does not interact well when trying to > apply a minimum_should_match to the query. For instance if a field has a > synonym rule like "foo,bar" the query "foo" will produce: > bq. (foo bar) > ... two optional clauses at the root level. If we apply a minimum should > match of 50% then the query becomes: > bq. (foo bar)~1 > This seems wrong, the terms are at the same position. > IMO the querybuilder should produce the following query: > bq. ((foo bar)) > ... and a minimum should match of 50% should be not applicable to a query > with only one optional clause at the root level. > The case with multiple positions works as expected. > The user query "test foo" generates: > bq. (test (foo bar)) > ... and if we apply a minimum should match of 50%: > bq. 
(test (foo bar))~1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.
[ https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-6972: - Attachment: LUCENE-6972.patch Rewrote the patch after @rcmuir suggestion about checking coordination factor instead. [~jpountz] can you check ? > QueryBuilder should not differentiate single position and multiple positions > queries when the analyzer produces synonyms. > --- > > Key: LUCENE-6972 > URL: https://issues.apache.org/jira/browse/LUCENE-6972 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.4, 5.5 >Reporter: Ferenczi Jim >Assignee: Adrien Grand > Fix For: 5.5 > > Attachments: LUCENE-6972.patch, LUCENE-6972.patch > > > When synonyms are involved the querybuilder differentiate two cases. When > there is only one position the query is composed of one BooleanQuery which > contains multiple should clauses. This does not interact well when trying to > apply a minimum_should_match to the query. For instance if a field has a > synonym rule like "foo,bar" the query "foo" will produce: > bq. (foo bar) > ... two optional clauses at the root level. If we apply a minimum should > match of 50% then the query becomes: > bq. (foo bar)~1 > This seems wrong, the terms are at the same position. > IMO the querybuilder should produce the following query: > bq. ((foo bar)) > ... and a minimum should match of 50% should be not applicable to a query > with only one optional clause at the root level. > The case with multiple positions works as expected. > The user query "test foo" generates: > bq. (test (foo bar)) > ... and if we apply a minimum should match of 50%: > bq. (test (foo bar))~1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.
[ https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096811#comment-15096811 ] Ferenczi Jim commented on LUCENE-6972: -- Sorry I forgot to add a comment about merging into trunk. I don't think it is needed because the new SynonymQuery packs all the synonyms in the same query so minimum should match is not affected. Though there would be things to do (not in this issue) to handle single word synonyms that appear in a multi position query with a SynonymQuery (the analyzeMultiBoolean does not use the SynonymQuery). > QueryBuilder should not differentiate single position and multiple positions > queries when the analyzer produces synonyms. > --- > > Key: LUCENE-6972 > URL: https://issues.apache.org/jira/browse/LUCENE-6972 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.4, 5.5 >Reporter: Ferenczi Jim >Assignee: Adrien Grand > Fix For: 5.5 > > Attachments: LUCENE-6972.patch, LUCENE-6972.patch > > > When synonyms are involved the querybuilder differentiate two cases. When > there is only one position the query is composed of one BooleanQuery which > contains multiple should clauses. This does not interact well when trying to > apply a minimum_should_match to the query. For instance if a field has a > synonym rule like "foo,bar" the query "foo" will produce: > bq. (foo bar) > ... two optional clauses at the root level. If we apply a minimum should > match of 50% then the query becomes: > bq. (foo bar)~1 > This seems wrong, the terms are at the same position. > IMO the querybuilder should produce the following query: > bq. ((foo bar)) > ... and a minimum should match of 50% should be not applicable to a query > with only one optional clause at the root level. > The case with multiple positions works as expected. > The user query "test foo" generates: > bq. (test (foo bar)) > ... and if we apply a minimum should match of 50%: > bq. 
(test (foo bar))~1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.
[ https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-6972: - Lucene Fields: New,Patch Available (was: New) > QueryBuilder should not differentiate single position and multiple positions > queries when the analyzer produces synonyms. > --- > > Key: LUCENE-6972 > URL: https://issues.apache.org/jira/browse/LUCENE-6972 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.4, 5.5 >Reporter: Ferenczi Jim > Fix For: 5.5 > > Attachments: LUCENE-6972.patch > > > When synonyms are involved the querybuilder differentiate two cases. When > there is only one position the query is composed of one BooleanQuery which > contains multiple should clauses. This does not interact well when trying to > apply a minimum_should_match to the query. For instance if a field has a > synonym rule like "foo,bar" the query "foo" will produce: > bq. (foo bar) > ... two optional clauses at the root level. If we apply a minimum should > match of 50% then the query becomes: > bq. (foo bar)~1 > This seems wrong, the terms are at the same position. > IMO the querybuilder should produce the following query: > bq. ((foo bar)) > ... and a minimum should match of 50% should be not applicable to a query > with only one optional clause at the root level. > The case with multiple positions works as expected. > The user query "test foo" generates: > bq. (test (foo bar)) > ... and if we apply a minimum should match of 50%: > bq. (test (foo bar))~1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.
[ https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-6972: - Attachment: LUCENE-6972.patch > QueryBuilder should not differentiate single position and multiple positions > queries when the analyzer produces synonyms. > --- > > Key: LUCENE-6972 > URL: https://issues.apache.org/jira/browse/LUCENE-6972 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.4, 5.5 >Reporter: Ferenczi Jim > Fix For: 5.5 > > Attachments: LUCENE-6972.patch > > > When synonyms are involved the querybuilder differentiate two cases. When > there is only one position the query is composed of one BooleanQuery which > contains multiple should clauses. This does not interact well when trying to > apply a minimum_should_match to the query. For instance if a field has a > synonym rule like "foo,bar" the query "foo" will produce: > bq. (foo bar) > ... two optional clauses at the root level. If we apply a minimum should > match of 50% then the query becomes: > bq. (foo bar)~1 > This seems wrong, the terms are at the same position. > IMO the querybuilder should produce the following query: > bq. ((foo bar)) > ... and a minimum should match of 50% should be not applicable to a query > with only one optional clause at the root level. > The case with multiple positions works as expected. > The user query "test foo" generates: > bq. (test (foo bar)) > ... and if we apply a minimum should match of 50%: > bq. (test (foo bar))~1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.
[ https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenczi Jim updated LUCENE-6972: - Description: When synonyms are involved the querybuilder differentiate two cases. When there is only one position the query is composed of one BooleanQuery which contains multiple should clauses. This does not interact well when trying to apply a minimum_should_match to the query. For instance if a field has a synonym rule like "foo,bar" the query "foo" will produce: bq. (foo bar) ... two optional clauses at the root level. If we apply a minimum should match of 50% then the query becomes: bq. (foo bar)~1 This seems wrong, the terms are at the same position. IMO the querybuilder should produce the following query: bq. ((foo bar)) ... and a minimum should match of 50% should be not applicable to a query with only one optional clause at the root level. The case with multiple positions works as expected. The user query "test foo" generates: bq. (test (foo bar)) ... and if we apply a minimum should match of 50%: bq. (test (foo bar))~1 was: When synonyms are involved the querybuilder differentiate two cases. When there is only one position the query is composed of one BooleanQuery which contains multiple should clauses. This does not interact well when trying to apply a minimum_should_match to the query. For instance if a field has a synonym rule like "foo,bar" the query "foo" will produce: "(foo bar)" ... two optional clauses at the root level. If we apply a minimum should match of 50% then the query becomes: "(foo bar)~1". This seems wrong, the terms are at the same position. IMO the querybuilder should produce the following query: "((foo bar))" ... and a minimum should match of 50% should be not applicable to a query with only one optional clause at the root level. The case with multiple positions works as expected. The user query "test foo" generates: "(test (foo bar))" ... 
and if we apply a minimum should match of 50%: "(test (foo bar))~1" > QueryBuilder should not differentiate single position and multiple positions > queries when the analyzer produces synonyms. > --- > > Key: LUCENE-6972 > URL: https://issues.apache.org/jira/browse/LUCENE-6972 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.4, 5.5 >Reporter: Ferenczi Jim > Fix For: 5.5 > > > When synonyms are involved the querybuilder differentiate two cases. When > there is only one position the query is composed of one BooleanQuery which > contains multiple should clauses. This does not interact well when trying to > apply a minimum_should_match to the query. For instance if a field has a > synonym rule like "foo,bar" the query "foo" will produce: > bq. (foo bar) > ... two optional clauses at the root level. If we apply a minimum should > match of 50% then the query becomes: > bq. (foo bar)~1 > This seems wrong, the terms are at the same position. > IMO the querybuilder should produce the following query: > bq. ((foo bar)) > ... and a minimum should match of 50% should be not applicable to a query > with only one optional clause at the root level. > The case with multiple positions works as expected. > The user query "test foo" generates: > bq. (test (foo bar)) > ... and if we apply a minimum should match of 50%: > bq. (test (foo bar))~1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.
Ferenczi Jim created LUCENE-6972: Summary: QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms. Key: LUCENE-6972 URL: https://issues.apache.org/jira/browse/LUCENE-6972 Project: Lucene - Core Issue Type: Bug Affects Versions: 5.4, 5.5 Reporter: Ferenczi Jim Fix For: 5.5 When synonyms are involved the querybuilder differentiates two cases. When there is only one position, the query is composed of one BooleanQuery which contains multiple should clauses. This does not interact well with applying a minimum_should_match to the query. For instance, if a field has a synonym rule like "foo,bar" the query "foo" will produce: "(foo bar)" ... two optional clauses at the root level. If we apply a minimum should match of 50% then the query becomes: "(foo bar)~1". This seems wrong; the terms are at the same position. IMO the querybuilder should produce the following query: "((foo bar))" ... and a minimum should match of 50% should not be applicable to a query with only one optional clause at the root level. The case with multiple positions works as expected. The user query "test foo" generates: "(test (foo bar))" ... and if we apply a minimum should match of 50%: "(test (foo bar))~1" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
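[Editor's note] The arithmetic behind the complaint can be sketched in isolation. Assuming the usual Solr-style percentage rule (multiply the number of root-level optional clauses by the percentage and truncate toward zero — an assumption, not taken from this issue), 50% gives ~1 for the flat form but 0 (no constraint) for the nested form:

```java
// Sketch of percentage-based minimum-should-match over root-level optional
// clauses (Solr-style truncation assumed); illustration only, not the
// actual QueryBuilder code.
public class MinShouldMatchDemo {

    static int minShouldMatch(int optionalClauses, double percent) {
        return (int) (optionalClauses * percent); // truncates toward zero
    }

    public static void main(String[] args) {
        // Flat form "(foo bar)": two root clauses -> "(foo bar)~1", which
        // wrongly constrains two synonyms sitting at the same position.
        System.out.println(minShouldMatch(2, 0.5));

        // Nested form "((foo bar))": one root clause -> ~0, so the 50%
        // setting imposes nothing, as the report argues it should.
        System.out.println(minShouldMatch(1, 0.5));
    }
}
```

Wrapping the synonym group in its own BooleanQuery thus makes the single-position and multi-position cases behave consistently under minimum_should_match.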
[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems
[ https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390220#comment-14390220 ] Ferenczi Jim commented on SOLR-7319: Thanks [~elyograg]. We are big fans of your pages about the settings for Solr regarding the garbage collector. We changed a lot of our settings after reading your page and we are now happy with the GC performance in our setup. I guess that providing good default values for all use cases is almost impossible and that each deployment/use case would need a round of testing to find optimal values (especially for the tenuring threshold and the size of the heap). Anyway, I think that most Solr users would be happy to have default values optimized by Solr experts. For those who think they can get better performance with other settings, nothing prevents them from changing those defaults ;) My initial point was that the default options should not break any external tool accessing Solr, especially if they prevent the user from monitoring the GC with jstat. > Workaround the "Four Month Bug" causing GC pause problems > - > > Key: SOLR-7319 > URL: https://issues.apache.org/jira/browse/SOLR-7319 > Project: Solr > Issue Type: Bug > Components: scripts and tools >Affects Versions: 5.0 >Reporter: Shawn Heisey >Assignee: Shawn Heisey > Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch > > > A twitter engineer found a bug in the JVM that contributes to GC pause > problems: > http://www.evanjones.ca/jvm-mmap-pause.html > Problem summary (in case the blog post disappears): The JVM calculates > statistics on things like garbage collection and writes them to a file in the > temp directory using MMAP. If there is a lot of other MMAP write activity, > which is precisely how Lucene accomplishes indexing and merging, it can > result in a GC pause because the mmap write to the temp file is delayed. 
> We should implement the workaround in the solr start scripts (disable > creation of the mmap statistics tempfile) and document the impact in > CHANGES.txt. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems
[ https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388154#comment-14388154 ] Ferenczi Jim commented on SOLR-7319: Most of the java options in solr.in.cmd should not be activated by default. The tenuring threshold, the number of threads for the GC, and so on all depend on the type of deployment you have, the size of the heap and the machine hosting the Solr node. In my company we are using a custom script full of java options that we added over the years. Most of the options are there because somebody added them with the assertion that performance is better. Most of the time we don't know what an option is for, but nobody wants to remove it because the urban legend says it's useful. The Solr startup script should be almost empty (at least for the java options): maybe one or two options to set up the garbage collector, and that's it. > Workaround the "Four Month Bug" causing GC pause problems > - > > Key: SOLR-7319 > URL: https://issues.apache.org/jira/browse/SOLR-7319 > Project: Solr > Issue Type: Bug > Components: scripts and tools >Affects Versions: 5.0 >Reporter: Shawn Heisey >Assignee: Shawn Heisey > Fix For: 5.1 > > Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch > > > A twitter engineer found a bug in the JVM that contributes to GC pause > problems: > http://www.evanjones.ca/jvm-mmap-pause.html > Problem summary (in case the blog post disappears): The JVM calculates > statistics on things like garbage collection and writes them to a file in the > temp directory using MMAP. If there is a lot of other MMAP write activity, > which is precisely how Lucene accomplishes indexing and merging, it can > result in a GC pause because the mmap write to the temp file is delayed. > We should implement the workaround in the solr start scripts (disable > creation of the mmap statistics tempfile) and document the impact in > CHANGES.txt. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems
[ https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386786#comment-14386786 ] Ferenczi Jim commented on SOLR-7319: I am saying this because if we are not sure that Lucene is impacted we should not add this in the default options. Not being able to do a jstat on a running node is problematic and will break a lot of monitoring tools built on top of Solr. > Workaround the "Four Month Bug" causing GC pause problems > - > > Key: SOLR-7319 > URL: https://issues.apache.org/jira/browse/SOLR-7319 > Project: Solr > Issue Type: Bug > Components: scripts and tools >Affects Versions: 5.0 >Reporter: Shawn Heisey >Assignee: Shawn Heisey > Fix For: 5.1 > > Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch > > > A twitter engineer found a bug in the JVM that contributes to GC pause > problems: > http://www.evanjones.ca/jvm-mmap-pause.html > Problem summary (in case the blog post disappears): The JVM calculates > statistics on things like garbage collection and writes them to a file in the > temp directory using MMAP. If there is a lot of other MMAP write activity, > which is precisely how Lucene accomplishes indexing and merging, it can > result in a GC pause because the mmap write to the temp file is delayed. > We should implement the workaround in the solr start scripts (disable > creation of the mmap statistics tempfile) and document the impact in > CHANGES.txt. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems
[ https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386754#comment-14386754 ] Ferenczi Jim commented on SOLR-7319: "If there is a lot of other MMAP write activity, which is precisely how Lucene accomplishes indexing and merging" => Are you sure about this statement? MMapDirectory uses mmap for reads and a simple RandomAccessFile for writes. I don't know how the RandomAccessFile is implemented, but I doubt it uses mmap at all. > Workaround the "Four Month Bug" causing GC pause problems > - > > Key: SOLR-7319 > URL: https://issues.apache.org/jira/browse/SOLR-7319 > Project: Solr > Issue Type: Bug > Components: scripts and tools >Affects Versions: 5.0 >Reporter: Shawn Heisey >Assignee: Shawn Heisey > Fix For: 5.1 > > Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch > > > A twitter engineer found a bug in the JVM that contributes to GC pause > problems: > http://www.evanjones.ca/jvm-mmap-pause.html > Problem summary (in case the blog post disappears): The JVM calculates > statistics on things like garbage collection and writes them to a file in the > temp directory using MMAP. If there is a lot of other MMAP write activity, > which is precisely how Lucene accomplishes indexing and merging, it can > result in a GC pause because the mmap write to the temp file is delayed. > We should implement the workaround in the solr start scripts (disable > creation of the mmap statistics tempfile) and document the impact in > CHANGES.txt. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6606) In cloud mode the leader should distribute autoCommits to its replicas
[ https://issues.apache.org/jira/browse/SOLR-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169142#comment-14169142 ] Ferenczi Jim commented on SOLR-6606: Is it intended for partial recovery or only to distribute the downloads on a full recovery (when the replica downloads a full index from the master)? I am asking because for partial recovery you need to handle the deletes as well. For instance, if one replica missed the last commit it could download the segment from the master, but it would lose all the deletes related to this update. Keeping the list of every delete per commit seems mandatory but also very expensive, unless we can garbage collect them at some point. > In cloud mode the leader should distribute autoCommits to its replicas > --- > > Key: SOLR-6606 > URL: https://issues.apache.org/jira/browse/SOLR-6606 > Project: Solr > Issue Type: Improvement >Reporter: Varun Thacker > Fix For: 5.0, Trunk > > Attachments: SOLR-6606.patch, SOLR-6606.patch > > > Today in SolrCloud different replicas of a shard can trigger auto (hard) > commits at different times. Although the documents which get added to the > system remain consistent the way the segments gets formed can be different > because of this. > The downside of segments not getting formed in an identical fashion across > replicas is that when a replica goes into recovery chances are that it has to > do a full index replication from the leader. This is time consuming and we > can possibly avoid this if the leader forwards auto (hard) commit commands to > it's replicas and the replicas never explicitly trigger an auto (hard) commit. > I am working on a patch. Should have it up shortly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org