[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-20 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764205#comment-15764205
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

{quote}
Maybe you can work on the 6.x back port in the meantime 
{quote}

I am on it!

> Sorting on flushed segment
> --
>
> Key: LUCENE-7579
> URL: https://issues.apache.org/jira/browse/LUCENE-7579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ferenczi Jim
>
> Today, flushed segments built by an index writer with an index sort specified 
> are not sorted. The merge is responsible for sorting these segments, 
> potentially together with others that are already sorted (resulting from 
> another merge). 
> I'd like to investigate the cost of sorting the segment directly during the 
> flush. This could make merges faster, since there are some cheap 
> optimizations that can only be done if all segments to be merged are sorted.
>  For instance, the merge of points could use the bulk merge instead of 
> rebuilding the points from scratch.
> I made a small prototype which sorts the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple: for points, norms, doc values and terms I use the 
> SortingLeafReader implementation to translate the values that we have in RAM 
> into a sorted enumeration for the writers.
> For stored fields I use a two-pass scheme where the documents are first 
> written to disk unsorted and then copied to another file with the correct 
> sorting. I use the same stored fields format for the two steps and just remove 
> the file produced by the first pass at the end of the process.
> This prototype has no implementation for index sorting that uses term vectors 
> yet. I'll add it later if the tests are good enough.
> Speaking of testing, I tried this branch with [~mikemccand]'s benchmark scripts 
> and compared master with index sorting against my branch with index sorting 
> on flush. I tried with sparsetaxis and wikipedia and the first results are 
> weird. When I use the SerialScheduler and only one thread to write the docs, 
> index sorting on flush is slower. But when I use two threads the sorting on 
> flush is much faster, even with the SerialScheduler. I'll continue to run the 
> tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know 
> this part at all so I have not fixed the test yet. I don't even know if we can 
> make it work with index sorting ;).
>  [~mikemccand] I would love to have your feedback on the prototype. Could 
> you please take a look? I am sure there are plenty of bugs... but I think 
> it's a good start to evaluate the feasibility of this feature.
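
For context, a minimal sketch of the public API this issue builds on: configuring
an index sort on the writer (available since Lucene 6.2). Field names and the
index path are made up for illustration; this is not code from the patch.
{noformat}
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexSortExample {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Segments produced by this writer should be ordered by "timestamp".
    // Today that order is only guaranteed after a merge; the change discussed
    // here enforces it at flush time as well.
    config.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));

    try (Directory dir = FSDirectory.open(Paths.get("/tmp/sorted-index"));
         IndexWriter writer = new IndexWriter(dir, config)) {
      Document doc = new Document();
      doc.add(new StringField("id", "1", Field.Store.YES));
      doc.add(new NumericDocValuesField("timestamp", 42L)); // the sort field needs doc values
      writer.addDocument(doc);
    }
  }
}
{noformat}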



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-20 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764001#comment-15764001
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

Thanks [~jpountz] and [~mikemccand]!




[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-19 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15760882#comment-15760882
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

I pushed another commit that removes the specialized API for sorting a 
StoredFieldsWriter. This is now done directly in the StoredFieldsConsumer with 
a custom CopyVisitor (copied from MergeVisitor).
I've also added some asserts that check that unsorted segments were built with 
a version prior to Lucene 7.0. We'll need to change the assert when this gets 
backported to 6.x. I could not add the assert in maybeSortReaders because 
IndexWriter.addIndexes uses the merge to add indices that could be unsorted. I 
don't know whether this should be allowed, but we can revisit it later. 
Other than that I think it's ready!
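
A rough sketch of the kind of version guard described here, assuming the check
is based on the version that created the segment (the exact assert in the patch
may look different; class and method names below are illustrative):
{noformat}
import org.apache.lucene.index.SegmentReader;
import org.apache.lucene.util.Version;

final class IndexSortAsserts {
  // Pre-7.0 segments may legitimately be unsorted even when an index sort is
  // configured; anything written by 7.0+ with an index sort is expected to be
  // sorted on flush already.
  static boolean mayBeUnsorted(SegmentReader reader) {
    Version created = reader.getSegmentInfo().info.getVersion();
    return created.major < 7;
  }
}
{noformat}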
 




[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-16 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753993#comment-15753993
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

This new API is maybe a premature optimization that should not be part of this 
change. What about removing the API and rolling back to a non-optimized copy 
that "visits" each doc and copies it, like the StoredFieldsReader does? This 
way the function would be private to the StoredFieldsConsumer. We can still add 
the optimization you're describing later, but wouldn't it be confusing if the 
index writer's writes are not compressed the same way as the other stored 
fields writes?
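
What the non-optimized fallback boils down to, sketched with public reader APIs
only (the real code lives at the codec level and writes through a
StoredFieldsWriter; the class name and the newToOld doc map parameter here are
made up for illustration):
{noformat}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.DocumentStoredFieldVisitor;
import org.apache.lucene.index.LeafReader;

final class VisitAndCopy {
  // Copy stored fields doc by doc, in the new (sorted) order.
  static List<Document> copyInSortedOrder(LeafReader unsorted, IntUnaryOperator newToOld)
      throws IOException {
    List<Document> sorted = new ArrayList<>();
    for (int newDoc = 0; newDoc < unsorted.maxDoc(); newDoc++) {
      DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor();
      unsorted.document(newToOld.applyAsInt(newDoc), visitor); // "visit" the old document...
      sorted.add(visitor.getDocument());                       // ...and copy its stored fields
    }
    return sorted;
  }
}
{noformat}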




[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-16 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753913#comment-15753913
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

{quote}
CompressingStoredFieldsWriter.sort should always have a 
CompressingStoredFieldsReader as an input, since the codec cannot change in the 
middle of the flush, so I think we should be able to skip the instanceof check?
{quote}

That's true for the only call we make to this new API, but since it's public it 
could be called with a different fields reader in another use case. I am not 
happy that I had to add this new public API to the StoredFieldsReader, but it's 
the only way to optimize this for the compressing case. 

{quote}
It would personally help me to have comments eg. in MergeState.maybeSortReaders 
that the indexSort==null case may only happen for bwc reasons. Maybe we should 
also assert that if index sorting is configured, then the non-sorted segments 
can only have 6.2 or 6.3 as a version
{quote}

Agreed, I'll add an assert for the non-sorted case. I'll also add a comment to 
make it clear that indexSort==null is only handled for BWC reasons in 
maybeSortReaders.

Thanks for having a look, [~jpountz].
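
The shape of the trade-off being discussed, with stand-in types (not the real
StoredFieldsReader/CompressingStoredFieldsReader classes): keep the instanceof
fast path, but fall back to the generic doc-by-doc copy for any other reader.
{noformat}
// Stand-in types for illustration only.
interface FieldsReaderLike {}
final class CompressingFieldsReaderLike implements FieldsReaderLike {}

final class SortDispatch {
  static String sort(FieldsReaderLike reader) {
    if (reader instanceof CompressingFieldsReaderLike) {
      // Same format as the writer: compressed chunks could be copied in bulk.
      return "bulk copy";
    }
    // Any other reader (possible because the API is public): rewrite doc by doc.
    return "doc-by-doc copy";
  }
}
{noformat}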







[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-15 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750937#comment-15750937
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

{quote}
We still need to wrap unsorted segments during the merge for BWC so 
SortingLeafReader should remain.
{quote}

We can still rewrite it as a SortingCodecReader and remove the 
SlowCodecReaderWrapper, but that's another issue ;)

{quote}
I think we should push first to master, and let that bake some, and in the mean 
time work out the challenging 6.x back port?
{quote}

Agreed. I'll create a branch for the back port in my repo.

{quote}
I'll wait a day or so before committing to give others a chance to review; it's 
a large change.
{quote}

That's awesome, [~mikemccand]! Thanks for the review and testing.

 




[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-13 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744895#comment-15744895
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

I pushed another iteration to 
https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort

I cleaned up the nocommit and added the implementation for sorting term vectors.

{quote}
Do any of the exceptions tests for IndexWriter get angry? Seems like
if we hit an IOException e.g. during the renaming that
SortingStoredFieldsConsumer.flush does we may leave undeleted
files? Hmm or perhaps IW takes care of that by wrapping the directory
itself...
{quote}

I added an abort method on the StoredFieldsWriter which deletes the remaining 
temporary files and did the same for the SortingTermVectorsConsumer.

[~mikemccand], can you take a look?
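
A minimal sketch of the abort idea, assuming the consumer tracks the name of
the file produced by the first (unsorted) pass; the class and field names here
are made up, only IOUtils is the real Lucene utility:
{noformat}
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.IOUtils;

final class TwoPassCleanup {
  private final Directory directory;
  private final String tmpFileName; // file written by the first (unsorted) pass

  TwoPassCleanup(Directory directory, String tmpFileName) {
    this.directory = directory;
    this.tmpFileName = tmpFileName;
  }

  // Called when the flush fails: best-effort deletion that never throws, so the
  // original exception from the flush is the one that surfaces.
  void abort() {
    IOUtils.deleteFilesIgnoringExceptions(directory, tmpFileName);
  }
}
{noformat}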





[jira] [Updated] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting

2016-12-09 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7581:
-
Attachment: LUCENE-7581.patch

Here is a patch that makes DV updates fail on a field involved in the index sort.
I also modified TestIndexSorting#testConcurrentDVUpdates, which now tests DV 
updates on fields that are not involved in the index sort.
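
In case it helps reviewers, the guard described above roughly amounts to the
following check (placement and naming are illustrative, not the actual
IndexWriter code):
{noformat}
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

final class IndexSortUpdateGuard {
  // Reject doc-values updates on any field that participates in the index sort,
  // since updating it would silently break the sort order inside existing segments.
  static void checkDocValuesUpdate(Sort indexSort, String field) {
    if (indexSort == null) {
      return;
    }
    for (SortField sortField : indexSort.getSort()) {
      if (field.equals(sortField.getField())) {
        throw new IllegalArgumentException(
            "cannot update doc values of field [" + field + "]: it is used by the index sort");
      }
    }
  }
}
{noformat}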

> IndexWriter#updateDocValues can break index sorting
> ---
>
> Key: LUCENE-7581
> URL: https://issues.apache.org/jira/browse/LUCENE-7581
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7581.patch, LUCENE-7581.patch
>
>
> IndexWriter#updateDocValues can break index sorting if it is called on a 
> field that is used in the index sorting specification. 
> TestIndexSorting has a test for this case, #testConcurrentDVUpdates, 
> but only L1 merges are checked. Any LN merge would fail the test because the 
> inner sort of the segment is not recomputed during/after DV updates.






[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support

2016-12-05 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723586#comment-15723586
 ] 

Ferenczi Jim commented on LUCENE-7575:
--

{quote}
I was thinking a bit more about the wastefulness of re-creating SpanQueries 
with different field that are otherwise identical. Some day we could refactor 
out from WSTE a Query -> SpanQuery conversion utility that furthermore allows 
you to re-target the field. With that in place, we could avoid the waste for 
PhraseQuery and MultiPhraseQuery – the most typical position-sensitive queries.
{quote}

I agree, I'll work on this shortly. Thanks for the hint ;)

> UnifiedHighlighter: add requireFieldMatch=false support
> ---
>
> Key: LUCENE-7575
> URL: https://issues.apache.org/jira/browse/LUCENE-7575
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Assignee: David Smiley
> Fix For: 6.4
>
> Attachments: LUCENE-7575.patch, LUCENE-7575.patch, LUCENE-7575.patch
>
>
> The UnifiedHighlighter (like the PostingsHighlighter) only supports 
> highlighting queries for the same fields that are being highlighted.  The 
> original Highlighter and FVH support loosening this, AKA 
> requireFieldMatch=false.
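
In predicate terms (the approach the patches on this issue converge on), the two
modes look roughly like this; the wiring into the highlighter is deliberately
left out, and the class below is only an illustration:
{noformat}
import java.util.function.Predicate;

final class FieldMatchers {
  // requireFieldMatch=true: only accept terms from the exact field being highlighted.
  static Predicate<String> strict(String highlightedField) {
    return highlightedField::equals;
  }

  // requireFieldMatch=false: accept terms extracted from any field of the query.
  static Predicate<String> relaxed() {
    return field -> true;
  }
}
{noformat}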






[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-05 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723545#comment-15723545
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

Thanks Mike, 

{quote}
Can we rename freezed to frozen in BinaryDocValuesWriter?
But: why would freezed ever be true when we call flush?
Shouldn't it only be called once, even in the sorting case?
{quote}

This is a leftover that is not needed. The naming was wrong ;) and it's useless 
so I removed it.

{quote}
I also like how you were able to re-use the SortingXXX from
SortingLeafReader. Later on we can maybe optimize some of these;
e.g. SortingFields and CachedXXXDVs should be able to take
advantage of the fact that the things they are sorting are all already
in heap (the indexing buffer), the way you did with
MutableSortingPointValues (cool).
{quote}

Totally agree, we can revisit this later and see if we can optimize memory. I 
think it's already an improvement over master in terms of memory usage, since 
we only "sort" the segment to be flushed instead of all "unsorted" segments 
during the merge.

{quote}
Can we block creating a SortingLeafReader now (make its
constructor private)? We only now ever use its inner classes I think?
And it is a dangerous class in the first place... if we can do that,
maybe we rename it SortingCodecUtils or something, just for its
inner classes.
{quote}

We still need to wrap unsorted segments during the merge for BWC, so 
SortingLeafReader should remain. I have no idea when we can remove it, since 
indices built with older versions should still be compatible with this new one.


{quote}
Do any of the exceptions tests for IndexWriter get angry? Seems like
if we hit an IOException e.g. during the renaming that
SortingStoredFieldsConsumer.flush does we may leave undeleted
files? Hmm or perhaps IW takes care of that by wrapping the directory
itself...
{quote}

Honestly I have no idea. I will dig.

{quote}
Can't you just pass sortMap::newToOld directly (method reference)
instead of making the lambda here?:
{quote}

Indeed, thanks.
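
For anyone skimming, the nit above is just lambda vs. method reference; with a
stand-in sort map type (not the real sorter class) it looks like this:
{noformat}
import java.util.function.IntUnaryOperator;

final class SortMapRefs {
  interface SortMap { int newToOld(int docId); } // stand-in for the sorter's doc map

  static IntUnaryOperator verbose(SortMap sortMap) {
    return docId -> sortMap.newToOld(docId); // explicit lambda
  }

  static IntUnaryOperator concise(SortMap sortMap) {
    return sortMap::newToOld;                // equivalent method reference
  }
}
{noformat}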

{quote}
I think the 6.x back port here is going to be especially tricky 
{quote}

I bet, but as it is, the main part is done by reusing the SortingLeafReader 
inner classes that already exist in 6.x. 

I've also removed a nocommit in the AssertingLiveDocsFormat that now checks 
live docs even when they are sorted.



 


[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support

2016-12-05 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722580#comment-15722580
 ] 

Ferenczi Jim commented on LUCENE-7575:
--

Thanks David!




[jira] [Updated] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support

2016-12-05 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7575:
-
Attachment: LUCENE-7575.patch

Thanks David!
Here is a new patch to address your last comments. Now we have a 
FieldFilteringTermSet and extractTerms uses a simple HashSet.

{quote}
couldn't defaultFieldMatcher be initialized to non-null to match the same 
field? Then getFieldMatcher() would simply return it.
{quote}

Not as a Predicate, since the predicate only takes the candidate field 
name. We could use a BiPredicate to always provide the current field name to 
the predicate, but I find it simpler this way. 
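
The difference being weighed here, in plain types (hypothetical examples, not
the patch's code): a Predicate only sees the candidate field name, while a
BiPredicate would also receive the field currently being highlighted.
{noformat}
import java.util.function.BiPredicate;
import java.util.function.Predicate;

final class MatcherVariants {
  // Current patch: the matcher only looks at the candidate field name.
  static Predicate<String> candidateOnly() {
    return candidateField -> candidateField.startsWith("body");
  }

  // Alternative discussed above: also pass the highlighted field to the matcher.
  static BiPredicate<String, String> withHighlightedField() {
    return (highlightedField, candidateField) -> candidateField.startsWith(highlightedField);
  }
}
{noformat}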





[jira] [Commented] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting

2016-12-05 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722476#comment-15722476
 ] 

Ferenczi Jim commented on LUCENE-7581:
--

[~mikemccand] I think so too. I'll work on a patch.




[jira] [Updated] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting

2016-12-05 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7581:
-
Attachment: LUCENE-7581.patch

I attached a patch that makes the test fail if a second round of DV updates is run.




[jira] [Created] (LUCENE-7581) IndexWriter#updateDocValues can break index sorting

2016-12-05 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7581:


 Summary: IndexWriter#updateDocValues can break index sorting
 Key: LUCENE-7581
 URL: https://issues.apache.org/jira/browse/LUCENE-7581
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ferenczi Jim


IndexWriter#updateDocValues can break index sorting if it is called on a field 
that is used in the index sorting specification. 
TestIndexSorting has a test for this case, #testConcurrentDVUpdates, 
but only L1 merges are checked. Any LN merge would fail the test because the 
inner sort of the segment is not recomputed during/after DV updates.






[jira] [Updated] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support

2016-12-05 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7575:
-
Attachment: LUCENE-7575.patch

Thanks [~dsmiley] and [~Timothy055]!

I pushed a new patch to address your comments. 

{quote}
 it'd be interesting if instead of a simple boolean toggle, if it were a 
Predicate fieldMatchPredicate so that only some fields could be 
collected in the query but not all. Just an idea.
{quote}

I agree, and this is why I changed the patch to include your idea. By default 
nothing changes: queries are extracted based on the field name to highlight. 
But with this change the user can now define which queries (based on the field 
name) should be highlighted. I think it's better like this, but I can revert if 
you think this should not be implemented in the first iteration.

I fixed the bugs that David spotted (terms from different fields not sorted 
after filteredExtractTerms and redundant initialization of the filter leaf 
reader for the span queries) and split the tests based on the type of query 
that is tested.





[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

2016-12-01 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712572#comment-15712572
 ] 

Ferenczi Jim commented on LUCENE-7579:
--

I ran the test from a clean state and I can see a nice improvement with the 
sparsetaxis use case. 

I use 
https://github.com/mikemccand/luceneutil/blob/master/src/python/sparsetaxis/runBenchmark.py
 and compare two checkouts of Lucene, one with my branch and the other with 
master.
For the master branch I have:
{noformat}
838.0 sec:  20.0 M docs;  23.9 K docs/sec
{noformat}

... vs the branch with the flush sort:
{noformat}
 612.2 sec:  20.0 M docs;  32.7 K docs/sec
{noformat}

I reproduce the same diff on each run :)
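
For reference, 612.2 s vs 838.0 s is roughly a 27% reduction in wall time, i.e. 
about a 1.37x indexing speedup, which matches the 23.9 K vs 32.7 K docs/sec 
figures.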







[jira] [Created] (LUCENE-7579) Sorting on flushed segment

2016-12-01 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7579:


 Summary: Sorting on flushed segment
 Key: LUCENE-7579
 URL: https://issues.apache.org/jira/browse/LUCENE-7579
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ferenczi Jim


Today, flushed segments built by an index writer with an index sort specified 
are not sorted. The merge is responsible for sorting these segments, potentially 
together with others that are already sorted (resulting from another merge). 
I'd like to investigate the cost of sorting the segment directly during the 
flush. This could make merges faster, since there are some cheap optimizations 
that can only be done if all segments to be merged are sorted.
 For instance, the merge of points could use the bulk merge instead of 
rebuilding the points from scratch.
I made a small prototype which sorts the segment on flush here:
https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort

The idea is simple: for points, norms, doc values and terms I use the 
SortingLeafReader implementation to translate the values that we have in RAM 
into a sorted enumeration for the writers.
For stored fields I use a two-pass scheme where the documents are first written 
to disk unsorted and then copied to another file with the correct sorting. I 
use the same stored fields format for the two steps and just remove the file 
produced by the first pass at the end of the process.
This prototype has no implementation for index sorting that uses term vectors 
yet. I'll add it later if the tests are good enough.
Speaking of testing, I tried this branch with [~mikemccand]'s benchmark scripts 
and compared master with index sorting against my branch with index sorting on 
flush. I tried with sparsetaxis and wikipedia and the first results are weird. 
When I use the SerialScheduler and only one thread to write the docs, index 
sorting on flush is slower. But when I use two threads the sorting on flush is 
much faster, even with the SerialScheduler. I'll continue to run the tests in 
order to be able to share something more meaningful.

The tests are passing except one about concurrent DV updates. I don't know this 
part at all so I have not fixed the test yet. I don't even know if we can make 
it work with index sorting ;).

 [~mikemccand] I would love to have your feedback on the prototype. Could 
you please take a look? I am sure there are plenty of bugs... but I think 
it's a good start to evaluate the feasibility of this feature.






[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support

2016-11-29 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706216#comment-15706216
 ] 

Ferenczi Jim commented on LUCENE-7575:
--

Hi [~dsmiley],
I've attached a patch based on the comment above. I did not find a clean way to 
detect duplicates in the span queries extracted by the PhraseHelper when 
requireFieldMatch=false. I agree that it's not essential, so I pushed the patch 
as is. Could you please take a look?




[jira] [Updated] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support

2016-11-29 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7575:
-
Attachment: LUCENE-7575.patch

Patch for requireFieldMatch




[jira] [Commented] (LUCENE-7574) Another TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695686#comment-15695686
 ] 

Ferenczi Jim commented on LUCENE-7574:
--

{quote}Can you explain the first patch a bit more? It felt correct to me to 
take whether the sort is reversed into account in order to compute the missing 
ordinal?{quote}

Yes, but we do it twice, and that reverses the sort on missing ordinals, so my 
patch just removes one of the two. At line 176 in 
MultiSorter.CrossReaderComparator, reverseMul is applied to the result of the 
ordinals comparison. That would be fine to do on missing ordinals as well, but 
we already applied reverseMul to the missing ordinal at line 160, so the result 
is reversed twice. I hope that makes sense.
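
A tiny self-contained illustration of the double application described above
(toy values, not the MultiSorter code): once reverseMul is folded into the
missing ordinal, applying it again to the comparison flips missing values back
to the wrong side.
{noformat}
final class ReverseMulDemo {
  public static void main(String[] args) {
    final int reverseMul = -1;                              // descending sort
    final int ordOfSomeValue = 10;
    final int missingOrd = reverseMul * Integer.MAX_VALUE;  // missing ordinal, reverseMul already applied

    int cmp = Integer.compare(missingOrd, ordOfSomeValue);
    System.out.println("reverseMul applied once : " + cmp);                // missing sorts as intended
    System.out.println("reverseMul applied twice: " + (reverseMul * cmp)); // order flipped back: wrong
  }
}
{noformat}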

> Another TestIndexSorting failures
> -
>
> Key: LUCENE-7574
> URL: https://issues.apache.org/jira/browse/LUCENE-7574
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 6.x, master (7.0)
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7574-1.patch, LUCENE-7574-2.patch
>
>
> TestIndexSorting still fails with some seeds:
> {noformat}
>[junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=6E45BA611FCD7241 -Dtests.slow=true 
> -Dtests.locale=ko-KR -Dtests.timezone=America/Cordoba -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] FAILURE 0.64s J1 | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: org.junit.ComparisonFailure: expected:<[449]> 
> but was:<[650]>
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([6E45BA611FCD7241:CC9DF4BB7B3F5B47]:0)
>[junit4]>  at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2264)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]   2> NOTE: leaving temporary files on disk at: 
> /var/lib/jenkins/workspace/apache+lucene-solr+branch_6x/lucene/build/core/test/J1/temp/lucene.index.TestIndexSorting_6E45BA611FCD7241-001
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): 
> {docs=PostingsFormat(name=MockRandom), norms=PostingsFormat(name=MockRandom), 
> positions=PostingsFormat(name=Memory doPackFST= true), 
> id=PostingsFormat(name=MockRandom), term_vectors=FSTOrd50}, 
> docValues:{multi_valued_long=DocValuesFormat(name=Direct), 
> double=DocValuesFormat(name=Lucene54), foo=DocValuesFormat(name=Direct), 
> numeric=DocValuesFormat(name=Lucene54), 
> positions=DocValuesFormat(name=Direct), 
> multi_valued_numeric=DocValuesFormat(name=Memory), 
> float=DocValuesFormat(name=Lucene54), int=DocValuesFormat(name=Memory), 
> long=DocValuesFormat(name=Lucene54), points=DocValuesFormat(name=Memory), 
> sorted=DocValuesFormat(name=Lucene54), 
> multi_valued_double=DocValuesFormat(name=Memory), 
> docs=DocValuesFormat(name=Memory), 
> multi_valued_string=DocValuesFormat(name=Memory), 
> norms=DocValuesFormat(name=Memory), bytes=DocValuesFormat(name=Memory), 
> binary=DocValuesFormat(name=Lucene54), id=DocValuesFormat(name=Memory), 
> multi_valued_int=DocValuesFormat(name=Lucene54), 
> multi_valued_bytes=DocValuesFormat(name=Lucene54), 
> multi_valued_float=DocValuesFormat(name=Lucene54), 
> term_vectors=DocValuesFormat(name=Lucene54)}, maxPointsInLeafNode=419, 
> maxMBSortInHeap=7.394324294878203, 
> sim=RandomSimilarity(queryNorm=true,coord=yes): {positions=DFR I(n)Z(0.3), 
> id=IB SPL-L1, term_vectors=DFR I(ne)B3(800.0)}, locale=ko-KR, 
> timezone=America/Cordoba
>[junit4]   2> NOTE: Linux 3.12.60-52.54-default amd64/Oracle Corporation 
> 1.8.0_111 (64-bit)/cpus=4,threads=1,free=105782608,total=240648192
>[junit4]   2> NOTE: All tests run in this JVM: [TestPayloads, 
> TestSnapshotDeletionPolicy, TestDocValues, FuzzyTermOnShortTermsTest, 
> TestNoMergeScheduler, TestPointValues, TestSegmentInfos, TestStressIndexing2, 
> TestSimpleExplanationsOfNonMatches, TestPrefixInBooleanQuery, TestTermQuery, 
> TestSegmentMerger, TestByteArrayDataInput, TestTransactions, TestMultiFields, 
> TestNRTReaderCleanup, TestPackedInts, TestIndexWriterExceptions, 
> TestSleepingLockWrapper, TestBlockPostingsFormat2, TestLSBRadixSorter, 
> TestSwappedIndexFiles, TestIndexWriterCommit, TestPrefixRandom, 
> Test4GBStoredFields, TestFuzzyQuery, TestCodecUtil, 
> TestSimpleSearchEquivalence, TestWeakIdentityMap, TestIndexWriterOnDiskFull, 
> TestTopDocsMerge, TestOmitTf, TestDuelingCodecs, TestRAMDirectory, 
> TestFlushByRamOrCountsPolicy, TestDemo, TestSimpleExplanationsWithFillerDocs, 
> TestByteSlices, TestParallelLeafReader, TestSortedSetSelector, 
> TestBagOfPostings, TestDemoParallelLeafReader, TestTopFieldCollector, 
> TestSearchForDuplicates, TestStringHelper, TestTragicIndexWriter

[jira] [Commented] (LUCENE-7569) TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15695345#comment-15695345
 ] 

Ferenczi Jim commented on LUCENE-7569:
--

I opened https://issues.apache.org/jira/browse/LUCENE-7574 for the recent 
failures. Sorry for the noise.

> TestIndexSorting failures
> -
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7569.patch
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
>[junit4]>at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019)
>[junit4]>at java.lang.Thread.run(Thread.java:745)
>[junit4]> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]>at 
> 

[jira] [Updated] (LUCENE-7574) Another TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7574:
-
Attachment: LUCENE-7574-2.patch

This second patch is for the testNumericAlreadySorted failure and should be 
applied on master and branch_6x. This test didn't expect that a merge of a 
single segment was possible. 

> Another TestIndexSorting failures
> -
>
> Key: LUCENE-7574
> URL: https://issues.apache.org/jira/browse/LUCENE-7574
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 6.x, master (7.0)
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7574-1.patch, LUCENE-7574-2.patch
>
>
> TestIndexSorting still fails with some seeds:
> {noformat}
>[junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=6E45BA611FCD7241 -Dtests.slow=true 
> -Dtests.locale=ko-KR -Dtests.timezone=America/Cordoba -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] FAILURE 0.64s J1 | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: org.junit.ComparisonFailure: expected:<[449]> 
> but was:<[650]>
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([6E45BA611FCD7241:CC9DF4BB7B3F5B47]:0)
>[junit4]>  at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2264)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]   2> NOTE: leaving temporary files on disk at: 
> /var/lib/jenkins/workspace/apache+lucene-solr+branch_6x/lucene/build/core/test/J1/temp/lucene.index.TestIndexSorting_6E45BA611FCD7241-001
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): 
> {docs=PostingsFormat(name=MockRandom), norms=PostingsFormat(name=MockRandom), 
> positions=PostingsFormat(name=Memory doPackFST= true), 
> id=PostingsFormat(name=MockRandom), term_vectors=FSTOrd50}, 
> docValues:{multi_valued_long=DocValuesFormat(name=Direct), 
> double=DocValuesFormat(name=Lucene54), foo=DocValuesFormat(name=Direct), 
> numeric=DocValuesFormat(name=Lucene54), 
> positions=DocValuesFormat(name=Direct), 
> multi_valued_numeric=DocValuesFormat(name=Memory), 
> float=DocValuesFormat(name=Lucene54), int=DocValuesFormat(name=Memory), 
> long=DocValuesFormat(name=Lucene54), points=DocValuesFormat(name=Memory), 
> sorted=DocValuesFormat(name=Lucene54), 
> multi_valued_double=DocValuesFormat(name=Memory), 
> docs=DocValuesFormat(name=Memory), 
> multi_valued_string=DocValuesFormat(name=Memory), 
> norms=DocValuesFormat(name=Memory), bytes=DocValuesFormat(name=Memory), 
> binary=DocValuesFormat(name=Lucene54), id=DocValuesFormat(name=Memory), 
> multi_valued_int=DocValuesFormat(name=Lucene54), 
> multi_valued_bytes=DocValuesFormat(name=Lucene54), 
> multi_valued_float=DocValuesFormat(name=Lucene54), 
> term_vectors=DocValuesFormat(name=Lucene54)}, maxPointsInLeafNode=419, 
> maxMBSortInHeap=7.394324294878203, 
> sim=RandomSimilarity(queryNorm=true,coord=yes): {positions=DFR I(n)Z(0.3), 
> id=IB SPL-L1, term_vectors=DFR I(ne)B3(800.0)}, locale=ko-KR, 
> timezone=America/Cordoba
>[junit4]   2> NOTE: Linux 3.12.60-52.54-default amd64/Oracle Corporation 
> 1.8.0_111 (64-bit)/cpus=4,threads=1,free=105782608,total=240648192
>[junit4]   2> NOTE: All tests run in this JVM: [TestPayloads, 
> TestSnapshotDeletionPolicy, TestDocValues, FuzzyTermOnShortTermsTest, 
> TestNoMergeScheduler, TestPointValues, TestSegmentInfos, TestStressIndexing2, 
> TestSimpleExplanationsOfNonMatches, TestPrefixInBooleanQuery, TestTermQuery, 
> TestSegmentMerger, TestByteArrayDataInput, TestTransactions, TestMultiFields, 
> TestNRTReaderCleanup, TestPackedInts, TestIndexWriterExceptions, 
> TestSleepingLockWrapper, TestBlockPostingsFormat2, TestLSBRadixSorter, 
> TestSwappedIndexFiles, TestIndexWriterCommit, TestPrefixRandom, 
> Test4GBStoredFields, TestFuzzyQuery, TestCodecUtil, 
> TestSimpleSearchEquivalence, TestWeakIdentityMap, TestIndexWriterOnDiskFull, 
> TestTopDocsMerge, TestOmitTf, TestDuelingCodecs, TestRAMDirectory, 
> TestFlushByRamOrCountsPolicy, TestDemo, TestSimpleExplanationsWithFillerDocs, 
> TestByteSlices, TestParallelLeafReader, TestSortedSetSelector, 
> TestBagOfPostings, TestDemoParallelLeafReader, TestTopFieldCollector, 
> TestSearchForDuplicates, TestStringHelper, TestTragicIndexWriterDeadlock, 
> TestDirectPacked, TestSloppyMath, TestPrefixQuery, TestSimpleFSDirectory, 
> TestFixedLengthBytesRefArray, TestIndexingSequenceNumbers, TestCharArraySet, 
> TestRollingBuffer, TestPagedBytes, TestFixedBitSet, TestAutomaton, 
> TestPhrasePrefixQuery, TestMultiPhraseEnum, TestBytesRefAttImpl, 
> TestDocsAndPositions, TestCharsRefBuilder, TestDeterminizeLexicon, 
> TestNIOFSDirectory, TestConju

[jira] [Updated] (LUCENE-7569) TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7569:
-
Attachment: (was: LUCENE-7574-1.patch)

> TestIndexSorting failures
> -
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7569.patch
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
>[junit4]>at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019)
>[junit4]>at java.lang.Thread.run(Thread.java:745)
>[junit4]> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]>at 
> org.apac

[jira] [Issue Comment Deleted] (LUCENE-7569) TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7569:
-
Comment: was deleted

(was: This is a patch for branch_6x only. It fixes the test failure on 
TestIndexSorting#testRandom3. master is not affected since the bug was introduced 
when the master patch was rewritten for the 6x branch.
The bug is that we reverse the sort twice when comparing two multi-valued fields 
with no values.)

> TestIndexSorting failures
> -
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7569.patch
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
>[junit4]>at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019)
>[junit4]>at java.lang.Thread.run(Thread.java:745)
>[junit4]> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>  

[jira] [Updated] (LUCENE-7574) Another TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7574:
-
Attachment: LUCENE-7574-1.patch

This is a patch for branch_6x only. It fixes the test failure on 
TestIndexSorting#testRandom3. master is not affected since the bug was introduced 
when the master patch was rewritten for the 6x branch.
The bug is that we reverse the sort twice when comparing two multi-valued fields 
with no values.

> Another TestIndexSorting failures
> -
>
> Key: LUCENE-7574
> URL: https://issues.apache.org/jira/browse/LUCENE-7574
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 6.x, master (7.0)
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7574-1.patch
>
>
> TestIndexSorting still fails with some seeds:
> {noformat}
>[junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=6E45BA611FCD7241 -Dtests.slow=true 
> -Dtests.locale=ko-KR -Dtests.timezone=America/Cordoba -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] FAILURE 0.64s J1 | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: org.junit.ComparisonFailure: expected:<[449]> 
> but was:<[650]>
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([6E45BA611FCD7241:CC9DF4BB7B3F5B47]:0)
>[junit4]>  at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2264)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]   2> NOTE: leaving temporary files on disk at: 
> /var/lib/jenkins/workspace/apache+lucene-solr+branch_6x/lucene/build/core/test/J1/temp/lucene.index.TestIndexSorting_6E45BA611FCD7241-001
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): 
> {docs=PostingsFormat(name=MockRandom), norms=PostingsFormat(name=MockRandom), 
> positions=PostingsFormat(name=Memory doPackFST= true), 
> id=PostingsFormat(name=MockRandom), term_vectors=FSTOrd50}, 
> docValues:{multi_valued_long=DocValuesFormat(name=Direct), 
> double=DocValuesFormat(name=Lucene54), foo=DocValuesFormat(name=Direct), 
> numeric=DocValuesFormat(name=Lucene54), 
> positions=DocValuesFormat(name=Direct), 
> multi_valued_numeric=DocValuesFormat(name=Memory), 
> float=DocValuesFormat(name=Lucene54), int=DocValuesFormat(name=Memory), 
> long=DocValuesFormat(name=Lucene54), points=DocValuesFormat(name=Memory), 
> sorted=DocValuesFormat(name=Lucene54), 
> multi_valued_double=DocValuesFormat(name=Memory), 
> docs=DocValuesFormat(name=Memory), 
> multi_valued_string=DocValuesFormat(name=Memory), 
> norms=DocValuesFormat(name=Memory), bytes=DocValuesFormat(name=Memory), 
> binary=DocValuesFormat(name=Lucene54), id=DocValuesFormat(name=Memory), 
> multi_valued_int=DocValuesFormat(name=Lucene54), 
> multi_valued_bytes=DocValuesFormat(name=Lucene54), 
> multi_valued_float=DocValuesFormat(name=Lucene54), 
> term_vectors=DocValuesFormat(name=Lucene54)}, maxPointsInLeafNode=419, 
> maxMBSortInHeap=7.394324294878203, 
> sim=RandomSimilarity(queryNorm=true,coord=yes): {positions=DFR I(n)Z(0.3), 
> id=IB SPL-L1, term_vectors=DFR I(ne)B3(800.0)}, locale=ko-KR, 
> timezone=America/Cordoba
>[junit4]   2> NOTE: Linux 3.12.60-52.54-default amd64/Oracle Corporation 
> 1.8.0_111 (64-bit)/cpus=4,threads=1,free=105782608,total=240648192
>[junit4]   2> NOTE: All tests run in this JVM: [TestPayloads, 
> TestSnapshotDeletionPolicy, TestDocValues, FuzzyTermOnShortTermsTest, 
> TestNoMergeScheduler, TestPointValues, TestSegmentInfos, TestStressIndexing2, 
> TestSimpleExplanationsOfNonMatches, TestPrefixInBooleanQuery, TestTermQuery, 
> TestSegmentMerger, TestByteArrayDataInput, TestTransactions, TestMultiFields, 
> TestNRTReaderCleanup, TestPackedInts, TestIndexWriterExceptions, 
> TestSleepingLockWrapper, TestBlockPostingsFormat2, TestLSBRadixSorter, 
> TestSwappedIndexFiles, TestIndexWriterCommit, TestPrefixRandom, 
> Test4GBStoredFields, TestFuzzyQuery, TestCodecUtil, 
> TestSimpleSearchEquivalence, TestWeakIdentityMap, TestIndexWriterOnDiskFull, 
> TestTopDocsMerge, TestOmitTf, TestDuelingCodecs, TestRAMDirectory, 
> TestFlushByRamOrCountsPolicy, TestDemo, TestSimpleExplanationsWithFillerDocs, 
> TestByteSlices, TestParallelLeafReader, TestSortedSetSelector, 
> TestBagOfPostings, TestDemoParallelLeafReader, TestTopFieldCollector, 
> TestSearchForDuplicates, TestStringHelper, TestTragicIndexWriterDeadlock, 
> TestDirectPacked, TestSloppyMath, TestPrefixQuery, TestSimpleFSDirectory, 
> TestFixedLengthBytesRefArray, TestIndexingSequenceNumbers, TestCharArraySet, 
> TestRollingBuffer, TestPagedBytes, TestFixedBitSet, TestAutomaton, 
> TestPhrasePrefixQuery, TestMultiPhraseEnum, TestBytesRefAttImpl, 
> TestDocsAndPosi

[jira] [Updated] (LUCENE-7569) TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7569:
-
Attachment: LUCENE-7574-1.patch

This is a patch for branch_6x only. It fixes the test failure on 
TestIndexSorting#testRandom3. master is not affected since the bug was introduced 
when the master patch was rewritten for the 6x branch.
The bug is that we reverse the sort twice when comparing two multi-valued fields 
with no values.

> TestIndexSorting failures
> -
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7569.patch, LUCENE-7574-1.patch
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
>[junit4]>at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019)
>[junit4]>at java.lang.Thread.run(Thread.java:745)
>[junit4]> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocV

[jira] [Created] (LUCENE-7574) Another TestIndexSorting failures

2016-11-25 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7574:


 Summary: Another TestIndexSorting failures
 Key: LUCENE-7574
 URL: https://issues.apache.org/jira/browse/LUCENE-7574
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/index
Affects Versions: 6.x, master (7.0)
Reporter: Ferenczi Jim


TestIndexSorting still fails with some seeds:

{noformat}
   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
-Dtests.method=testRandom3 -Dtests.seed=6E45BA611FCD7241 -Dtests.slow=true 
-Dtests.locale=ko-KR -Dtests.timezone=America/Cordoba -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.64s J1 | TestIndexSorting.testRandom3 <<<
   [junit4]> Throwable #1: org.junit.ComparisonFailure: expected:<[449]> 
but was:<[650]>
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([6E45BA611FCD7241:CC9DF4BB7B3F5B47]:0)
   [junit4]>at 
org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2264)
   [junit4]>at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: leaving temporary files on disk at: 
/var/lib/jenkins/workspace/apache+lucene-solr+branch_6x/lucene/build/core/test/J1/temp/lucene.index.TestIndexSorting_6E45BA611FCD7241-001
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): 
{docs=PostingsFormat(name=MockRandom), norms=PostingsFormat(name=MockRandom), 
positions=PostingsFormat(name=Memory doPackFST= true), 
id=PostingsFormat(name=MockRandom), term_vectors=FSTOrd50}, 
docValues:{multi_valued_long=DocValuesFormat(name=Direct), 
double=DocValuesFormat(name=Lucene54), foo=DocValuesFormat(name=Direct), 
numeric=DocValuesFormat(name=Lucene54), positions=DocValuesFormat(name=Direct), 
multi_valued_numeric=DocValuesFormat(name=Memory), 
float=DocValuesFormat(name=Lucene54), int=DocValuesFormat(name=Memory), 
long=DocValuesFormat(name=Lucene54), points=DocValuesFormat(name=Memory), 
sorted=DocValuesFormat(name=Lucene54), 
multi_valued_double=DocValuesFormat(name=Memory), 
docs=DocValuesFormat(name=Memory), 
multi_valued_string=DocValuesFormat(name=Memory), 
norms=DocValuesFormat(name=Memory), bytes=DocValuesFormat(name=Memory), 
binary=DocValuesFormat(name=Lucene54), id=DocValuesFormat(name=Memory), 
multi_valued_int=DocValuesFormat(name=Lucene54), 
multi_valued_bytes=DocValuesFormat(name=Lucene54), 
multi_valued_float=DocValuesFormat(name=Lucene54), 
term_vectors=DocValuesFormat(name=Lucene54)}, maxPointsInLeafNode=419, 
maxMBSortInHeap=7.394324294878203, 
sim=RandomSimilarity(queryNorm=true,coord=yes): {positions=DFR I(n)Z(0.3), 
id=IB SPL-L1, term_vectors=DFR I(ne)B3(800.0)}, locale=ko-KR, 
timezone=America/Cordoba
   [junit4]   2> NOTE: Linux 3.12.60-52.54-default amd64/Oracle Corporation 
1.8.0_111 (64-bit)/cpus=4,threads=1,free=105782608,total=240648192
   [junit4]   2> NOTE: All tests run in this JVM: [TestPayloads, 
TestSnapshotDeletionPolicy, TestDocValues, FuzzyTermOnShortTermsTest, 
TestNoMergeScheduler, TestPointValues, TestSegmentInfos, TestStressIndexing2, 
TestSimpleExplanationsOfNonMatches, TestPrefixInBooleanQuery, TestTermQuery, 
TestSegmentMerger, TestByteArrayDataInput, TestTransactions, TestMultiFields, 
TestNRTReaderCleanup, TestPackedInts, TestIndexWriterExceptions, 
TestSleepingLockWrapper, TestBlockPostingsFormat2, TestLSBRadixSorter, 
TestSwappedIndexFiles, TestIndexWriterCommit, TestPrefixRandom, 
Test4GBStoredFields, TestFuzzyQuery, TestCodecUtil, 
TestSimpleSearchEquivalence, TestWeakIdentityMap, TestIndexWriterOnDiskFull, 
TestTopDocsMerge, TestOmitTf, TestDuelingCodecs, TestRAMDirectory, 
TestFlushByRamOrCountsPolicy, TestDemo, TestSimpleExplanationsWithFillerDocs, 
TestByteSlices, TestParallelLeafReader, TestSortedSetSelector, 
TestBagOfPostings, TestDemoParallelLeafReader, TestTopFieldCollector, 
TestSearchForDuplicates, TestStringHelper, TestTragicIndexWriterDeadlock, 
TestDirectPacked, TestSloppyMath, TestPrefixQuery, TestSimpleFSDirectory, 
TestFixedLengthBytesRefArray, TestIndexingSequenceNumbers, TestCharArraySet, 
TestRollingBuffer, TestPagedBytes, TestFixedBitSet, TestAutomaton, 
TestPhrasePrefixQuery, TestMultiPhraseEnum, TestBytesRefAttImpl, 
TestDocsAndPositions, TestCharsRefBuilder, TestDeterminizeLexicon, 
TestNIOFSDirectory, TestConjunctionDISI, TestLiveFieldValues, TestBoolean2, 
TestHighCompressionMode, TestIndexWriterUnicode, TestCachingCollector, 
TestMultiDocValues, TestFilterWeight, TestPerFieldPostingsFormat2, 
TestBytesRefHash, TestBooleanQueryVisitSubscorers, TestMatchAllDocsQuery, 
TestBinaryTerms, TestPositionIncrement, TestNumericTokenStream, TestDateTools, 
Test2BPostings, TestBinaryDocument, TestBooleanScorer, TestNot, 
TestReaderClosed, TestNGramPhraseQuery, TestSimpleAttributeImpl, 
Test2BPostingsBytes, Test2BTerms, TestReusableStringReader, 
TestLucene50StoredFieldsF

[jira] [Commented] (LUCENE-7569) TestIndexSorting failures

2016-11-24 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693248#comment-15693248
 ] 

Ferenczi Jim commented on LUCENE-7569:
--

{quote}I'm not sure why I didn't realize we must do so as well for the sorted 
set case. {quote}

I am not sure it is needed. The iterations are stateless in 6.x since it's just 
an iteration over the doc ids and the norm ids, and we call setDocument on the 
underlying doc values reader each time we need to access it?
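
For reference, here is a minimal sketch (not part of any patch) of the 
stateless, 6.x-style access pattern described above, assuming a 
SortedSetDocValues instance obtained from a LeafReader and the segment's 
maxDoc; the reader is re-positioned with setDocument for every document, so no 
iteration state carries over between documents:

{noformat}
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

final class SortedSetAccessSketch {
  // dumps all values of a sorted-set doc values field; 'dv' and 'maxDoc' are
  // assumed to come from a 6.x LeafReader (getSortedSetDocValues / maxDoc)
  static void dump(SortedSetDocValues dv, int maxDoc) {
    for (int doc = 0; doc < maxDoc; doc++) {
      dv.setDocument(doc); // re-position on the current document, no carried state
      for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
        BytesRef term = dv.lookupOrd(ord); // resolve the ordinal to its term bytes
        System.out.println(doc + " -> " + term.utf8ToString());
      }
    }
  }
}
{noformat}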

> TestIndexSorting failures
> -
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7569.patch
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
>[junit4]>at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019)
>[junit4]>at java.lang.Thread.run(Thread.java:745)
>[junit4]> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addS

[jira] [Updated] (LUCENE-7569) TestIndexSorting failures

2016-11-24 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7569:
-
Attachment: LUCENE-7569.patch

It turns out that the problem is just a test bug related to 
MockRandomMergePolicy. This merge policy is only used in tests and randomly 
wraps the reader to be merged in a SlowCodecReaderWrapper in order to 
deactivate the bulk merging. The test passes if I add a MergeReaderWrapper 
around the original reader. This makes index sorting happy since the doc values 
instances are then re-created each time.
Bottom line: merging of multi-valued doc values works fine even when 
index sorting is on, except in these tests where the MockRandomMergePolicy 
disables the bulk merge.
I've attached a patch that fixes this bug and the other test bug on 
testNumericAlreadySorted. The patch is for the master branch (since 
testNumericAlreadySorted should also fail on this branch) but the backport 
should be straightforward. 
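
For illustration only, a minimal sketch of the wrapping mentioned above (not 
the MockRandomMergePolicy code itself); SlowCodecReaderWrapper exposes a 
LeafReader through the codec APIs, which keeps a merge from taking the 
optimized bulk path:

{noformat}
import java.io.IOException;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCodecReaderWrapper;

final class SlowWrapSketch {
  // wraps an arbitrary LeafReader so a merge has to go field by field through
  // the codec APIs instead of using the bulk copy; the test merge policy does
  // this at random, which is what exposed the failure above (note that wrap
  // returns the reader unchanged when it is already a CodecReader)
  static CodecReader slowView(LeafReader reader) throws IOException {
    return SlowCodecReaderWrapper.wrap(reader);
  }
}
{noformat}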

> TestIndexSorting failures
> -
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7569.patch
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566)
>[junit4]>at 
> org.apache.l

[jira] [Commented] (LUCENE-7569) TestIndexSorting failures

2016-11-23 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691060#comment-15691060
 ] 

Ferenczi Jim commented on LUCENE-7569:
--

Thanks [~sar...@syr.edu]. We've investigated this with [~jpountz] and found the 
issue. Index sorting during the merge uses a SortingLeafReader, which keeps a 
per-thread cache of the doc value readers. This breaks the merging of 
SortedSetDocValues and SortedNumericDocValues since we need to iterate two 
instances of these doc values in parallel during the merge (see 
DocValuesConsumer.mergeSortedNumericField). [~jpountz]'s idea to fix this bug 
is to rewrite the SortingLeafReader into a SortingCodecReader. The 
SortingCodecReader would never reuse the same instance when creating a DocValues 
reader. The bug only exists in 6x since in master the doc value readers are now 
iterators that are re-created each time getDocValues is called.
The second issue, about an already sorted index, is just a test bug that appears 
when the random merge policy picks a merge factor of 2.
I'll send a patch for these two issues shortly. The first issue is problematic 
though, because it means that indices that use index sorting (on any field, 
multi-valued or not) in 6x are not able to merge multi-valued doc values 
properly. 
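
To make the failure mode concrete, here is a self-contained toy model 
(deliberately not Lucene code): the merge needs two independent cursors over 
the same doc values, but a cache that hands back a single shared, stateful 
instance makes the two cursors interleave, so values appear out of order to 
the sortedness checks:

{noformat}
public class SharedCursorSketch {

  /** A tiny stand-in for a stateful doc values iterator. */
  static final class Cursor {
    private final long[] values;
    private int pos;
    Cursor(long[] values) { this.values = values; }
    boolean hasNext() { return pos < values.length; }
    long next() { return values[pos++]; }
  }

  public static void main(String[] args) {
    long[] data = {1, 2, 3, 4, 5, 6};

    // correct behaviour: each request gets a fresh instance, as in master where
    // the readers are re-created on every getDocValues call
    Cursor a = new Cursor(data);
    Cursor b = new Cursor(data);
    while (a.hasNext()) {
      System.out.println("independent cursors: " + a.next() + " / " + b.next());
    }

    // broken behaviour: a per-thread cache returns the same instance twice, so
    // the "two" cursors share their position and each one skips every other value
    Cursor shared = new Cursor(data);
    Cursor c = shared;
    Cursor d = shared;
    while (c.hasNext() && d.hasNext()) {
      System.out.println("shared cursor: " + c.next() + " / " + d.next());
    }
  }
}
{noformat}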

> TestIndexSorting failures
> -
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWri

[jira] [Commented] (LUCENE-7569) TestIndexSorting.testRandom3() failures

2016-11-23 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690219#comment-15690219
 ] 

Ferenczi Jim commented on LUCENE-7569:
--

I am looking at it. It seems like a bug when sorting multi-valued doc values.

> TestIndexSorting.testRandom3() failures
> ---
>
> Key: LUCENE-7569
> URL: https://issues.apache.org/jira/browse/LUCENE-7569
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Blocker
> Fix For: master (7.0), 6.4
>
>
> My Jenkins found two reproducing seeds on branch_6x - these look different, 
> but the failures happened on consecutive nightly runs:
> {noformat}
> Checking out Revision 535bf59a3b239f5c7bcd8c00f3e452c9b5e9b539 
> (refs/remotes/origin/branch_6x)
> [...]
>   [junit4] Suite: org.apache.lucene.index.TestIndexSorting
>[junit4]   2> ??? 18, 2016 9:50:39 AM 
> com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
>  uncaughtException
>[junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge 
> Thread #0,5,TGRP-TestIndexSorting]
>[junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: 
> java.lang.AssertionError: nextValue=4594289799775307848 vs 
> previous=4606302611760746829
>[junit4]   2>at 
> __randomizedtesting.SeedInfo.seed([5F8898DCABBFD056]:0)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
>[junit4]   2> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]   2>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]   2>at 
> org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:243)
>[junit4]   2>at 
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167)
>[junit4]   2>at 
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4320)
>[junit4]   2>at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3897)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]   2>at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]   2> 
>[junit4]   2> NOTE: download the large Jenkins line-docs file by running 
> 'ant get-jenkins-line-docs' in the lucene directory.
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexSorting 
> -Dtests.method=testRandom3 -Dtests.seed=5F8898DCABBFD056 -Dtests.multiplier=2 
> -Dtests.nightly=true -Dtests.slow=true 
> -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt 
> -Dtests.locale=he -Dtests.timezone=Canada/Central -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   14.9s J5  | TestIndexSorting.testRandom3 <<<
>[junit4]> Throwable #1: 
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:748)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:762)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1566)
>[junit4]>at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1315)
>[junit4]>at 
> org.apache.lucene.index.TestIndexSorting.testRandom3(TestIndexSorting.java:2019)
>[junit4]>at java.lang.Thread.run(Thread.java:745)
>[junit4]> Caused by: java.lang.AssertionError: 
> nextValue=4594289799775307848 vs previous=4606302611760746829
>[junit4]>at 
> org.apache.lucene.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addSortedNumericField(AssertingDocValuesFormat.java:152)
>[junit4]>at 
> org.apache.lucene.codecs.DocValuesConsumer.mergeSortedNumericField(DocValuesConsumer.java:470)
>[junit4]>at 
> org.apache.lucene.codecs.DocValuesConsumer.mer

[jira] [Updated] (LUCENE-7568) Optimize merge when index sorting is used but the index is already sorted

2016-11-22 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7568:
-
Attachment: LUCENE-7568.patch

Thanks for the review [~mikemccand].
I've modified the test with your suggestions. I am not sure I use the 
FilterCodec appropriately though (especially how I choose the delegating 
codec); can you take a look? 

> Optimize merge when index sorting is used but the index is already sorted
> -
>
> Key: LUCENE-7568
> URL: https://issues.apache.org/jira/browse/LUCENE-7568
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7568.patch, LUCENE-7568.patch
>
>
> When the index sorting is defined a lot of optimizations are disabled during 
> the merge. For instance the bulk merge of the compressing stored fields is 
> disabled since documents are not merged sequentially. Though it can happen 
> that index sorting is enabled but the index is already in sorted order (the 
> sort field is not filled or filled with the same value for all documents). In 
> such case we can detect that the sort is not needed and activate the merge 
> optimization.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7568) Optimize merge when index sorting is used but the index is already sorted

2016-11-21 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7568:
-
Attachment: LUCENE-7568.patch

Here is a first patch that detects whether an index is already sorted and makes 
this information available through MergeState. This information is then used by 
all the merge strategies to decide whether to activate some optimizations.
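
As a rough illustration of the detection idea (not the attached patch, which 
exposes the flag through MergeState), assuming the per-document sort keys of a 
segment in doc id order and an ascending sort:

{noformat}
final class AlreadySortedSketch {
  // returns true when applying an ascending index sort would be a no-op for
  // the segment, i.e. the old-to-new doc map is the identity; 'sortKeys' is
  // assumed to hold one sort key per document, in doc id order
  static boolean alreadySorted(long[] sortKeys) {
    for (int doc = 1; doc < sortKeys.length; doc++) {
      if (sortKeys[doc - 1] > sortKeys[doc]) {
        return false; // at least one pair is out of order, sorting is needed
      }
    }
    return true; // e.g. an unfilled sort field or the same value everywhere
  }
}
{noformat}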

> Optimize merge when index sorting is used but the index is already sorted
> -
>
> Key: LUCENE-7568
> URL: https://issues.apache.org/jira/browse/LUCENE-7568
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7568.patch
>
>
> When the index sorting is defined a lot of optimizations are disabled during 
> the merge. For instance the bulk merge of the compressing stored fields is 
> disabled since documents are not merged sequentially. Though it can happen 
> that index sorting is enabled but the index is already in sorted order (the 
> sort field is not filled or filled with the same value for all documents). In 
> such case we can detect that the sort is not needed and activate the merge 
> optimization.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7568) Optimize merge when index sorting is used but the index is already sorted

2016-11-21 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7568:


 Summary: Optimize merge when index sorting is used but the index 
is already sorted
 Key: LUCENE-7568
 URL: https://issues.apache.org/jira/browse/LUCENE-7568
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Ferenczi Jim


When index sorting is defined, a lot of optimizations are disabled during 
the merge. For instance, the bulk merge of the compressing stored fields is 
disabled since documents are not merged sequentially. However, it can happen that 
index sorting is enabled but the index is already in sorted order (the sort 
field is not filled, or is filled with the same value for all documents). In such 
a case we can detect that the sort is not needed and activate the merge 
optimizations.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-15 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7537:
-
Attachment: LUCENE-7537.patch

Oh right, "sorted_string" is ambiguous. Here is another patch with the renaming 
to "multi_valued" for strings and numerics.
Thanks [~mikemccand]

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7537.patch, LUCENE-7537.patch, LUCENE-7537.patch, 
> LUCENE-7537.patch, LUCENE-7537.patch
>
>
> Today index sorting can be done on single valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi valued fields. Since index 
> sorting does not accept custom comparator we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter a multi valued DVs. 
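
For illustration, a minimal sketch of the min/max selection described in the 
issue above, assuming the 6.x SortedNumericDocValues API and a caller-provided 
missing value (an assumption of this sketch) for documents with no values:

{noformat}
import org.apache.lucene.index.SortedNumericDocValues;

final class MultiValuedSortKeySketch {
  // picks the per-document sort key: values are exposed in ascending order, so
  // index 0 is the minimum (ascending sort) and count - 1 the maximum
  // (descending sort)
  static long sortKey(SortedNumericDocValues dv, int doc, boolean descending, long missingValue) {
    dv.setDocument(doc);
    int count = dv.count();
    if (count == 0) {
      return missingValue; // no values for this document
    }
    return descending ? dv.valueAt(count - 1) : dv.valueAt(0);
  }
}
{noformat}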



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-15 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7537:
-
Attachment: LUCENE-7537.patch

Thanks [~mikemccand], I attached a new patch that addresses your comments. 
I can also make another patch for 6.4 if needed.

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7537.patch, LUCENE-7537.patch, LUCENE-7537.patch, 
> LUCENE-7537.patch
>
>
> Today index sorting can be done on single valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi valued fields. Since index 
> sorting does not accept custom comparator we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter a multi valued DVs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-14 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7537:
-
Attachment: LUCENE-7537.patch

Thanks [~mikemccand]. Sorry, I didn't run the full tests for the last patch. 
I've attached a new one which passes all the tests. I fixed the exceptions and 
the SimpleText codec. Could you take another look? 

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7537.patch, LUCENE-7537.patch, LUCENE-7537.patch
>
>
> Today index sorting can be done on a single-valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi-valued fields. Since index 
> sorting does not accept custom comparators we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter multi-valued DVs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Closed] (LUCENE-7552) FastVectorHighlighter ignores position in PhraseQuery

2016-11-10 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim closed LUCENE-7552.

Resolution: Duplicate

> FastVectorHighlighter ignores position in PhraseQuery
> -
>
> Key: LUCENE-7552
> URL: https://issues.apache.org/jira/browse/LUCENE-7552
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
>
> The PhraseQuery contains a list of terms and the positions for each term. The 
> FVH ignores the term positions and assumes that a phrase query is always 
> dense. As a result, phrase queries with gaps are not highlighted at all. This 
> is problematic for text fields that use a FilteringTokenFilter. This token 
> filter removes tokens but preserves the position increment of each removal. 
> Bottom line: using this token filter breaks the highlighting of phrase 
> queries that contain filtered tokens.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7552) FastVectorHighlighter ignores position in PhraseQuery

2016-11-10 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7552:


 Summary: FastVectorHighlighter ignores position in PhraseQuery
 Key: LUCENE-7552
 URL: https://issues.apache.org/jira/browse/LUCENE-7552
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Reporter: Ferenczi Jim
Priority: Minor


The PhraseQuery contains a list of terms and the positions for each term. The 
FVH ignores the term positions and assumes that a phrase query is always dense. 
As a result, phrase queries with gaps are not highlighted at all. This is 
problematic for text fields that use a FilteringTokenFilter. This token filter 
removes tokens but preserves the position increment of each removal. 
Bottom line: using this token filter breaks the highlighting of phrase 
queries that contain filtered tokens.
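
To make the gap case concrete, here is a minimal sketch (illustrative only; the field and terms are made up) of a phrase query whose positions are not dense, which is exactly the case the FVH currently mishandles:

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseWithGap {
  public static PhraseQuery build() {
    // "out of the box": a stop filter removed "of" and "the", so "out" sits at
    // position 0 and "box" at position 3, leaving a gap the FVH does not expect.
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.add(new Term("body", "out"), 0);
    builder.add(new Term("body", "box"), 3);
    return builder.build();
  }
}
{code}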

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7551) FastVectorHighlighter ignores position in PhraseQuery

2016-11-10 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7551:


 Summary: FastVectorHighlighter ignores position in PhraseQuery
 Key: LUCENE-7551
 URL: https://issues.apache.org/jira/browse/LUCENE-7551
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Reporter: Ferenczi Jim
Priority: Minor


The PhraseQuery contains a list of terms and the positions for each term. The 
FVH ignores the term positions and assumes that a phrase query is always dense. 
As a result, phrase queries with gaps are not highlighted at all. This is 
problematic for text fields that use a FilteringTokenFilter. This token filter 
removes tokens but preserves the position increment of each removal. 
Bottom line: using this token filter breaks the highlighting of phrase 
queries that contain filtered tokens.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-10 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7537:
-
Attachment: LUCENE-7537.patch

I published a new patch which adds index sort support for SortedSetSortField 
and SortedNumericSortField.
[~mikemccand], can you take a look?

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7537.patch, LUCENE-7537.patch
>
>
> Today index sorting can be done on a single-valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi-valued fields. Since index 
> sorting does not accept custom comparators we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter multi-valued DVs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-08 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647978#comment-15647978
 ] 

Ferenczi Jim commented on LUCENE-7537:
--

Thanks [~mikemccand]. I tried this approach first and then added the types to 
clean up the serialization and the index sorting check ;) I can totally revert 
to the first version, which does what you suggest. 

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7537.patch
>
>
> Today index sorting can be done on a single-valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi-valued fields. Since index 
> sorting does not accept custom comparators we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter multi-valued DVs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-08 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647858#comment-15647858
 ] 

Ferenczi Jim commented on LUCENE-7537:
--

> The new types do not look useful to me? 

It's to differentiate the underlying DVs, and also because I didn't want to 
change the expectations of the native sort. That said, I am totally in favor of 
a single type that accepts both DVs if changing the SortField native types is ok. 

>  For instance, DocValues.getSortedSet falls back to 
> LeafReader.getSortedDocValues if the reader does not have SORTED_SET doc 
> values, so all the code that you protected under eg. if (sortField.getType() 
> == SortField.Type.SORTED_STRING) would also work with single-valued (SORTED) 
> doc values (same for SORTED_NUMERIC and NUMERIC doc values).

The leniency is there to catch SortedSetDocValues that end up with a single 
value per field. But yes, it's another point in favor of the merged type.
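
For reference, a small sketch of the fallback mentioned in the quoted comment (the field name is just an example): DocValues.getSortedSet wraps single-valued SORTED doc values in a singleton SortedSetDocValues view, so code written against SORTED_SET also covers the single-valued case.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;

class SortedSetAccess {
  // Returns SORTED_SET doc values for "category"; if the field only has SORTED
  // (single-valued) doc values, DocValues.getSortedSet wraps them in a
  // singleton SortedSetDocValues view instead of returning null.
  static SortedSetDocValues categoryValues(LeafReader reader) throws IOException {
    return DocValues.getSortedSet(reader, "category");
  }
}
{code}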


 

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7537.patch
>
>
> Today index sorting can be done on a single-valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi-valued fields. Since index 
> sorting does not accept custom comparators we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter multi-valued DVs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-08 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7537:
-
Attachment: LUCENE-7537.patch

Here is a simple patch that adds support for multi-valued sorting directly in 
SortField. It defines 5 new sort types: sorted_string, sorted_long, 
sorted_double, sorted_float, sorted_int, and uses the 
Sorted{Set|Numeric}Selector for sorting. The natural order picks the minimum 
value of the list for each document and the reverse order picks the maximum.

This patch also fixes a small bug which showed up in unit tests when using 
index sorting with a reverse sort and a missing value.
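
For illustration, here is a minimal sketch (assuming this patch is applied; the field name is made up) of configuring an index sort on a multi-valued numeric field, where the ascending order uses the minimum value of each document:

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortedNumericSortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class MultiValuedIndexSort {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Ascending index sort on "price": with this patch the minimum value of
    // each document is used (a reverse sort would use the maximum).
    iwc.setIndexSort(new Sort(new SortedNumericSortField("price", SortField.Type.LONG)));
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      Document doc = new Document();
      doc.add(new SortedNumericDocValuesField("price", 7L));
      doc.add(new SortedNumericDocValuesField("price", 3L)); // multi-valued: min is 3
      writer.addDocument(doc);
    }
    dir.close();
  }
}
{code}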

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7537.patch
>
>
> Today index sorting can be done on a single-valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi-valued fields. Since index 
> sorting does not accept custom comparators we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter multi-valued DVs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-04 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636366#comment-15636366
 ] 

Ferenczi Jim commented on LUCENE-7537:
--

Oh I've already started to work on a patch with the logic described above ;)
I'll post it shortly. Thanks [~mikemccand].

> Add multi valued field support to index sorting
> ---
>
> Key: LUCENE-7537
> URL: https://issues.apache.org/jira/browse/LUCENE-7537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Ferenczi Jim
>Assignee: Michael McCandless
>
> Today index sorting can be done on a single-valued field through the 
> NumericDocValues (for numerics) and SortedDocValues (for strings).
> I'd like to add the ability to sort on multi-valued fields. Since index 
> sorting does not accept custom comparators we could just take the minimum 
> value of each document for an ascending sort and the maximum value for a 
> descending sort.
> This way we could handle all cases instead of throwing an exception during a 
> merge when we encounter multi-valued DVs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7537) Add multi valued field support to index sorting

2016-11-03 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7537:


 Summary: Add multi valued field support to index sorting
 Key: LUCENE-7537
 URL: https://issues.apache.org/jira/browse/LUCENE-7537
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Ferenczi Jim


Today index sorting can be done on a single-valued field through the 
NumericDocValues (for numerics) and SortedDocValues (for strings).
I'd like to add the ability to sort on multi-valued fields. Since index sorting 
does not accept custom comparators we could just take the minimum value of each 
document for an ascending sort and the maximum value for a descending sort.
This way we could handle all cases instead of throwing an exception during a 
merge when we encounter multi-valued DVs. 
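
As a rough sketch of what the ascending case could look like for a multi-valued string field (hypothetical usage, assuming the feature lands; the field name is made up): the smallest value of each document drives the sort, which maps onto SortedSetSelector.Type.MIN.

{code:java}
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortedSetSelector;
import org.apache.lucene.search.SortedSetSortField;

class StringIndexSortSketch {
  // Ascending index sort on the multi-valued "tags" field using the minimum
  // value of each document; a descending sort would rely on the maximum.
  static IndexWriterConfig withTagSort(IndexWriterConfig iwc) {
    SortedSetSortField sortField =
        new SortedSetSortField("tags", false, SortedSetSelector.Type.MIN);
    return iwc.setIndexSort(new Sort(sortField));
  }
}
{code}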




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7484) FastVectorHighlighter fails to highlight SynonymQuery

2016-10-10 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562279#comment-15562279
 ] 

Ferenczi Jim commented on LUCENE-7484:
--

Thanks [~mikemccand]. That was fast!

> FastVectorHighlighter fails to highlight SynonymQuery
> -
>
> Key: LUCENE-7484
> URL: https://issues.apache.org/jira/browse/LUCENE-7484
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/termvectors
>Affects Versions: 6.x, master (7.0)
>Reporter: Ferenczi Jim
> Fix For: master (7.0), 6.3
>
> Attachments: LUCENE-7484.patch
>
>
> SynonymQuery is ignored by the FastVectorHighlighter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7484) FastVectorHighlighter fails to highlight SynonymQuery

2016-10-10 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7484:
-
Attachment: LUCENE-7484.patch

> FastVectorHighlighter fails to highlight SynonymQuery
> -
>
> Key: LUCENE-7484
> URL: https://issues.apache.org/jira/browse/LUCENE-7484
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/termvectors
>Affects Versions: 6.x, master (7.0)
>Reporter: Ferenczi Jim
> Attachments: LUCENE-7484.patch
>
>
> SynonymQuery is ignored by the FastVectorHighlighter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7484) FastVectorHighlighter fails to highlight SynonymQuery

2016-10-10 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7484:


 Summary: FastVectorHighlighter fails to highlight SynonymQuery
 Key: LUCENE-7484
 URL: https://issues.apache.org/jira/browse/LUCENE-7484
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/termvectors
Affects Versions: 6.x, master (7.0)
Reporter: Ferenczi Jim


SynonymQuery is ignored by the FastVectorHighlighter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Closed] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim closed LUCENE-7423.

Resolution: Not A Problem

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted list using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.
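
As a rough illustration of the first-pass bookkeeping described above (not the patch itself, which visits the TermsEnum only once for all prefixes; names here are made up), this is how the doc IDs for a single prefix could be accumulated with a DocIdSetBuilder:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.DocIdSetBuilder;
import org.apache.lucene.util.StringHelper;

class PrefixDocIdsSketch {
  // Collects the doc IDs of every term starting with 'prefix'; the terms enum
  // is sorted, so we can stop at the first term that no longer shares the prefix.
  static DocIdSetBuilder docIdsForPrefix(Terms terms, BytesRef prefix, int maxDoc) throws IOException {
    DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
    TermsEnum termsEnum = terms.iterator();
    if (termsEnum.seekCeil(prefix) == TermsEnum.SeekStatus.END) {
      return builder; // no term is >= prefix
    }
    PostingsEnum postings = null;
    do {
      if (!StringHelper.startsWith(termsEnum.term(), prefix)) {
        break; // past the prefix range
      }
      postings = termsEnum.postings(postings, PostingsEnum.NONE);
      builder.add(postings); // adds all doc IDs of this term's postings list
    } while (termsEnum.next() != null);
    return builder;
  }
}
{code}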



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-25 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434935#comment-15434935
 ] 

Ferenczi Jim edited comment on LUCENE-7423 at 8/25/16 9:05 AM:
---

(edited since the results of the autoprefix were wrong due to a bug in the code 
that generates the prefixes)

I've added a small benchmark, AutoPrefixPerf.java (modified from [~mikemccand]'s 
utils).

For the benchmark I used the English Wikipedia titles and a standard analyzer:

{panel:title=Standard 
analyzer|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}
A single field in this test:
* "field": standard analyzer 

{noformat}
Indexed 1260: 33.756 sec
Final Indexed 12696047: 33.9 sec
Optimize...
After force merge: 37.794 sec
Close...
After close: 37.798 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=ex11gzoft89z21le5c93bpett
  1 of 1: name=_j maxDoc=12696047
version=7.0.0
id=ex11gzoft89z21le5c93bpets
codec=Lucene62
compound=false
numFiles=7
size (MB)=78.562
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472043738648}
no deletions
test: open reader.OK [took 0.002 sec]
test: check integrity.OK [took 0.046 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [1 fields] [took 0.000 sec]
test: field norms.OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...OK [2513966 terms; 34713220 terms/docs pairs; 0 
tokens] [took 2.321 sec]
  field "field":
index FST:
  699982 bytes
terms:
  2513966 terms
  20843092 bytes (8.3 bytes/term)
blocks:
  80953 blocks
  59384 terms-only blocks
  10 sub-block-only blocks
  21559 mixed blocks
  18273 floor blocks
  25611 non-floor blocks
  55342 floor sub-blocks
  13294379 term suffix bytes (164.2 suffix-bytes/block)
  2538232 term stats bytes (31.4 stats-bytes/block)
  8829391 other bytes (109.1 other-bytes/block)
  by prefix length:
 0: 5
 1: 421
 2: 5620
 3: 18794
 4: 31598
 5: 16630
 6: 5322
 7: 1709
 8: 443
 9: 138
10: 249
11: 14
12: 2
13: 6
14: 2
  
test: stored fields...OK [0 total field count; avg 0.0 fields per doc] 
[took 0.257 sec]
test: term vectorsOK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 741.9 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 683.8 KB
|-- format 'Lucene50_0' 
[BlockTreeTermsReader(fields=1,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 683.8 KB
|-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
|-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
|-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
58.1 KB
|-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
58.1 KB
|-- doc base deltas: 29.1 KB
|-- start pointer deltas: 26.6 KB

No problems were detected with this index.
{noformat}
{panel}

-{panel:title=EdgeNgram analyzer  min=2 max=5 
|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}

Two fields for this test:
* "field": standard analyzer
* field-edge: edge ngram analyzer (min=2, max=5) on top of a standard analyzer.

{noformat}
Indexed 1260: 70.831 sec
Final Indexed 12696047: 71.484 sec
Optimize...
After force merge: 80.344 sec
Close...
After close: 80.347 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=8bm8xy2peb5wo3td0ptgwv036
  1 of 1: name=_19 maxDoc=12696047
version=7.0.0
id=8bm8xy2peb5wo3td0ptgwv035
codec=Lucene62
compound=false
numFiles=7
size (MB)=224.803
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=15, os.version=10.11.4, timestamp=1472044255056}
no deletions
  

[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: LUCENE-7423.patch

I fixed another bug in the prefix creation. Most of the prefixes were missing 
in my last patch, so the latest results show a completely different trend. Sorry 
for the noise [~rcmuir], you were right: the 2-5 edge ngram is competitive and 
it beats the autoprefix by a large margin.
Here are the raw results, updated with the latest patch:

{panel:title=AutoPrefix 
minPrefixTerms=2|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}
{noformat}

Indexed 1260: 60.369 sec
Final Indexed 12696047: 60.589 sec
Optimize...
After force merge: 87.115 sec
Close...
After close: 87.121 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=2jb0oyddk8jizhc2jin5e5vwf
  1 of 1: name=_j maxDoc=12696047
version=7.0.0
id=2jb0oyddk8jizhc2jin5e5vwe
codec=Lucene62
compound=false
numFiles=7
size (MB)=300.525
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472114132102}
no deletions
test: open reader.OK [took 0.002 sec]
test: check integrity.OK [took 0.172 sec]
test: check live docs.OK [took 0.002 sec]
test: field infos.OK [2 fields] [took 0.001 sec]
test: field norms.OK [0 fields] [took 0.001 sec]
test: terms, freq, prox...OK [3928482 terms; 31646 terms/docs pairs; 0 
tokens] [took 5.478 sec]
  field "field-autoprefix":
index FST:
  401257 bytes
terms:
  1414516 terms
  10333460 bytes (7.3 bytes/term)
blocks:
  45642 blocks
  33484 terms-only blocks
  4 sub-block-only blocks
  12154 mixed blocks
  10382 floor blocks
  14302 non-floor blocks
  31340 floor sub-blocks
  6305776 term suffix bytes (138.2 suffix-bytes/block)
  1484556 term stats bytes (32.5 stats-bytes/block)
  3661060 other bytes (80.2 other-bytes/block)
  by prefix length:
 0: 6
 1: 366
 2: 3205
 3: 13054
 4: 17849
 5: 7613
 6: 2466
 7: 752
 8: 202
 9: 62
10: 58
11: 7
13: 1
14: 1
  
  field "field":
index FST:
  699971 bytes
terms:
  2513966 terms
  20843092 bytes (8.3 bytes/term)
blocks:
  80953 blocks
  59384 terms-only blocks
  10 sub-block-only blocks
  21559 mixed blocks
  18273 floor blocks
  25611 non-floor blocks
  55342 floor sub-blocks
  13294381 term suffix bytes (164.2 suffix-bytes/block)
  2538232 term stats bytes (31.4 stats-bytes/block)
  8839046 other bytes (109.2 other-bytes/block)
  by prefix length:
 0: 5
 1: 421
 2: 5620
 3: 18794
 4: 31598
 5: 16630
 6: 5322
 7: 1709
 8: 443
 9: 138
10: 249
11: 14
12: 2
13: 6
14: 2
  
test: stored fields...OK [0 total field count; avg 0.0 fields per doc] 
[took 0.304 sec]
test: term vectorsOK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.002 sec]
test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.001 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 1.1 MB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 1.1 MB
|-- format 'AutoPrefix_0' 
[BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 1.1 MB
|-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
|-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
|-- field 'field-autoprefix' 
[BlockTreeTerms(terms=1414516,postings=187518426,positions=-1,docs=12682564)]: 
392 KB
|-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 391.9 KB
|-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
61.9 KB
|-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
61.9 KB
|-- doc base deltas: 30.5 KB
|-- start pointer deltas: 29

[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: (was: LUCENE-7423.patch)

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted list using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-25 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: (was: LUCENE-7423.patch)

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted list using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-24 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: LUCENE-7423.patch

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7423.patch, LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted list using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-24 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: (was: LUCENE-7423.patch)

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7423.patch, LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted list using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-24 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434935#comment-15434935
 ] 

Ferenczi Jim edited comment on LUCENE-7423 at 8/24/16 1:32 PM:
---

Another iteration. I fixed the prefix selection (the term "aa" should not 
increment the number of terms counted for the prefix "a"). This reduces the 
index size greatly.
I've added a small benchmark, AutoPrefixPerf.java (modified from [~mikemccand]'s 
utils).

For the benchmark I used the English Wikipedia titles and a standard analyzer:

{panel:title=Standard 
analyzer|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}
A single field in this test:
* "field": standard analyzer 

{noformat}
Indexed 1260: 33.756 sec
Final Indexed 12696047: 33.9 sec
Optimize...
After force merge: 37.794 sec
Close...
After close: 37.798 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=ex11gzoft89z21le5c93bpett
  1 of 1: name=_j maxDoc=12696047
version=7.0.0
id=ex11gzoft89z21le5c93bpets
codec=Lucene62
compound=false
numFiles=7
size (MB)=78.562
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472043738648}
no deletions
test: open reader.OK [took 0.002 sec]
test: check integrity.OK [took 0.046 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [1 fields] [took 0.000 sec]
test: field norms.OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...OK [2513966 terms; 34713220 terms/docs pairs; 0 
tokens] [took 2.321 sec]
  field "field":
index FST:
  699982 bytes
terms:
  2513966 terms
  20843092 bytes (8.3 bytes/term)
blocks:
  80953 blocks
  59384 terms-only blocks
  10 sub-block-only blocks
  21559 mixed blocks
  18273 floor blocks
  25611 non-floor blocks
  55342 floor sub-blocks
  13294379 term suffix bytes (164.2 suffix-bytes/block)
  2538232 term stats bytes (31.4 stats-bytes/block)
  8829391 other bytes (109.1 other-bytes/block)
  by prefix length:
 0: 5
 1: 421
 2: 5620
 3: 18794
 4: 31598
 5: 16630
 6: 5322
 7: 1709
 8: 443
 9: 138
10: 249
11: 14
12: 2
13: 6
14: 2
  
test: stored fields...OK [0 total field count; avg 0.0 fields per doc] 
[took 0.257 sec]
test: term vectorsOK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 741.9 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 683.8 KB
|-- format 'Lucene50_0' 
[BlockTreeTermsReader(fields=1,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 683.8 KB
|-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
|-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
|-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
58.1 KB
|-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
58.1 KB
|-- doc base deltas: 29.1 KB
|-- start pointer deltas: 26.6 KB

No problems were detected with this index.
{noformat}
{panel}

{panel:title=EdgeNgram analyzer  min=2 max=5 
|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}

Two fields for this test:
* "field": standard analyzer
* field-edge: edge ngram analyzer (min=2, max=5) on top of a standard analyzer.

{noformat}
Indexed 1260: 70.831 sec
Final Indexed 12696047: 71.484 sec
Optimize...
After force merge: 80.344 sec
Close...
After close: 80.347 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=8bm8xy2peb5wo3td0ptgwv036
  1 of 1: name=_19 maxDoc=12696047
version=7.0.0
id=8bm8xy2peb5wo3td0ptgwv035
codec=Lucene62
compound=false
numFiles=7
size (MB)=224.803
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=15, o

[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-24 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: LUCENE-7423.patch

Another iteration. I fixed the prefix selection (the term "aa" should not 
increment the number of terms counted for the prefix "a"). This reduces the 
index size greatly.
I've added a small benchmark, AutoPrefixPerf.java (modified from [~mikemccand]'s 
utils).

For the benchmark I used the English Wikipedia titles and a standard analyzer:

{panel:title=Standard 
analyzer|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}
A single field in this test:
* "field": standard analyzer 

{noformat}
Indexed 1260: 33.756 sec
Final Indexed 12696047: 33.9 sec
Optimize...
After force merge: 37.794 sec
Close...
After close: 37.798 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=ex11gzoft89z21le5c93bpett
  1 of 1: name=_j maxDoc=12696047
version=7.0.0
id=ex11gzoft89z21le5c93bpets
codec=Lucene62
compound=false
numFiles=7
size (MB)=78.562
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472043738648}
no deletions
test: open reader.OK [took 0.002 sec]
test: check integrity.OK [took 0.046 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [1 fields] [took 0.000 sec]
test: field norms.OK [0 fields] [took 0.000 sec]
test: terms, freq, prox...OK [2513966 terms; 34713220 terms/docs pairs; 0 
tokens] [took 2.321 sec]
  field "field":
index FST:
  699982 bytes
terms:
  2513966 terms
  20843092 bytes (8.3 bytes/term)
blocks:
  80953 blocks
  59384 terms-only blocks
  10 sub-block-only blocks
  21559 mixed blocks
  18273 floor blocks
  25611 non-floor blocks
  55342 floor sub-blocks
  13294379 term suffix bytes (164.2 suffix-bytes/block)
  2538232 term stats bytes (31.4 stats-bytes/block)
  8829391 other bytes (109.1 other-bytes/block)
  by prefix length:
 0: 5
 1: 421
 2: 5620
 3: 18794
 4: 31598
 5: 16630
 6: 5322
 7: 1709
 8: 443
 9: 138
10: 249
11: 14
12: 2
13: 6
14: 2
  
test: stored fields...OK [0 total field count; avg 0.0 fields per doc] 
[took 0.257 sec]
test: term vectorsOK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
test: points..OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 741.9 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 683.8 KB
|-- format 'Lucene50_0' 
[BlockTreeTermsReader(fields=1,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 683.8 KB
|-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
|-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
|-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
58.1 KB
|-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
58.1 KB
|-- doc base deltas: 29.1 KB
|-- start pointer deltas: 26.6 KB

No problems were detected with this index.
{noformat}
{panel}

{panel:title=EdgeNgram analyzer  min=2 max=5 
|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}

Two fields for this test:
* "field": standard analyzer
* field-edge: edge ngram analyzer (min=2, max=5) on top of a standard analyzer.

{noformat}
Indexed 1260: 70.831 sec
Final Indexed 12696047: 71.484 sec
Optimize...
After force merge: 80.344 sec
Close...
After close: 80.347 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=8bm8xy2peb5wo3td0ptgwv036
  1 of 1: name=_19 maxDoc=12696047
version=7.0.0
id=8bm8xy2peb5wo3td0ptgwv035
codec=Lucene62
compound=false
numFiles=7
size (MB)=224.803
diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=15, os.version=10.11.4, timestamp=1472044255056}
no deletions
test: o

[jira] [Commented] (LUCENE-7317) Remove auto prefix terms

2016-08-23 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433832#comment-15433832
 ] 

Ferenczi Jim commented on LUCENE-7317:
--

Sorry for the late reply. Yep min=1/max=2B is not a reasonable setting but I 
have similar results with min=1/max=20 so I think it is worth investigating.
I opened https://issues.apache.org/jira/browse/LUCENE-7423 which re-implements 
the auto prefix in a new PostingsFormat that builds the prefixes in two pass 
like the previous implementation. The nice thing is that it avoids the 
combinatorial explosion  that affected the previous implementation where we 
needed to visit all the matching terms for each prefix.

> Remove auto prefix terms
> 
>
> Key: LUCENE-7317
> URL: https://issues.apache.org/jira/browse/LUCENE-7317
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-7317.patch
>
>
> This was mostly superseded by the new points API so should we remove 
> auto-prefix terms?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-23 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: LUCENE-7423.patch

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix queries 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted list using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-23 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: (was: LUCENE-7423.patch)

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix query 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two pass:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted lists using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree, its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization, but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-23 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
-
Attachment: LUCENE-7423.patch

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---
>
> Key: LUCENE-7423
> URL: https://issues.apache.org/jira/browse/LUCENE-7423
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/sandbox
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix query 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two passes:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted lists using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree, its inverted list is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there is room for optimization, but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

2016-08-23 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7423:


 Summary: AutoPrefixPostingsFormat: a PostingsFormat optimized for 
prefix queries on text fields.
 Key: LUCENE-7423
 URL: https://issues.apache.org/jira/browse/LUCENE-7423
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/sandbox
Reporter: Ferenczi Jim
Priority: Minor


The autoprefix terms dict added in 
https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
https://issues.apache.org/jira/browse/LUCENE-7317.

The new points API is now used to do efficient range queries but the 
replacement for prefix string queries is unclear. The edge ngrams could be used 
instead but they have a lot of drawbacks and are hard to configure correctly. 
The completion postings format is also a good replacement but it requires a big 
FST in RAM and it cannot be intersected with other fields. 

This patch is a proposal for a new PostingsFormat optimized for prefix query on 
string fields. It detects prefixes that match "enough" terms and writes 
auto-prefix terms into their own virtual field.
 At search time the virtual field is used to speed up prefix queries that match 
"enough" terms.
The auto-prefix terms are built in two passes:
* The first pass builds a compact prefix tree. Since the terms enum is sorted 
the prefixes are flushed on the fly depending on the input. For each prefix we 
build its corresponding inverted lists using a DocIdSetBuilder. The first pass 
visits each term of the field TermsEnum only once. When a prefix is flushed 
from the prefix tree, its inverted list is dumped into a temporary file for 
further use. This is necessary since the prefixes are not sorted when they are 
removed from the tree. The selected auto prefixes are sorted at the end of the 
first pass.

* The second pass is a sorted scan of the prefixes and the temporary file is 
used to read the corresponding inverted lists.

The patch is just a POC and there is room for optimization, but the first 
results are promising:
I tested the patch with the geonames dataset. I indexed all the titles with the 
KeywordAnalyzer and compared the index/merge time and the size of the indices. 

The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
572M on disk and it took 130s to index and optimize the 11M titles. 

The auto prefix index takes 287M on disk and took 70s to index and optimize the 
same 11M titles. Among the 287M, only 170M are used for the auto prefix fields 
and the rest is for the regular keyword field. All the auto prefixes were 
generated for this test (at least 2 terms per auto-prefix).  

The queries have similar performance since we are sure on both sides that one 
inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7317) Remove auto prefix terms

2016-08-18 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426601#comment-15426601
 ] 

Ferenczi Jim commented on LUCENE-7317:
--

I wanted to see what we're losing with the removal of the AutoPrefix so I ran 
a small test with English Wikipedia titles.

I indexed the 12M titles in three indices:
* *default*: keyword analyzer and the default postings format
* *auto_prefix*: keyword analyzer and the AutoPrefixPostings format with 
minAutoPrefix=24, maxAutoPrefix=Integer.MAX
* *edge*: edge ngram analyzer with minGram=1, maxGram=Integer.MAX and the 
default postings format. 

||index||default||auto_prefix||edge||
|size on disk|231 MB|274 MB|1600 MB|

This table shows the size that each index takes on disk. As you can see the 
auto_prefix index is very close to the size of the default one even though we 
compute all the prefixes with more than 24 terms. Compared to the edge_ngram 
index, which multiplies the index size by a factor of 7, the auto prefix seems 
to be a good trade-off for fields where prefix queries are the norm. I didn't 
compare the query time, but any prefix with more than 24 terms can be resolved 
by one inverted list in the auto_prefix index, so it is equivalent to the 
edge_ngram index in that respect. 
The downside of the auto_prefix seems to be the merge: it takes more than 1 
minute to optimize, which is 10 times slower than the default index. Though 
this is expected since the default index uses a keyword analyzer. 

I understand that the new points API is better for numeric prefix/range 
queries, but the auto prefix seems to be a good fit for string prefix queries. 
It saves a lot of space compared to edge ngrams and the indexing is faster. I 
am not saying we should restore the functionality inside the default 
BlockTreeTerms, but maybe we could create a separate postings format that 
exposes this feature?
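For reference, here is a sketch of the kind of analysis setup assumed in this comparison 
(Lucene 6.x classes; the maxGram of 255 stands in for the Integer.MAX setting above, and 
the class name is invented for the example):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;

class PrefixBenchAnalyzers {
  static final Analyzer KEYWORD = new KeywordAnalyzer();   // default and auto_prefix indices

  static final Analyzer EDGE = new Analyzer() {            // edge index
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new KeywordTokenizer();
      // Emits every leading prefix of the title as a separate token, which is what
      // blows the edge index up to roughly 7x the size of the default index.
      TokenStream sink = new EdgeNGramTokenFilter(source, 1, 255);
      return new TokenStreamComponents(source, sink);
    }
  };
}
{code}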


> Remove auto prefix terms
> 
>
> Key: LUCENE-7317
> URL: https://issues.apache.org/jira/browse/LUCENE-7317
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-7317.patch
>
>
> This was mostly superseded by the new points API so should we remove 
> auto-prefix terms?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query

2016-06-20 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339090#comment-15339090
 ] 

Ferenczi Jim edited comment on LUCENE-7337 at 6/20/16 7:50 AM:
---

Wooo thanks [~mikemccand]

??I think getting proper distributed queries working is really out of scope 
here: that would really require a distributed rewrite to work correctly.??

Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway 
and I think it's more important to make the empty-clause boolean query behave 
exactly the same as the MatchNoDocsQuery. 


was (Author: jim.ferenczi):
Wooo thanks [~mikemccand]

??I think getting proper distributed queries working is really out of
scope here: that would really require a distributed rewrite to work
correctly.??

Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway 
and I think it's more important to make empty-clause boolean query behaves 
exactly the same as the MatchNoDocsQuery. 

> MultiTermQuery are sometimes rewritten into an empty boolean query
> --
>
> Key: LUCENE-7337
> URL: https://issues.apache.org/jira/browse/LUCENE-7337
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7337.patch
>
>
> A MultiTermQuery is sometimes rewritten to an empty boolean query (depending 
> on the rewrite method); this can happen when no expansions are found for a 
> fuzzy query, for instance.
> It can be problematic when the multi term query is boosted. 
> For instance consider the following query:
> `((title:bar~1)^100 text:bar)`
> This is a boolean query with two optional clauses. The first one is a fuzzy 
> query on the field title with a boost of 100. 
> If there is no expansion for "title:bar~1" the query is rewritten into:
> `(()^100 text:bar)`
> ... and when expansions are found:
> `((title:bars | title:bar)^100 text:bar)`
> The scoring of those two queries will differ because the normalization factor 
> and the norm for the first query will be equal to 1 (the boost is ignored 
> because the empty boolean query is not taken into account for the computation 
> of the normalization factor) whereas the second query will have a 
> normalization factor of 10,000 (100*100) and a norm equal to 0.01. 
> This kind of discrepancy can happen in a single index because the expansions 
> for the fuzzy query are done at the segment level. It can also happen when 
> multiple indices are requested (Solr/ElasticSearch case).
> A simple fix would be to replace the empty boolean query produced by the 
> multi term query with a MatchNoDocsQuery but I am not sure that it's the best 
> way to fix it. WDYT?
>  
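A minimal way to reproduce the rewrite described above, assuming Lucene 6.x. The index 
contains a single document with only a "text" field, so the fuzzy clause on "title" has 
no expansions; the exact toString() output depends on the version and rewrite method, but 
it comes out as something like the (()^100 text:bar) form quoted above.

{code:java}
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class EmptyRewriteDemo {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("text", "bar", Field.Store.NO));   // no "title" field at all
      w.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new BooleanQuery.Builder()
          .add(new BoostQuery(new FuzzyQuery(new Term("title", "bar"), 1), 100f),
               BooleanClause.Occur.SHOULD)
          .add(new TermQuery(new Term("text", "bar")), BooleanClause.Occur.SHOULD)
          .build();
      // The fuzzy clause finds no terms to expand to, so it rewrites to an empty
      // boolean query and its boost no longer contributes to the query norm.
      System.out.println(searcher.rewrite(query));
    }
  }
}
{code}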



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query

2016-06-20 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339090#comment-15339090
 ] 

Ferenczi Jim edited comment on LUCENE-7337 at 6/20/16 7:47 AM:
---

Wooo thanks [~mikemccand]

??I think getting proper distributed queries working is really out of
scope here: that would really require a distributed rewrite to work
correctly.??

Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway 
and I think it's more important to make the empty-clause boolean query behave 
exactly the same as the MatchNoDocsQuery. 


was (Author: jim.ferenczi):
Wooo thanks [~mikemccand]

?I think getting proper distributed queries working is really out of
scope here: that would really require a distributed rewrite to work
correctly.?

Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway 
and I think it's more important to make empty-clause boolean query behaves 
exactly the same as the MatchNoDocsQuery. 

> MultiTermQuery are sometimes rewritten into an empty boolean query
> --
>
> Key: LUCENE-7337
> URL: https://issues.apache.org/jira/browse/LUCENE-7337
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7337.patch
>
>
> A MultiTermQuery is sometimes rewritten to an empty boolean query (depending 
> on the rewrite method); this can happen when no expansions are found for a 
> fuzzy query, for instance.
> It can be problematic when the multi term query is boosted. 
> For instance consider the following query:
> `((title:bar~1)^100 text:bar)`
> This is a boolean query with two optional clauses. The first one is a fuzzy 
> query on the field title with a boost of 100. 
> If there is no expansion for "title:bar~1" the query is rewritten into:
> `(()^100 text:bar)`
> ... and when expansions are found:
> `((title:bars | title:bar)^100 text:bar)`
> The scoring of those two queries will differ because the normalization factor 
> and the norm for the first query will be equal to 1 (the boost is ignored 
> because the empty boolean query is not taken into account for the computation 
> of the normalization factor) whereas the second query will have a 
> normalization factor of 10,000 (100*100) and a norm equal to 0.01. 
> This kind of discrepancy can happen in a single index because the expansions 
> for the fuzzy query are done at the segment level. It can also happen when 
> multiple indices are requested (Solr/ElasticSearch case).
> A simple fix would be to replace the empty boolean query produced by the 
> multi term query with a MatchNoDocsQuery but I am not sure that it's the best 
> way to fix it. WDYT?
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query

2016-06-20 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339090#comment-15339090
 ] 

Ferenczi Jim commented on LUCENE-7337:
--

Wooo thanks [~mikemccand]

??I think getting proper distributed queries working is really out of
scope here: that would really require a distributed rewrite to work
correctly.??

Agreed. Returning 1 or 0 for the queryNorm would not solve the problem anyway 
and I think it's more important to make the empty-clause boolean query behave 
exactly the same as the MatchNoDocsQuery. 

> MultiTermQuery are sometimes rewritten into an empty boolean query
> --
>
> Key: LUCENE-7337
> URL: https://issues.apache.org/jira/browse/LUCENE-7337
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
> Attachments: LUCENE-7337.patch
>
>
> A MultiTermQuery is sometimes rewritten to an empty boolean query (depending 
> on the rewrite method); this can happen when no expansions are found for a 
> fuzzy query, for instance.
> It can be problematic when the multi term query is boosted. 
> For instance consider the following query:
> `((title:bar~1)^100 text:bar)`
> This is a boolean query with two optional clauses. The first one is a fuzzy 
> query on the field title with a boost of 100. 
> If there is no expansion for "title:bar~1" the query is rewritten into:
> `(()^100 text:bar)`
> ... and when expansions are found:
> `((title:bars | title:bar)^100 text:bar)`
> The scoring of those two queries will differ because the normalization factor 
> and the norm for the first query will be equal to 1 (the boost is ignored 
> because the empty boolean query is not taken into account for the computation 
> of the normalization factor) whereas the second query will have a 
> normalization factor of 10,000 (100*100) and a norm equal to 0.01. 
> This kind of discrepancy can happen in a single index because the expansions 
> for the fuzzy query are done at the segment level. It can also happen when 
> multiple indices are requested (Solr/ElasticSearch case).
> A simple fix would be to replace the empty boolean query produced by the 
> multi term query with a MatchNoDocsQuery but I am not sure that it's the best 
> way to fix it. WDYT?
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query

2016-06-14 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15329457#comment-15329457
 ] 

Ferenczi Jim commented on LUCENE-7337:
--

??A simple fix would be to replace the empty boolean query produced by the 
multi term query with a MatchNoDocsQuery but I am not sure that it's the best 
way to fix.??

I am not sure of this statement anymore. Conceptually a MatchNoDocsQuery and a 
BooleanQuery with no clauses are similar. Though what I proposed assumed that 
the value for normalization of the MatchNoDocsQuery is 1. I think that doing 
this would bring confusion since this value is supposed to reflect the max 
score that the query can get (which is 0 in this case). Currently a boolean 
query or a disjunction query with no clauses returns 0 for the normalization. 
I think it's the expected behavior even though this breaks the distributed 
case as explained in my previous comment. 
For empty queries that are the result of an expansion (multi term query) maybe 
we could add yet another special query, something like MatchNoExpansionQuery 
that would use a ConstantScoreWeight? I am proposing this because it would 
make the distinction between a query that matches no documents no matter what 
the context is and a query that matches no documents because of the context 
(useful for the distributed case).
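A purely hypothetical sketch of what such a MatchNoExpansionQuery could look like, assuming 
the Lucene 6.x Query/Weight signatures. Nothing like this exists in Lucene; the point is 
only that a ConstantScoreWeight whose scorer is null matches nothing while still reporting 
1 as its value for normalization, unlike an empty boolean query.

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ConstantScoreWeight;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

public final class MatchNoExpansionQuery extends Query {
  @Override
  public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
    return new ConstantScoreWeight(this) {
      @Override
      public Scorer scorer(LeafReaderContext context) throws IOException {
        return null;  // no documents match, in any segment
      }
    };
  }

  @Override
  public String toString(String field) {
    return "MatchNoExpansionQuery";
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof MatchNoExpansionQuery;
  }

  @Override
  public int hashCode() {
    return getClass().hashCode();
  }
}
{code}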

> MultiTermQuery are sometimes rewritten into an empty boolean query
> --
>
> Key: LUCENE-7337
> URL: https://issues.apache.org/jira/browse/LUCENE-7337
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
>
> A MultiTermQuery is sometimes rewritten to an empty boolean query (depending 
> on the rewrite method); this can happen when no expansions are found for a 
> fuzzy query, for instance.
> It can be problematic when the multi term query is boosted. 
> For instance consider the following query:
> `((title:bar~1)^100 text:bar)`
> This is a boolean query with two optional clauses. The first one is a fuzzy 
> query on the field title with a boost of 100. 
> If there is no expansion for "title:bar~1" the query is rewritten into:
> `(()^100 text:bar)`
> ... and when expansions are found:
> `((title:bars | title:bar)^100 text:bar)`
> The scoring of those two queries will differ because the normalization factor 
> and the norm for the first query will be equal to 1 (the boost is ignored 
> because the empty boolean query is not taken into account for the computation 
> of the normalization factor) whereas the second query will have a 
> normalization factor of 10,000 (100*100) and a norm equal to 0.01. 
> This kind of discrepancy can happen in a single index because the expansions 
> for the fuzzy query are done at the segment level. It can also happen when 
> multiple indices are requested (Solr/ElasticSearch case).
> A simple fix would be to replace the empty boolean query produced by the 
> multi term query with a MatchNoDocsQuery but I am not sure that it's the best 
> way to fix it. WDYT?
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery

2016-06-13 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327946#comment-15327946
 ] 

Ferenczi Jim edited comment on LUCENE-7276 at 6/13/16 7:02 PM:
---

??Somehow the test is angry that the rewritten query scores differently from 
the original ... so somehow the fact that we no longer rewrite to an empty BQ 
is changing something ... I'll dig.??

I tried to find a reason and I think I found something interesting. The change 
is related to the normalization factor and the fact that those queries are 
boosted. When you use a boolean query with no clauses the normalization factor 
is 0, whereas when the MatchNoDocsQuery is used the normalization factor is 1 
(BooleanWeight.getValueForNormalization and 
ConstantScoreWeight.getValueForNormalization).
This part of the query is supposed to return no documents so it should be ok to 
ignore it when the query norm is computed. Though for the distributed case, 
where results are merged from different shards, there is no guarantee that the 
rewrite will be the same among the shards. 
I think we can get rid of the MatchNoDocsQuery vs empty boolean query 
difference if we change the return value of 
BooleanWeight.getValueForNormalization to be 1 (instead of 0) when there are no 
clauses.

https://issues.apache.org/jira/browse/LUCENE-7337


was (Author: jim.ferenczi):
??
Somehow the test is angry that the rewritten query scores differently from the 
original ... so somehow the fact that we no longer rewrite to an empty BQ is 
changing something ... I'll dig.
??

I tried to find a reason and I think I found something interesting. The change 
is related to the normalization factor and the fact that those queries are 
boosted. When you use a boolean query with no clause the normalization factor 
is 0, when the matchnodocs query is used the normalization factor is 1 
(BooleanWeight.getValueForNormalization and 
ConstantScoreWeight.getValueForNormalization).
This part of the query is supposed to return no documents so it should be ok to 
ignore it when the query norm is computed. Though for the distributed case 
where results are merged from different shards there is no guarantee that the 
rewrite will be the same among the shards. 
I think we can get rid of the matchnodocsquery vs empty boolean query 
difference if we change the return value of  
BooleanWeight.getValueForNormalization to be 1 (instead of 0) when there is no 
clause.

https://issues.apache.org/jira/browse/LUCENE-7337

> Add an optional reason to the MatchNoDocsQuery
> --
>
> Key: LUCENE-7276
> URL: https://issues.apache.org/jira/browse/LUCENE-7276
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
>  Labels: patch
> Attachments: LUCENE-7276.patch, LUCENE-7276.patch, LUCENE-7276.patch, 
> LUCENE-7276.patch, LUCENE-7276.patch
>
>
> It's sometimes difficult to debug a query that results in a MatchNoDocsQuery. 
> The MatchNoDocsQuery is always rewritten into an empty boolean query.
> This patch adds an optional reason and implements a weight in order to keep 
> track of the reason why the query did not match any document. The reason is 
> printed by toString and in the explanation returned for a non-matching 
> document.  
> For instance the query:
> new MatchNoDocsQuery("field 'title' not found").toString()
> => 'MatchNoDocsQuery["field 'title' not found"]'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery

2016-06-13 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327946#comment-15327946
 ] 

Ferenczi Jim commented on LUCENE-7276:
--

??
Somehow the test is angry that the rewritten query scores differently from the 
original ... so somehow the fact that we no longer rewrite to an empty BQ is 
changing something ... I'll dig.
??

I tried to find a reason and I think I found something interesting. The change 
is related to the normalization factor and the fact that those queries are 
boosted. When you use a boolean query with no clauses the normalization factor 
is 0, whereas when the MatchNoDocsQuery is used the normalization factor is 1 
(BooleanWeight.getValueForNormalization and 
ConstantScoreWeight.getValueForNormalization).
This part of the query is supposed to return no documents so it should be ok to 
ignore it when the query norm is computed. Though for the distributed case, 
where results are merged from different shards, there is no guarantee that the 
rewrite will be the same among the shards. 
I think we can get rid of the MatchNoDocsQuery vs empty boolean query 
difference if we change the return value of 
BooleanWeight.getValueForNormalization to be 1 (instead of 0) when there are no 
clauses.

https://issues.apache.org/jira/browse/LUCENE-7337

> Add an optional reason to the MatchNoDocsQuery
> --
>
> Key: LUCENE-7276
> URL: https://issues.apache.org/jira/browse/LUCENE-7276
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
>  Labels: patch
> Attachments: LUCENE-7276.patch, LUCENE-7276.patch, LUCENE-7276.patch, 
> LUCENE-7276.patch, LUCENE-7276.patch
>
>
> It's sometimes difficult to debug a query that results in a MatchNoDocsQuery. 
> The MatchNoDocsQuery is always rewritten into an empty boolean query.
> This patch adds an optional reason and implements a weight in order to keep 
> track of the reason why the query did not match any document. The reason is 
> printed by toString and in the explanation returned for a non-matching 
> document.  
> For instance the query:
> new MatchNoDocsQuery("field 'title' not found").toString()
> => 'MatchNoDocsQuery["field 'title' not found"]'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7337) MultiTermQuery are sometimes rewritten into an empty boolean query

2016-06-13 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7337:


 Summary: MultiTermQuery are sometimes rewritten into an empty 
boolean query
 Key: LUCENE-7337
 URL: https://issues.apache.org/jira/browse/LUCENE-7337
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Reporter: Ferenczi Jim
Priority: Minor


A MultiTermQuery is sometimes rewritten to an empty boolean query (depending on 
the rewrite method); this can happen when no expansions are found for a fuzzy 
query, for instance.
It can be problematic when the multi term query is boosted. 
For instance consider the following query:

`((title:bar~1)^100 text:bar)`

This is a boolean query with two optional clauses. The first one is a fuzzy 
query on the field title with a boost of 100. 
If there is no expansion for "title:bar~1" the query is rewritten into:

`(()^100 text:bar)`

... and when expansions are found:

`((title:bars | title:bar)^100 text:bar)`

The scoring of those two queries will differ because the normalization factor 
and the norm for the first query will be equal to 1 (the boost is ignored 
because the empty boolean query is not taken into account for the computation 
of the normalization factor) whereas the second query will have a normalization 
factor of 10,000 (100*100) and a norm equal to 0.01. 

This kind of discrepancy can happen in a single index because the expansions 
for the fuzzy query are done at the segment level. It can also happen when 
multiple indices are requested (Solr/ElasticSearch case).

A simple fix would be to replace the empty boolean query produced by the multi 
term query with a MatchNoDocsQuery but I am not sure that it's the best way to 
fix it. WDYT?
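For readers following the numbers above, here is a worked version of the arithmetic, 
assuming the classic TF-IDF normalization queryNorm = 1 / sqrt(sumOfSquaredWeights) and 
ignoring idf so that the boost is the only weight. Plain arithmetic, not Lucene API calls.

{code:java}
public class QueryNormExample {
  public static void main(String[] args) {
    float boost = 100f;
    float sumOfSquaredWeights = boost * boost;            // 10,000 when the fuzzy clause expands
    float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    System.out.println(sumOfSquaredWeights + " " + queryNorm);  // 10000.0 0.01
    // When the clause rewrites to an empty boolean query its weight is skipped,
    // the boost never enters the sum and the resulting norm is 1 instead of 0.01.
  }
}
{code}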
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery

2016-05-09 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7276:
-
Attachment: LUCENE-7276.patch

Patch available.

> Add an optional reason to the MatchNoDocsQuery
> --
>
> Key: LUCENE-7276
> URL: https://issues.apache.org/jira/browse/LUCENE-7276
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
>  Labels: patch
> Attachments: LUCENE-7276.patch
>
>
> It's sometimes difficult to debug a query that results in a MatchNoDocsQuery. 
> The MatchNoDocsQuery is always rewritten into an empty boolean query.
> This patch adds an optional reason and implements a weight in order to keep 
> track of the reason why the query did not match any document. The reason is 
> printed by toString and in the explanation returned for a non-matching 
> document.  
> For instance the query:
> new MatchNoDocsQuery("field 'title' not found").toString()
> => 'MatchNoDocsQuery["field 'title' not found"]'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery

2016-05-09 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-7276:


 Summary: Add an optional reason to the MatchNoDocsQuery
 Key: LUCENE-7276
 URL: https://issues.apache.org/jira/browse/LUCENE-7276
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Ferenczi Jim
Priority: Minor


It's sometimes difficult to debug a query that results in a MatchNoDocsQuery. 
The MatchNoDocsQuery is always rewritten into an empty boolean query.
This patch adds an optional reason and implements a weight in order to keep 
track of the reason why the query did not match any document. The reason is 
printed by toString and in the explanation returned for a non-matching 
document.  
For instance the query:
new MatchNoDocsQuery("field 'title' not found").toString()
=> 'MatchNoDocsQuery["field 'title' not found"]'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.

2016-01-18 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105867#comment-15105867
 ] 

Ferenczi Jim edited comment on LUCENE-6972 at 1/18/16 9:58 PM:
---

Rewrote the patch after [~rcmuir]'s suggestion about checking the coordination 
factor instead. [~jpountz] can you check?


was (Author: jim.ferenczi):
Rewrote the patch after @rcmuir suggestion about checking coordination factor 
instead. [~jpountz] can you check ?  

> QueryBuilder should not differentiate single position and multiple positions 
> queries when the analyzer produces synonyms.  
> ---
>
> Key: LUCENE-6972
> URL: https://issues.apache.org/jira/browse/LUCENE-6972
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.4, 5.5
>Reporter: Ferenczi Jim
>Assignee: Adrien Grand
> Fix For: 5.5
>
> Attachments: LUCENE-6972.patch, LUCENE-6972.patch
>
>
> When synonyms are involved the QueryBuilder differentiates between two cases. When 
> there is only one position the query is composed of one BooleanQuery which 
> contains multiple should clauses. This does not interact well when trying to 
> apply a minimum_should_match to the query. For instance if a field has a 
> synonym rule like "foo,bar" the query "foo" will produce:
> bq. (foo bar)
> ... two optional clauses at the root level. If we apply a minimum should 
> match of 50% then the query becomes:
> bq. (foo bar)~1 
> This seems wrong, the terms are at the same position.
> IMO the querybuilder should produce the following query:
> bq. ((foo bar))
> ... and a minimum should match of 50% should not be applicable to a query 
> with only one optional clause at the root level.
> The case with multiple positions works as expected. 
> The user query "test foo" generates:
> bq. (test (foo bar)) 
> ... and if we apply a minimum should match of 50%:
> bq. (test (foo bar))~1
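A small sketch of the two query shapes discussed above, built directly with 
BooleanQuery.Builder (the field name "f" and the class name are invented for the example):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class SynonymMmExample {
  static Query flatWithMm() {
    // The current shape for "foo" with the synonym rule "foo,bar", after a client
    // applies minimum_should_match=50%: (foo bar)~1. The two synonyms count as two
    // independent optional clauses even though they occupy a single position.
    return new BooleanQuery.Builder()
        .add(new TermQuery(new Term("f", "foo")), Occur.SHOULD)
        .add(new TermQuery(new Term("f", "bar")), Occur.SHOULD)
        .setMinimumNumberShouldMatch(1)
        .build();
  }

  static Query nested() {
    // The shape proposed in the issue: ((foo bar)) — one optional clause at the root,
    // so a percentage-based minimum_should_match leaves the synonym group alone.
    Query synonyms = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("f", "foo")), Occur.SHOULD)
        .add(new TermQuery(new Term("f", "bar")), Occur.SHOULD)
        .build();
    return new BooleanQuery.Builder()
        .add(synonyms, Occur.SHOULD)
        .build();
  }
}
{code}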



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.

2016-01-18 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-6972:
-
Attachment: LUCENE-6972.patch

Rewrote the patch after @rcmuir's suggestion about checking the coordination 
factor instead. [~jpountz] can you check?

> QueryBuilder should not differentiate single position and multiple positions 
> queries when the analyzer produces synonyms.  
> ---
>
> Key: LUCENE-6972
> URL: https://issues.apache.org/jira/browse/LUCENE-6972
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.4, 5.5
>Reporter: Ferenczi Jim
>Assignee: Adrien Grand
> Fix For: 5.5
>
> Attachments: LUCENE-6972.patch, LUCENE-6972.patch
>
>
> When synonyms are involved the QueryBuilder differentiates between two cases. When 
> there is only one position the query is composed of one BooleanQuery which 
> contains multiple should clauses. This does not interact well when trying to 
> apply a minimum_should_match to the query. For instance if a field has a 
> synonym rule like "foo,bar" the query "foo" will produce:
> bq. (foo bar)
> ... two optional clauses at the root level. If we apply a minimum should 
> match of 50% then the query becomes:
> bq. (foo bar)~1 
> This seems wrong, the terms are at the same position.
> IMO the querybuilder should produce the following query:
> bq. ((foo bar))
> ... and a minimum should match of 50% should not be applicable to a query 
> with only one optional clause at the root level.
> The case with multiple positions works as expected. 
> The user query "test foo" generates:
> bq. (test (foo bar)) 
> ... and if we apply a minimum should match of 50%:
> bq. (test (foo bar))~1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.

2016-01-13 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096811#comment-15096811
 ] 

Ferenczi Jim commented on LUCENE-6972:
--

Sorry, I forgot to add a comment about merging into trunk. I don't think it is 
needed because the new SynonymQuery packs all the synonyms into the same query, 
so minimum should match is not affected. Though there would be things to do 
(not in this issue) to handle single-word synonyms that appear in a 
multi-position query with a SynonymQuery (analyzeMultiBoolean does not use the 
SynonymQuery).
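For reference, this is roughly what "packs all the synonyms into the same query" looks 
like with the trunk SynonymQuery, assuming its 6.x varargs constructor (the field name 
and helper class are invented for the example):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;

class SynonymQueryExample {
  // One query for one position: the synonyms are scored as if they were a single term,
  // so a minimum-should-match setting never sees them as separate optional clauses.
  static Query forFooBar(String field) {
    return new SynonymQuery(new Term(field, "foo"), new Term(field, "bar"));
  }
}
{code}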

> QueryBuilder should not differentiate single position and multiple positions 
> queries when the analyzer produces synonyms.  
> ---
>
> Key: LUCENE-6972
> URL: https://issues.apache.org/jira/browse/LUCENE-6972
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.4, 5.5
>Reporter: Ferenczi Jim
>Assignee: Adrien Grand
> Fix For: 5.5
>
> Attachments: LUCENE-6972.patch, LUCENE-6972.patch
>
>
> When synonyms are involved the QueryBuilder differentiates between two cases. When 
> there is only one position the query is composed of one BooleanQuery which 
> contains multiple should clauses. This does not interact well when trying to 
> apply a minimum_should_match to the query. For instance if a field has a 
> synonym rule like "foo,bar" the query "foo" will produce:
> bq. (foo bar)
> ... two optional clauses at the root level. If we apply a minimum should 
> match of 50% then the query becomes:
> bq. (foo bar)~1 
> This seems wrong, the terms are at the same position.
> IMO the querybuilder should produce the following query:
> bq. ((foo bar))
> ... and a minimum should match of 50% should not be applicable to a query 
> with only one optional clause at the root level.
> The case with multiple positions works as expected. 
> The user query "test foo" generates:
> bq. (test (foo bar)) 
> ... and if we apply a minimum should match of 50%:
> bq. (test (foo bar))~1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.

2016-01-11 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-6972:
-
Lucene Fields: New,Patch Available  (was: New)

> QueryBuilder should not differentiate single position and multiple positions 
> queries when the analyzer produces synonyms.  
> ---
>
> Key: LUCENE-6972
> URL: https://issues.apache.org/jira/browse/LUCENE-6972
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.4, 5.5
>Reporter: Ferenczi Jim
> Fix For: 5.5
>
> Attachments: LUCENE-6972.patch
>
>
> When synonyms are involved the QueryBuilder differentiates between two cases. When 
> there is only one position the query is composed of one BooleanQuery which 
> contains multiple should clauses. This does not interact well when trying to 
> apply a minimum_should_match to the query. For instance if a field has a 
> synonym rule like "foo,bar" the query "foo" will produce:
> bq. (foo bar)
> ... two optional clauses at the root level. If we apply a minimum should 
> match of 50% then the query becomes:
> bq. (foo bar)~1 
> This seems wrong, the terms are at the same position.
> IMO the querybuilder should produce the following query:
> bq. ((foo bar))
> ... and a minimum should match of 50% should not be applicable to a query 
> with only one optional clause at the root level.
> The case with multiple positions works as expected. 
> The user query "test foo" generates:
> bq. (test (foo bar)) 
> ... and if we apply a minimum should match of 50%:
> bq. (test (foo bar))~1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.

2016-01-11 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-6972:
-
Attachment: LUCENE-6972.patch

> QueryBuilder should not differentiate single position and multiple positions 
> queries when the analyzer produces synonyms.  
> ---
>
> Key: LUCENE-6972
> URL: https://issues.apache.org/jira/browse/LUCENE-6972
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.4, 5.5
>Reporter: Ferenczi Jim
> Fix For: 5.5
>
> Attachments: LUCENE-6972.patch
>
>
> When synonyms are involved the QueryBuilder differentiates between two cases. When 
> there is only one position the query is composed of one BooleanQuery which 
> contains multiple should clauses. This does not interact well when trying to 
> apply a minimum_should_match to the query. For instance if a field has a 
> synonym rule like "foo,bar" the query "foo" will produce:
> bq. (foo bar)
> ... two optional clauses at the root level. If we apply a minimum should 
> match of 50% then the query becomes:
> bq. (foo bar)~1 
> This seems wrong, the terms are at the same position.
> IMO the querybuilder should produce the following query:
> bq. ((foo bar))
> ... and a minimum should match of 50% should not be applicable to a query 
> with only one optional clause at the root level.
> The case with multiple positions works as expected. 
> The user query "test foo" generates:
> bq. (test (foo bar)) 
> ... and if we apply a minimum should match of 50%:
> bq. (test (foo bar))~1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.

2016-01-11 Thread Ferenczi Jim (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-6972:
-
Description: 
When synonyms are involved the QueryBuilder differentiates between two cases. When there 
is only one position the query is composed of one BooleanQuery which contains 
multiple should clauses. This does not interact well when trying to apply a 
minimum_should_match to the query. For instance if a field has a synonym rule 
like "foo,bar" the query "foo" will produce:
bq. (foo bar)
... two optional clauses at the root level. If we apply a minimum should match 
of 50% then the query becomes:
bq. (foo bar)~1 
This seems wrong, the terms are at the same position.
IMO the querybuilder should produce the following query:
bq. ((foo bar))
... and a minimum should match of 50% should not be applicable to a query with 
only one optional clause at the root level.
The case with multiple positions works as expected. 
The user query "test foo" generates:
bq. (test (foo bar)) 
... and if we apply a minimum should match of 50%:
bq. (test (foo bar))~1

  was:
When synonyms are involved the querybuilder differentiate two cases. When there 
is only one position the query is composed of one BooleanQuery which contains 
multiple should clauses. This does not interact well when trying to apply a 
minimum_should_match to the query. For instance if a field has a synonym rule 
like "foo,bar" the query "foo" will produce:
"(foo bar)"
... two optional clauses at the root level. If we apply a minimum should match 
of 50% then the query becomes:
"(foo bar)~1". 
This seems wrong, the terms are at the same position.
IMO the querybuilder should produce the following query:
"((foo bar))" 
... and a minimum should match of 50% should be not applicable to a query with 
only one optional clause at the root level.
The case with multiple positions works as expected. 
The user query "test foo" generates:
 "(test (foo bar))" 
... and if we apply a minimum should match of 50%:
"(test (foo bar))~1"


> QueryBuilder should not differentiate single position and multiple positions 
> queries when the analyzer produces synonyms.  
> ---
>
> Key: LUCENE-6972
> URL: https://issues.apache.org/jira/browse/LUCENE-6972
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.4, 5.5
>Reporter: Ferenczi Jim
> Fix For: 5.5
>
>
> When synonyms are involved the QueryBuilder differentiates between two cases. When 
> there is only one position the query is composed of one BooleanQuery which 
> contains multiple should clauses. This does not interact well when trying to 
> apply a minimum_should_match to the query. For instance if a field has a 
> synonym rule like "foo,bar" the query "foo" will produce:
> bq. (foo bar)
> ... two optional clauses at the root level. If we apply a minimum should 
> match of 50% then the query becomes:
> bq. (foo bar)~1 
> This seems wrong, the terms are at the same position.
> IMO the querybuilder should produce the following query:
> bq. ((foo bar))
> ... and a minimum should match of 50% should not be applicable to a query 
> with only one optional clause at the root level.
> The case with multiple positions works as expected. 
> The user query "test foo" generates:
> bq. (test (foo bar)) 
> ... and if we apply a minimum should match of 50%:
> bq. (test (foo bar))~1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-6972) QueryBuilder should not differentiate single position and multiple positions queries when the analyzer produces synonyms.

2016-01-11 Thread Ferenczi Jim (JIRA)
Ferenczi Jim created LUCENE-6972:


 Summary: QueryBuilder should not differentiate single position and 
multiple positions queries when the analyzer produces synonyms.  
 Key: LUCENE-6972
 URL: https://issues.apache.org/jira/browse/LUCENE-6972
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.4, 5.5
Reporter: Ferenczi Jim
 Fix For: 5.5


When synonyms are involved the QueryBuilder differentiates between two cases. When there 
is only one position the query is composed of one BooleanQuery which contains 
multiple should clauses. This does not interact well when trying to apply a 
minimum_should_match to the query. For instance if a field has a synonym rule 
like "foo,bar" the query "foo" will produce:
"(foo bar)"
... two optional clauses at the root level. If we apply a minimum should match 
of 50% then the query becomes:
"(foo bar)~1". 
This seems wrong, the terms are at the same position.
IMO the querybuilder should produce the following query:
"((foo bar))" 
... and a minimum should match of 50% should not be applicable to a query with 
only one optional clause at the root level.
The case with multiple positions works as expected. 
The user query "test foo" generates:
 "(test (foo bar))" 
... and if we apply a minimum should match of 50%:
"(test (foo bar))~1"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems

2015-04-01 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390220#comment-14390220
 ] 

Ferenczi Jim commented on SOLR-7319:


Thanks [~elyograg]. We are big fans of your pages about the settings for Solr 
regarding the garbage collector. We changed a lot of our settings after reading 
your page and we are now happy with the GC performance in our setup. I guess 
that providing good default values for all use cases is almost impossible and 
that each deployment/use case would need a round of testing to find optimal 
values (especially for the tenuring threshold and the size of the heap). Anyway 
I think that most Solr users would be happy to have default values optimized by 
Solr experts. For those who think that they can get better performance with 
other settings, nothing prevents them from changing those defaults ;) My 
initial point was that the default options should not break any external tool 
accessing Solr, especially if they prevent the user from monitoring the GC with 
jstat.

> Workaround the "Four Month Bug" causing GC pause problems
> -
>
> Key: SOLR-7319
> URL: https://issues.apache.org/jira/browse/SOLR-7319
> Project: Solr
>  Issue Type: Bug
>  Components: scripts and tools
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shawn Heisey
> Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch
>
>
> A twitter engineer found a bug in the JVM that contributes to GC pause 
> problems:
> http://www.evanjones.ca/jvm-mmap-pause.html
> Problem summary (in case the blog post disappears):  The JVM calculates 
> statistics on things like garbage collection and writes them to a file in the 
> temp directory using MMAP.  If there is a lot of other MMAP write activity, 
> which is precisely how Lucene accomplishes indexing and merging, it can 
> result in a GC pause because the mmap write to the temp file is delayed.
> We should implement the workaround in the solr start scripts (disable 
> creation of the mmap statistics tempfile) and document the impact in 
> CHANGES.txt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems

2015-03-31 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388154#comment-14388154
 ] 

Ferenczi Jim commented on SOLR-7319:


Most of the Java options in solr.in.cmd should not be activated by default. 
The tenuring threshold, the number of threads for the GC, ..., they all depend 
on the type of deployment you have, the size of the heap and the machine 
hosting the Solr node. In my company we are using a custom script full of Java 
options that we added over the years. Most of the options are there because 
somebody added them with the assertion that performance is better. Most of 
the time we don't know what the option is for, but nobody wants to remove it 
because the urban legend says it's useful. The Solr startup script should be 
almost empty (at least for the Java options): maybe one or two options to set 
up the garbage collector and that's it.

> Workaround the "Four Month Bug" causing GC pause problems
> -
>
> Key: SOLR-7319
> URL: https://issues.apache.org/jira/browse/SOLR-7319
> Project: Solr
>  Issue Type: Bug
>  Components: scripts and tools
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shawn Heisey
> Fix For: 5.1
>
> Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch
>
>
> A twitter engineer found a bug in the JVM that contributes to GC pause 
> problems:
> http://www.evanjones.ca/jvm-mmap-pause.html
> Problem summary (in case the blog post disappears):  The JVM calculates 
> statistics on things like garbage collection and writes them to a file in the 
> temp directory using MMAP.  If there is a lot of other MMAP write activity, 
> which is precisely how Lucene accomplishes indexing and merging, it can 
> result in a GC pause because the mmap write to the temp file is delayed.
> We should implement the workaround in the solr start scripts (disable 
> creation of the mmap statistics tempfile) and document the impact in 
> CHANGES.txt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems

2015-03-30 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386786#comment-14386786
 ] 

Ferenczi Jim commented on SOLR-7319:


I am saying this because if we are not sure that Lucene is impacted we should 
not add this to the default options. Not being able to run jstat on a running 
node is problematic and will break a lot of monitoring tools built on top of 
Solr.

> Workaround the "Four Month Bug" causing GC pause problems
> -
>
> Key: SOLR-7319
> URL: https://issues.apache.org/jira/browse/SOLR-7319
> Project: Solr
>  Issue Type: Bug
>  Components: scripts and tools
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shawn Heisey
> Fix For: 5.1
>
> Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch
>
>
> A twitter engineer found a bug in the JVM that contributes to GC pause 
> problems:
> http://www.evanjones.ca/jvm-mmap-pause.html
> Problem summary (in case the blog post disappears):  The JVM calculates 
> statistics on things like garbage collection and writes them to a file in the 
> temp directory using MMAP.  If there is a lot of other MMAP write activity, 
> which is precisely how Lucene accomplishes indexing and merging, it can 
> result in a GC pause because the mmap write to the temp file is delayed.
> We should implement the workaround in the solr start scripts (disable 
> creation of the mmap statistics tempfile) and document the impact in 
> CHANGES.txt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7319) Workaround the "Four Month Bug" causing GC pause problems

2015-03-30 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386754#comment-14386754
 ] 

Ferenczi Jim commented on SOLR-7319:


"If there is a lot of other MMAP write activity, which is precisely how Lucene 
accomplishes indexing and merging" => Are you sure about this statement, 
MMapDirectory uses MMap for reads and a simple RandomAccessFile for writes. I 
don't know how the RandomAccessFile is implemented but I doubt it's using MMap 
at all. 
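To make that read/write distinction concrete, here is a minimal, self-contained sketch in plain NIO (not Lucene's actual code): reads go through an mmap-ed buffer, the way MMapDirectory exposes index files, while writes go through ordinary stream I/O and never touch mmap.

{code:java}
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch of the distinction discussed above, not Lucene code:
// the write path uses plain stream I/O, the read path uses mmap.
public class MmapReadStreamWrite {
  public static void main(String[] args) throws IOException {
    Path file = Paths.get("example.bin");   // hypothetical file name

    // Write path: ordinary buffered I/O, comparable to RandomAccessFile/FileOutputStream.
    Files.write(file, new byte[] {1, 2, 3, 4});

    // Read path: map the file into memory, the way MMapDirectory exposes index files.
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      MappedByteBuffer mapped =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      System.out.println("first byte read via mmap: " + mapped.get(0));
    }

    Files.deleteIfExists(file);
  }
}
{code}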

> Workaround the "Four Month Bug" causing GC pause problems
> -
>
> Key: SOLR-7319
> URL: https://issues.apache.org/jira/browse/SOLR-7319
> Project: Solr
>  Issue Type: Bug
>  Components: scripts and tools
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shawn Heisey
> Fix For: 5.1
>
> Attachments: SOLR-7319.patch, SOLR-7319.patch, SOLR-7319.patch
>
>
> A twitter engineer found a bug in the JVM that contributes to GC pause 
> problems:
> http://www.evanjones.ca/jvm-mmap-pause.html
> Problem summary (in case the blog post disappears):  The JVM calculates 
> statistics on things like garbage collection and writes them to a file in the 
> temp directory using MMAP.  If there is a lot of other MMAP write activity, 
> which is precisely how Lucene accomplishes indexing and merging, it can 
> result in a GC pause because the mmap write to the temp file is delayed.
> We should implement the workaround in the solr start scripts (disable 
> creation of the mmap statistics tempfile) and document the impact in 
> CHANGES.txt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-6606) In cloud mode the leader should distribute autoCommits to it's replicas

2014-10-13 Thread Ferenczi Jim (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169142#comment-14169142
 ] 

Ferenczi Jim commented on SOLR-6606:


Is this intended for partial recovery, or only to distribute the downloads on a 
full recovery (when the replica downloads a full index from the master)? I am 
saying this because, for partial recovery, you need to handle the deletes as 
well. For instance, if a replica missed the last commit it could download the 
segment from the master, but it would lose all the deletes related to that 
update. Keeping the list of every delete per commit seems mandatory but also 
very expensive, unless we can garbage collect them at some point.
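A rough sketch of the bookkeeping that last sentence implies (purely hypothetical, not Solr code): remember the deleted ids per commit so a replica that only missed one commit can replay them instead of pulling a full index, and drop the per-commit lists once every replica has acknowledged that commit point.

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical per-commit delete log; all names are made up for illustration.
public class PerCommitDeleteLog {
  static final class CommitDeletes {
    final long commitGeneration;
    final List<String> deletedIds;
    CommitDeletes(long commitGeneration, List<String> deletedIds) {
      this.commitGeneration = commitGeneration;
      this.deletedIds = deletedIds;
    }
  }

  private final Deque<CommitDeletes> log = new ArrayDeque<>();

  // Record the deletes that belong to a finished commit.
  synchronized void onCommit(long generation, List<String> deletedIds) {
    log.addLast(new CommitDeletes(generation, new ArrayList<>(deletedIds)));
  }

  // Deletes a lagging replica must replay to catch up from its last known commit.
  synchronized List<String> deletesSince(long lastSeenGeneration) {
    List<String> missed = new ArrayList<>();
    for (CommitDeletes c : log) {
      if (c.commitGeneration > lastSeenGeneration) {
        missed.addAll(c.deletedIds);
      }
    }
    return missed;
  }

  // "Garbage collect": once all replicas have acknowledged up to this
  // generation, the per-commit lists at or below it are no longer needed.
  synchronized void releaseUpTo(long minGenerationAckedByAllReplicas) {
    while (!log.isEmpty()
        && log.peekFirst().commitGeneration <= minGenerationAckedByAllReplicas) {
      log.removeFirst();
    }
  }
}
{code}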

> In cloud mode the leader should distribute autoCommits to it's replicas
> ---
>
> Key: SOLR-6606
> URL: https://issues.apache.org/jira/browse/SOLR-6606
> Project: Solr
>  Issue Type: Improvement
>Reporter: Varun Thacker
> Fix For: 5.0, Trunk
>
> Attachments: SOLR-6606.patch, SOLR-6606.patch
>
>
> Today in SolrCloud different replicas of a shard can trigger auto (hard) 
> commits at different times. Although the documents which get added to the 
> system remain consistent the way the segments gets formed can be different 
> because of this.
> The downside of segments not getting formed in an identical fashion across 
> replicas is that when a replica goes into recovery chances are that it has to 
> do a full index replication from the leader. This is time consuming and we 
> can possibly avoid this if the leader forwards auto (hard) commit commands to 
> its replicas and the replicas never explicitly trigger an auto (hard) commit.
> I am working on a patch. Should have it up shortly.
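A tiny hypothetical sketch of the idea described above (names and hooks are made up; this is not the attached patch): only the leader runs the autoCommit timer, and on every tick it issues the same hard commit to itself and to each replica, so the replicas never trigger one on their own and all copies of the shard cut segments at the same points.

{code:java}
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: a leader-driven autoCommit timer.
public class LeaderDrivenAutoCommit {
  interface ShardNode {
    void hardCommit();   // assumed hook that triggers a local hard commit
  }

  // Returns the timer so the caller can shut it down when the shard closes.
  static ScheduledExecutorService start(ShardNode leader, List<ShardNode> replicas,
                                        long intervalMs) {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    timer.scheduleAtFixedRate(() -> {
      leader.hardCommit();                    // the leader commits first
      for (ShardNode replica : replicas) {    // then forwards the same command
        replica.hardCommit();                 // replicas never self-trigger
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    return timer;
  }
}
{code}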



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org