[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents

2017-10-02 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16188704#comment-16188704
 ] 

Nik Everett commented on LUCENE-7976:
-

I had this issue on a previous project. Our indices were smaller than what you 
are talking about but we did have one or two of the max size segments that 
refused to merge away their deleted documents until they got to 50%. We had a 
fairly high update rate and a very high query rate. The deleted documents 
bloated the working set size somewhat causing more IO which was our bottleneck 
at the time. I would have been happy to pay for the increased merge IO to have 
lower query time IO.

We ultimately solved the problem by throwing money at it. More RAM and better 
SSDs make life much easier. I would have liked to have solved the problem in 
software but as a very infrequent contributor I didn't feel like I'd ever get 
a change to TieredMergePolicy merged.

> Add a parameter to TieredMergePolicy to merge segments that have more than X 
> percent deleted documents
> --
>
> Key: LUCENE-7976
> URL: https://issues.apache.org/jira/browse/LUCENE-7976
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Erick Erickson
>
> We're seeing situations "in the wild" where there are very large indexes (on 
> disk) handled quite easily in a single Lucene index. This is particularly 
> true as features like docValues move data into MMapDirectory space. The 
> current TMP algorithm allows on the order of 50% deleted documents as per a 
> dev list conversation with Mike McCandless (and his blog here:  
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many 
> TB) solutions like "you need to distribute your collection over more shards" 
> become very costly. Additionally, the tempting "optimize" button exacerbates 
> the issue since once you form, say, a 100G segment (by 
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the 
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like 
>  (no, that's not serious name, suggestions 
> welcome) which would default to 100 (or the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 
> 5G, the following would happen when segments were selected for merging:
> * Any segment with > 20% deleted documents would be merged or rewritten NO 
>   MATTER HOW LARGE. There are two cases:
>   ** The segment has < 5G "live" docs. In that case it would be merged with 
>      smaller segments to bring the resulting segment up to 5G. If no smaller 
>      segments exist, it would just be rewritten.
>   ** The segment has > 5G "live" docs (the result of a forceMerge or optimize). 
>      It would be rewritten into a single segment removing all deleted docs no 
>      matter how big it is to start. The 100G example above would be rewritten 
>      to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the 
> default would be the same behavior we see now. As it stands now, though, 
> there's no way to recover from an optimize/forceMerge except to re-index from 
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the 
> wild" with 10s of  shards replicated 3 or more times. And that doesn't even 
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A 
> new merge policy is certainly an alternative.
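
The arithmetic in the proposal above can be sketched in a few lines. This is an illustration only, not TieredMergePolicy code: the class and method names are hypothetical, and the rule that a segment becomes merge-eligible once its live data drops below half the max segment size is an assumption inferred from the 100G / 97.5G / 5G figures quoted in the description.

```java
/** Hypothetical sketch of the proposal; not Lucene's TieredMergePolicy code. */
final class DeletedDocsMergeSketch {

    /** Current behaviour as described: GB that must be deleted before a segment is merge-eligible. */
    static double gbDeletedBeforeEligible(double segmentGb, double maxSegmentGb) {
        // Assumption: a segment is eligible once its live data is below half the max segment size.
        double eligibleLiveGb = maxSegmentGb / 2.0;
        return Math.max(0.0, segmentGb - eligibleLiveGb);
    }

    /** Proposed rule: a segment over the threshold is merged or rewritten, however large it is. */
    static boolean exceedsDeletedThreshold(double pctDeleted, double maxPctDeleted) {
        // With the proposed default of 100 this is never true, so behaviour is unchanged.
        return pctDeleted > maxPctDeleted;
    }

    /** Size of the segment after it is rewritten with all deleted docs dropped. */
    static double rewrittenSizeGb(double segmentGb, double pctDeleted) {
        return segmentGb * (100.0 - pctDeleted) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(gbDeletedBeforeEligible(100.0, 5.0)); // 97.5
        System.out.println(exceedsDeletedThreshold(25.0, 20.0)); // true
        System.out.println(rewrittenSizeGb(100.0, 20.0));        // 80.0
    }
}
```

With the threshold left at 100, exceedsDeletedThreshold never fires, which matches the "same behavior we have now" default in the description.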



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets

2015-07-28 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6334:

Attachment: LUCENE-6334.patch

 Fast Vector Highlighter does not properly span neighboring term offsets
 ---

 Key: LUCENE-6334
 URL: https://issues.apache.org/jira/browse/LUCENE-6334
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/termvectors, modules/highlighter
Reporter: Chris Earle
  Labels: easyfix
 Attachments: LUCENE-6334.patch, LUCENE-6334.patch, LUCENE-6334.patch


 If you are using term vectors for fast vector highlighting along with a 
 multivalue field while matching a phrase that crosses two elements, then it 
 will not properly highlight even though it _properly_ finds the correct 
 values to highlight.
 A good example of this is when matching source code, where you might have 
 lines like:
 {code}
 one two three five
 two three four
 five six five
 six seven eight nine eight nine eight nine eight nine eight nine eight nine
 eight nine
 ten eleven
 twelve thirteen
 {code}
 Matching the phrase "four five" will return
 {code}
 two three four
 five six five
 six seven eight nine eight nine eight nine eight nine eight
 eight nine
 ten eleven
 {code}
 However, it does not properly highlight four (on the first line) and five 
 (on the second line) _and_ it is returning too many lines, but not all of 
 them.
 The problem lies in the [BaseFragmentsBuilder at line 269| 
 https://github.com/apache/lucene-solr/blob/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/BaseFragmentsBuilder.java#L269]
  because it is not checking for cross-coverage. Here is a possible solution:
 {code}
 boolean started = toffs.getStartOffset() >= fieldStart;
 boolean ended = toffs.getEndOffset() <= fieldEnd;
 // existing behavior:
 if (started && ended) {
     toffsList.add(toffs);
     toffsIterator.remove();
 }
 else if (started) {
     toffsList.add(new Toffs(toffs.getStartOffset(), fieldEnd));
     // toffsIterator.remove(); // is this necessary?
 }
 else if (ended) {
     toffsList.add(new Toffs(fieldStart, toffs.getEndOffset()));
     // toffsIterator.remove(); // is this necessary?
 }
 else if (toffs.getEndOffset() > fieldEnd) {
     // ie the toffs spans the whole field
     toffsList.add(new Toffs(fieldStart, fieldEnd));
     // toffsIterator.remove(); // is this necessary?
 }
 {code}
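
The cross-coverage check described above can also be sketched as a plain interval intersection. This is not the attached patch and not Lucene's BaseFragmentsBuilder: Toffs below is a simplified stand-in for Lucene's class, and clipToField is a hypothetical helper. It swaps the four-branch if/else of the proposal for max/min clipping, and it assumes end offsets are exclusive.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone sketch; Toffs is a simplified stand-in, not Lucene's class. */
final class CrossCoverageSketch {

    /** A term offset range: start inclusive, end exclusive (assumption). */
    static final class Toffs {
        final int start;
        final int end;

        Toffs(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the part of each term offset that overlaps [fieldStart, fieldEnd),
     * so a phrase spanning two values contributes a piece to each of them.
     */
    static List<Toffs> clipToField(List<Toffs> all, int fieldStart, int fieldEnd) {
        List<Toffs> clipped = new ArrayList<>();
        for (Toffs t : all) {
            int start = Math.max(t.start, fieldStart);
            int end = Math.min(t.end, fieldEnd);
            if (start < end) { // keep only non-empty overlaps with this value
                clipped.add(new Toffs(start, end));
            }
        }
        return clipped;
    }

    public static void main(String[] args) {
        List<Toffs> phrase = new ArrayList<>();
        phrase.add(new Toffs(11, 20)); // phrase crossing a value boundary at offset 15
        Toffs inFirst = clipToField(phrase, 0, 15).get(0);
        Toffs inSecond = clipToField(phrase, 15, 30).get(0);
        System.out.println(inFirst.start + "-" + inFirst.end);   // 11-15
        System.out.println(inSecond.start + "-" + inSecond.end); // 15-20
    }
}
```

Unlike the proposal, the sketch never removes entries from the source list, so the open "is this necessary?" question about toffsIterator.remove() is left unanswered here.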






[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets

2015-07-28 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6334:

Attachment: LUCENE-6334.patch

 Fast Vector Highlighter does not properly span neighboring term offsets
 ---

 Key: LUCENE-6334
 URL: https://issues.apache.org/jira/browse/LUCENE-6334
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/termvectors, modules/highlighter
Reporter: Chris Earle
  Labels: easyfix
 Attachments: LUCENE-6334.patch, LUCENE-6334.patch





[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets

2015-07-28 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6334:

Attachment: LUCENE-6334.patch

Test case and fix based on examples and source code provided in problem 
description. I started with the proposed fix and modified it quite a bit to get 
something that should get the job done. Also expanded on the proposed test 
cases to include things like phrases that span entire values.

 Fast Vector Highlighter does not properly span neighboring term offsets
 ---

 Key: LUCENE-6334
 URL: https://issues.apache.org/jira/browse/LUCENE-6334
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/termvectors, modules/highlighter
Reporter: Chris Earle
  Labels: easyfix
 Attachments: LUCENE-6334.patch, LUCENE-6334.patch





[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets

2015-07-28 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6334:

Attachment: (was: LUCENE-6334.patch)

 Fast Vector Highlighter does not properly span neighboring term offsets
 ---

 Key: LUCENE-6334
 URL: https://issues.apache.org/jira/browse/LUCENE-6334
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/termvectors, modules/highlighter
Reporter: Chris Earle
  Labels: easyfix
 Attachments: LUCENE-6334.patch





[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets

2015-07-28 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6334:

Attachment: (was: LUCENE-6334.patch)

 Fast Vector Highlighter does not properly span neighboring term offsets
 ---

 Key: LUCENE-6334
 URL: https://issues.apache.org/jira/browse/LUCENE-6334
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/termvectors, modules/highlighter
Reporter: Chris Earle
  Labels: easyfix
 Attachments: LUCENE-6334.patch





[jira] [Commented] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets

2015-07-27 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643330#comment-14643330
 ] 

Nik Everett commented on LUCENE-6334:
-

Would anyone object to me having a look at this?

 Fast Vector Highlighter does not properly span neighboring term offsets
 ---

 Key: LUCENE-6334
 URL: https://issues.apache.org/jira/browse/LUCENE-6334
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/termvectors, modules/highlighter
Reporter: Chris Earle
  Labels: easyfix




[jira] [Created] (LUCENE-6054) RegExp.toAutomaton fails on #*

2014-11-07 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-6054:
---

 Summary: RegExp.toAutomaton fails on #*
 Key: LUCENE-6054
 URL: https://issues.apache.org/jira/browse/LUCENE-6054
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor


This throws an assertion error:
new RegExp("#*").toAutomaton(1000);






[jira] [Updated] (LUCENE-6054) RegExp.toAutomaton fails on #*

2014-11-07 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6054:

Attachment: LUCENE-6054.diff

 RegExp.toAutomaton fails on #*
 --

 Key: LUCENE-6054
 URL: https://issues.apache.org/jira/browse/LUCENE-6054
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-6054.diff





[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-04 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196070#comment-14196070
 ] 

Nik Everett commented on LUCENE-6046:
-

A couple of updates:
This affects version 4.9 as well.  Probably all versions.  But its impact is 
likely minor enough to only be worth adding to the 4.10 line.

I found a few test cases that need lots and lots of states.  Any time you feed 
a couple hundred random unicode words to the automata you'll end up needing 
more than ten thousand states.  I've updated those tests to ask for a million 
states and they caught a few places where I hadn't been as diligent in piping 
maxDeterminizedStates through.

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-6046.patch, LUCENE-6046.patch, LUCENE-6046.patch


 When creating an automaton from an org.apache.lucene.util.automaton.RegExp, 
 it's possible for the automaton to use so much memory it exceeds the maximum 
 array size for java.
 The following caused an OutOfMemoryError with a 32gb heap:
 {noformat}
 new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
 {noformat}
 When increased to a 60gb heap, the following exception is thrown:
 {noformat}
 java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
     __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
     org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
     org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
     org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
     org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
     org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
     org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)
 {noformat}






[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194484#comment-14194484
 ] 

Nik Everett commented on LUCENE-6046:
-

I'm working on a first cut of something that does that.  Better regex 
implementation would be great but the biggest thing to me is being able to 
limit the amount of work the determinize operation performs.  It's such a costly 
operation that I don't think it should ever be really abstracted from the user. 
 Something like having determinize throw a checked exception when it performs 
too much work would make you keenly aware whenever you might be straying into 
exponential territory.

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor




[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194592#comment-14194592
 ] 

Nik Everett commented on LUCENE-6046:
-

Oh yeah, it's totally running into 2^n territory legitimately here.  This is 
totally something that'd be rejected by a framework to prevent explosive growth 
during determinization.

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor




[jira] [Updated] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6046:

Attachment: LUCENE-6046.patch

First cut at a patch.  Adds maxDeterminizedStates to Operations.determinize and 
pipes it through to tons of places.  I think it's important never to hide when 
determinize is called because of how potentially heavy it is.  Forcing callers 
of MinimizationOperations.minimize, Operations.reverse, Operations.minus etc to 
specify maxDeterminizedStates makes it pretty clear that the automaton might be 
determinized during those processes.

I added an unchecked exception for when the Automaton can't be determinized 
within the specified number of states but I'm really tempted to change it to a 
checked exception to make it super duper obvious when determinization might 
occur.
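
The idea of capping determinization can be sketched on a toy automaton. This is a minimal sketch and not the attached patch or Lucene's Operations.determinize: the Nfa class, the TooManyStatesException name and the nthFromLast helper are all hypothetical, and only the maxDeterminizedStates parameter name is taken from the comment above. It uses a plain subset construction and throws a checked exception as soon as a new state would exceed the limit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy bounded subset construction; not Lucene's automaton classes. */
final class BoundedDeterminizeSketch {

    /** Hypothetical checked exception; the name is illustrative only. */
    static final class TooManyStatesException extends Exception {
        TooManyStatesException(int limit) {
            super("determinizing would need more than " + limit + " states");
        }
    }

    /** Tiny NFA over chars: start state is 0, no epsilon transitions. */
    static final class Nfa {
        final List<Map<Character, Set<Integer>>> transitions = new ArrayList<>();

        int addState() {
            transitions.add(new HashMap<>());
            return transitions.size() - 1;
        }

        void addTransition(int from, char label, int to) {
            transitions.get(from).computeIfAbsent(label, k -> new TreeSet<>()).add(to);
        }
    }

    /** Returns the number of reachable DFA states, or throws once the limit would be exceeded. */
    static int determinize(Nfa nfa, int maxDeterminizedStates) throws TooManyStatesException {
        Map<Set<Integer>, Integer> dfaStates = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>();
        start.add(0);
        dfaStates.put(start, 0);
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            // Group the NFA targets of every state in this subset by label.
            Map<Character, Set<Integer>> next = new HashMap<>();
            for (int s : current) {
                for (Map.Entry<Character, Set<Integer>> e : nfa.transitions.get(s).entrySet()) {
                    next.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            for (Set<Integer> target : next.values()) {
                if (!dfaStates.containsKey(target)) {
                    if (dfaStates.size() >= maxDeterminizedStates) {
                        throw new TooManyStatesException(maxDeterminizedStates);
                    }
                    dfaStates.put(target, dfaStates.size());
                    work.add(target);
                }
            }
        }
        return dfaStates.size();
    }

    /** NFA for (a|b)*a(a|b){n}; subset construction reaches 2^(n+1) DFA states. */
    static Nfa nthFromLast(int n) {
        Nfa nfa = new Nfa();
        int s0 = nfa.addState();
        nfa.addTransition(s0, 'a', s0);
        nfa.addTransition(s0, 'b', s0);
        int prev = nfa.addState();
        nfa.addTransition(s0, 'a', prev);
        for (int i = 0; i < n; i++) {
            int nextState = nfa.addState();
            nfa.addTransition(prev, 'a', nextState);
            nfa.addTransition(prev, 'b', nextState);
            prev = nextState;
        }
        return nfa;
    }

    public static void main(String[] args) throws TooManyStatesException {
        System.out.println(determinize(nthFromLast(3), 1000)); // 16
    }
}
```

Because the exception is checked, every caller has to either handle it or declare it, which is the "super duper obvious" effect the comment is after. Whether that is worth the API churn is the open design question.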

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-6046.patch





[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194716#comment-14194716
 ] 

Nik Everett commented on LUCENE-6046:
-

Oh - I'm still running the Solr tests against this.  I imagine they'll pass as 
they've been running fine for 30 minutes now but I should throw that out there 
in case someone gets them to fail with this before I do.

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-6046.patch





[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195033#comment-14195033
 ] 

Nik Everett commented on LUCENE-6046:
-

Oh no!  I wrote a very similar patch!  Sorry to duplicate effort there.  

I found that 10,000 states wasn't quite enough to handle some of the tests so I 
went with 1,000,000 as the default.  It's pretty darn huge but it does get the 
job done.

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-6046.patch, LUCENE-6046.patch





[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195056#comment-14195056
 ] 

Nik Everett commented on LUCENE-6046:
-

TestDeterminizeLexicon wants to make an automaton that accepts 5000 random 
strings.  So 10,000 isn't enough states for it.  I'll drop the default limit to 
10,000 again and just feed a million to that test case. 

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-6046.patch, LUCENE-6046.patch





[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195065#comment-14195065
 ] 

Nik Everett commented on LUCENE-6046:
-

I'll certainly add the regexp string to the exception message.  And I'll merge 
the toStringTree from your patch into mine if you'd like.

Yeah - QueryParserBase should have this option too.

The thing I found most useful for debugging this was to call toDot on the 
automata before and after normalization.  I just looked at it and went, oh, of 
course you have to do it that way.  No wonder the states explode.  And then I 
read https://en.wikipedia.org/wiki/Powerset_construction and remembered it from 
my rusty CS degree.
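
As a rough illustration of why the states explode under powerset construction, here is a minimal sketch in plain Java. It does not use Lucene: the class name PowersetDemo, the toy NFA for (a|b)*a(a|b){n}, and the bitmask encoding are all assumptions made for the example. The maxStates argument plays the role of the state limit discussed in this issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical demo class; this is not Lucene's Operations.determinize. */
final class PowersetDemo {

    /**
     * Determinizes the toy NFA for (a|b)*a(a|b){n} by subset construction and
     * returns the number of reachable DFA states. Sets of NFA states are
     * encoded as bitmasks in a long, so n must stay at or below 61.
     */
    static int countDfaStates(int n, int maxStates) {
        if (n < 0 || n > 61) {
            throw new IllegalArgumentException("n must be in [0, 61]: " + n);
        }
        Set<Long> seen = new HashSet<>();
        Deque<Long> work = new ArrayDeque<>();
        long start = 1L; // only NFA state 0 is active at the start
        seen.add(start);
        work.push(start);
        while (!work.isEmpty()) {
            long current = work.pop();
            for (char c : new char[] {'a', 'b'}) {
                long next = step(current, c, n);
                if (seen.add(next)) {
                    // Give up early instead of exhausting the heap.
                    if (seen.size() > maxStates) {
                        throw new IllegalStateException(
                            "more than " + maxStates + " DFA states for n=" + n);
                    }
                    work.push(next);
                }
            }
        }
        return seen.size();
    }

    /** Applies one input character to a set of NFA states. */
    private static long step(long states, char c, int n) {
        long next = 0L;
        // State n+1 is the accepting state and has no outgoing transitions.
        for (int s = 0; s <= n; s++) {
            if ((states & (1L << s)) == 0) {
                continue;
            }
            if (s == 0) {
                next |= 1L;         // state 0 loops on both letters
                if (c == 'a') {
                    next |= 2L;     // and may also move to state 1 on 'a'
                }
            } else {
                next |= 1L << (s + 1); // states 1..n advance on either letter
            }
        }
        return next;
    }
}
```

The NFA has n + 2 states, but the DFA has to remember the last n + 1 characters, so the count doubles with each increment of n. With a limit of 10,000, countDfaStates(12, 10_000) still succeeds while countDfaStates(13, 10_000) throws.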

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-6046.patch, LUCENE-6046.patch





[jira] [Updated] (LUCENE-6046) RegExp.toAutomaton high memory use

2014-11-03 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-6046:

Attachment: LUCENE-6046.patch

Next version with fixes based on Mike's feedback.

 RegExp.toAutomaton high memory use
 --

 Key: LUCENE-6046
 URL: https://issues.apache.org/jira/browse/LUCENE-6046
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Affects Versions: 4.10.1
Reporter: Lee Hinman
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-6046.patch, LUCENE-6046.patch, LUCENE-6046.patch





[jira] [Commented] (LUCENE-4556) FuzzyTermsEnum creates tons of objects

2014-05-28 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011014#comment-14011014
 ] 

Nik Everett commented on LUCENE-4556:
-

I'm having GC trouble and I'm using the DirectCandidateGenerator.  It's 
obviously kind of hard to tell how much the automata are contributing in 
production, but when I try it locally, just generating the automata for two or 
three terms takes about 200KB of memory.  Napkin math (200KB * 
250 queries/second) says this makes about 50MB of garbage per second per index.  
Obviously it gets worse if you run this in a sharded context where each shard 
does the generating.  Well, not really worse, but the large up-front cost and 
memory consumption of this process is relatively static based on shard size, so 
this becomes a reason to use larger shards. 
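
The napkin math above, spelled out as a tiny sketch. The class and method names are made up for illustration, and 1 MB is taken as 1000 KB, as the rounding in the estimate suggests.

```java
/** Hypothetical helper for the back-of-the-envelope estimate; not part of Lucene. */
final class GarbageEstimate {

    /**
     * Returns the estimated garbage rate in MB per second, given the memory
     * allocated per query in KB and the query rate. Uses 1 MB = 1000 KB.
     */
    static double megabytesPerSecond(double kilobytesPerQuery, double queriesPerSecond) {
        return kilobytesPerQuery * queriesPerSecond / 1000.0;
    }
}
```

Calling GarbageEstimate.megabytesPerSecond(200, 250) gives 50.0, matching the figure quoted above.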

I should propose that, in addition to Simon's patches, another option is to 
try to implement something like the stack-based automaton simulation that the 
Schulz and Mihov paper (the one that proposed the Lev automaton) describes in 
section 6.  It's not useful for stuff like intersecting the enums, but if you are 
willing to forgo that you could probably get away with much less memory 
consumption.

 FuzzyTermsEnum creates tons of objects
 --

 Key: LUCENE-4556
 URL: https://issues.apache.org/jira/browse/LUCENE-4556
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search, modules/spellchecker
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Michael McCandless
Priority: Critical
 Fix For: 4.9, 5.0

 Attachments: LUCENE-4556.patch, LUCENE-4556.patch


 I ran into this problem in production using the DirectSpellChecker. The 
 number of objects created by the spellchecker shoots through the roof very 
 very quickly. We ran about 130 queries and ended up with 2M transitions / 
 states. We spend 50% of the time in GC just because of transitions. Other 
 parts of the system behave just fine here.
 I talked quickly to Robert and gave a POC a shot, providing a 
 LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case 
 and build an array-based structure converted into UTF-8 directly instead of 
 going through the object-based APIs. This involved quite a few changes but 
 they are all package private at this point. I have a patch that still has a 
 fair set of nocommits but it shows that it's possible and IMO worth the 
 trouble to make this really usable in production. All tests pass with the 
 patch - it's a start.






[jira] [Created] (LUCENE-5452) Combine matches from multiple fields into one with the postings highlighter

2014-02-18 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-5452:
---

 Summary: Combine matches from multiple fields into one with the 
postings highlighter
 Key: LUCENE-5452
 URL: https://issues.apache.org/jira/browse/LUCENE-5452
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Nik Everett
Priority: Minor


Like you can do with the FVH, it'd be nice to be able to combine matches from 
multiple fields with the postings highlighter.

Note that the postings highlighter doesn't do phrase matching and doesn't use 
term boosts so some of the FVH's field combining features won't work.  It'd be 
nice to get some of them, though.






[jira] [Commented] (LUCENE-5452) Combine matches from multiple fields into one with the postings highlighter

2014-02-18 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904610#comment-13904610
 ] 

Nik Everett commented on LUCENE-5452:
-

I hadn't really thought of doing it a level above.  I like the idea.  The only 
thing that jumps out at me about doing it this way is that there is only a 
single priority queue rather than multiple that have to be maintained and 
merged.  I'm not sure if that outweighs the extra API complexity this adds.  
I'm also pretty sure the higher level approach is more likely to keep the 
careful linear reads that the PostingsHighlighter does.   

 Combine matches from multiple fields into one with the postings highlighter
 ---

 Key: LUCENE-5452
 URL: https://issues.apache.org/jira/browse/LUCENE-5452
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5452.patch





[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-11 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Attachment: (was: LUCENE-5437.patch)

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch


 I've found myself wanting an ASCIIFoldingFilter that emits both the folded 
 tokens and the original, unfolded tokens.






[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-11 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Attachment: LUCENE-5437.patch

Uploading new diff with changes Simon asked for.

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch





[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-10 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Attachment: (was: LUCENE-5437.patch)

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch





[jira] [Commented] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-07 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13894500#comment-13894500
 ] 

Nik Everett commented on LUCENE-5437:
-

I thought about that but my instinct was that duplicating with the keyword 
attribute would add overhead in the case where there aren't characters to fold, 
which is by far the more common case.  I think you'd also have to make 
supporting the keyword attribute optional so it wouldn't break backwards 
compatibility.  I figured optionally supporting the keyword attribute would be 
about the same amount of work/code as only duplicating when required so I went 
that way.  I went with adding the extra class and moving the real 
implementation to an abstract base class more out of a desire to be minimally 
invasive to the original than anything technical.
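
To make "only duplicating when required" concrete, here is a minimal sketch that works on plain strings rather than a Lucene TokenStream. The class FoldingSketch and both method names are hypothetical, and the fold step (Unicode decomposition followed by dropping combining marks) is only a crude stand-in for the real ASCIIFoldingFilter mapping table.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the patch attached to this issue. */
final class FoldingSketch {

    /** Crude stand-in for ASCII folding: decompose, then drop combining marks. */
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /**
     * Emits the folded form of every token, plus the original form only when
     * folding actually changed it. Tokens with nothing to fold are not
     * duplicated, so the common case adds no extra output.
     */
    static List<String> foldPreservingOriginal(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String folded = fold(token);
            out.add(folded);
            if (!folded.equals(token)) {
                // In a real token stream this would sit at the same position
                // as the folded token (position increment 0).
                out.add(token);
            }
        }
        return out;
    }
}
```

For the input tokens "café" and "bar" this yields "cafe", "café", "bar": one extra token for the word that had something to fold, none for the other.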

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch





[jira] [Commented] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-07 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13894563#comment-13894563
 ] 

Nik Everett commented on LUCENE-5437:
-

I suppose I'm just used to abstract classes but you are right, the delegate 
would work better here.  I'll make that change.  Before I do, though, does my 
argument (more instinct, really) about only cloning the token if there is 
anything to fold make sense?  If not I'll just add support for the keyword 
attribute with a version check.

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch





[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-07 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Attachment: LUCENE-5437.patch

Patch that uses a simple boolean rather than crazy subclassing.

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch, LUCENE-5437.patch





[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-07 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Attachment: (was: LUCENE-5437.patch)

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch, LUCENE-5437.patch





[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-07 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Attachment: LUCENE-5437.patch

Minor improvement in the names of things in the tests.

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-5437.patch, LUCENE-5437.patch





[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-06 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Priority: Minor  (was: Major)

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Priority: Minor




[jira] [Created] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-06 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-5437:
---

 Summary: ASCIIFoldingFilter that emits both unfolded and folded 
tokens
 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett


I've found myself wanting an ASCIIFoldingFilter that emits both the folded 
tokens and the original, unfolded tokens.






[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens

2014-02-06 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5437:


Attachment: LUCENE-5437.patch

Sorry for moving so much code.

 ASCIIFoldingFilter that emits both unfolded and folded tokens
 -

 Key: LUCENE-5437
 URL: https://issues.apache.org/jira/browse/LUCENE-5437
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5437.patch





[jira] [Updated] (LUCENE-5435) CommonTermsQuery should be able to query fields other than the one used as a source of commonness

2014-02-05 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5435:


Attachment: LUCENE-5435.patch

 CommonTermsQuery should be able to query fields other than the one used as a 
 source of commonness
 -

 Key: LUCENE-5435
 URL: https://issues.apache.org/jira/browse/LUCENE-5435
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
 Attachments: LUCENE-5435.patch


 It'd be wonderful if I could use the commonness of terms in one field, say the 
 contents of a document, to power a search across both the document and its 
 title.  Continuing the example, I'd like to be able to build a query like this:
 the first
 that is rewritten into: 
 (title:the OR body:the) +(title:first OR body:first)
 with the help of the CommonTermsQuery logic.  Essentially, I'd like 
 CommonTermsQuery to soften the implicit AND for "the" into an OR because it 
 is common.






[jira] [Updated] (LUCENE-5435) CommonTermsQuery should be able to query fields other than the one used as a source of commonness

2014-02-05 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5435:


Priority: Minor  (was: Major)

 CommonTermsQuery should be able to query fields other than the one used as a 
 source of commonness
 -

 Key: LUCENE-5435
 URL: https://issues.apache.org/jira/browse/LUCENE-5435
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5435.patch





[jira] [Created] (LUCENE-5435) CommonTermsQuery should be able to query fields other than the one used as a source of commonness

2014-02-05 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-5435:
---

 Summary: CommonTermsQuery should be able to query fields other 
than the one used as a source of commonness
 Key: LUCENE-5435
 URL: https://issues.apache.org/jira/browse/LUCENE-5435
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nik Everett
 Attachments: LUCENE-5435.patch

It'd be wonderful if I could use the commonness of terms in one field, say the 
contents of a document, to power a search across both the document and its 
title.  Continuing the example, I'd like to be able to build a query like this:
the first
that is rewritten into: 
(title:the OR body:the) +(title:first OR body:first)
with the help of the CommonTermsQuery logic.  Essentially, I'd like 
CommonTermsQuery to soften the implicit AND for "the" into an OR because it is 
common.
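
The rewrite described above could look roughly like the following sketch, which only builds the query string. CommonTermsSketch, its rewrite method, and the cutoff parameter are all hypothetical, and the document frequencies are passed in by hand instead of being read from an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the requested rewrite; not the CommonTermsQuery API. */
final class CommonTermsSketch {

    /**
     * Expands each term across all fields. A term is treated as common when
     * its document frequency in the source field exceeds cutoff * maxDoc.
     * Common terms become optional groups; rare terms stay required ("+").
     */
    static String rewrite(List<String> terms,
                          List<String> fields,
                          Map<String, Integer> docFreqInSourceField,
                          int maxDoc,
                          double cutoff) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            String group = "(" + String.join(" OR ", perField) + ")";
            int docFreq = docFreqInSourceField.getOrDefault(term, 0);
            boolean common = docFreq > cutoff * maxDoc;
            clauses.add(common ? group : "+" + group);
        }
        return String.join(" ", clauses);
    }
}
```

With terms "the" and "first", fields "title" and "body", body document frequencies of 950 and 40 out of 1000 documents, and a cutoff of 0.5, this produces (title:the OR body:the) +(title:first OR body:first).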






[jira] [Commented] (LUCENE-5361) FVH throws away some boosts

2014-01-08 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865566#comment-13865566
 ] 

Nik Everett commented on LUCENE-5361:
-

Wonderful!  Thanks.

 FVH throws away some boosts
 ---

 Key: LUCENE-5361
 URL: https://issues.apache.org/jira/browse/LUCENE-5361
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.6.1

 Attachments: LUCENE-5361.patch


 The FVH's FieldQuery throws away some boosts when flattening queries, 
 including DisjunctionMaxQuery and BooleanQuery queries.   Fragments generated 
 against queries containing boosted boolean queries don't end up sorted 
 correctly.






[jira] [Created] (LUCENE-5361) FVH throws away some boosts

2013-12-06 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-5361:
---

 Summary: FVH throws away some boosts
 Key: LUCENE-5361
 URL: https://issues.apache.org/jira/browse/LUCENE-5361
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor


The FVH's FieldQuery throws away some boosts when flattening queries, including 
DisjunctionMaxQuery and BooleanQuery queries.   Fragments generated against 
queries containing boosted boolean queries don't end up sorted correctly.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5361) FVH throws away some boosts

2013-12-06 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5361:


Attachment: LUCENE-5361.patch

Fix the issue by pushing boosts from parent queries to child queries when the 
parent queries are flattened.  I clone the child queries before setting their 
boost so I don't break anything that expects them unchanged.  I'm not super 
happy that I have to clone the queries but it seemed like the simplest solution.
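
The approach described above can be sketched in plain Java as follows.  This is not the patch itself: Node is a hypothetical stand-in for a Lucene query, and the sketch assumes that nested boosts combine by multiplication.

```java
import java.util.ArrayList;
import java.util.List;

public class BoostFlattenSketch {
    /** Hypothetical stand-in for a query: either a leaf term or a parent with children. */
    public static final class Node {
        public final String term;         // non-null only for leaves
        public final float boost;
        public final List<Node> children; // empty for leaves

        private Node(String term, float boost, List<Node> children) {
            this.term = term;
            this.boost = boost;
            this.children = children;
        }

        public static Node leaf(String term, float boost) {
            return new Node(term, boost, List.of());
        }

        public static Node parent(float boost, Node... children) {
            return new Node(null, boost, List.of(children));
        }
    }

    /** Flattens the tree into leaves, pushing every ancestor's boost down to each leaf. */
    public static List<Node> flatten(Node root) {
        List<Node> out = new ArrayList<>();
        flatten(root, 1.0f, out);
        return out;
    }

    private static void flatten(Node node, float inherited, List<Node> out) {
        float combined = inherited * node.boost;
        if (node.term != null) {
            // Emit a copy so the caller's original leaf keeps its own boost.
            out.add(Node.leaf(node.term, combined));
            return;
        }
        for (Node child : node.children) {
            flatten(child, combined, out);
        }
    }
}
```

Copying the leaf instead of mutating it mirrors the cloning trade-off mentioned in the comment: anything still holding the original query sees it unchanged.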

 FVH throws away some boosts
 ---

 Key: LUCENE-5361
 URL: https://issues.apache.org/jira/browse/LUCENE-5361
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5361.patch


 The FVH's FieldQuery throws away some boosts when flattening queries, 
 including DisjunctionMaxQuery and BooleanQuery queries.   Fragments generated 
 against queries containing boosted boolean queries don't end up sorted 
 correctly.






[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields

2013-11-27 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5285:


Attachment: LUCENE-5285.patch

Ah!  += yeah.  This fixes it and improves the test so it would notice the 
difference.

 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields
 ---

 Key: LUCENE-5285
 URL: https://issues.apache.org/jira/browse/LUCENE-5285
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5285.patch, LUCENE-5285.patch


 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields.  This is only a problem when you want to sort the 
 fragments by score. Technically BaseFragmentsBuilder (line 261 in my copy of 
 the source) does the copying.
 Rather than copying the score I _think_ it'd be more right to pull that 
 copying logic into a protected method that child classes (such as 
 ScoreOrderFragmentsBuilder) can override to do more intelligent things.  
 Exactly what that means isn't clear to me at the moment.






[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields

2013-11-05 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5285:


Attachment: (was: LUCENE-5285.patch)

 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields
 ---

 Key: LUCENE-5285
 URL: https://issues.apache.org/jira/browse/LUCENE-5285
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5285.patch


 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields.  This is only a problem when you want to sort the 
 fragments by score. Technically BaseFragmentsBuilder (line 261 in my copy of 
 the source) does the copying.
 Rather than copying the score I _think_ it'd be more right to pull that 
 copying logic into a protected method that child classes (such as 
 ScoreOrderFragmentsBuilder) can override to do more intelligent things.  
 Exactly what that means isn't clear to me at the moment.






[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields

2013-10-25 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5285:


Attachment: LUCENE-5285.patch

New patch fixes my broken WeightedFragList change and expands  
WeightedFragListBuilderTest to catch the broken implementation.

 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields
 ---

 Key: LUCENE-5285
 URL: https://issues.apache.org/jira/browse/LUCENE-5285
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5285.patch, LUCENE-5285.patch


 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields.  This is only a problem when you want to sort the 
 fragments by score. Technically BaseFragmentsBuilder (line 261 in my copy of 
 the source) does the copying.
 Rather than copying the score I _think_ it'd be more right to pull that 
 copying logic into a protected method that child classes (such as 
 ScoreOrderFragmentsBuilder) can override to do more intelligent things.  
 Exactly what that means isn't clear to me at the moment.






[jira] [Commented] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields

2013-10-23 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802776#comment-13802776
 ] 

Nik Everett commented on LUCENE-5285:
-

I realized last night that I did the WeightedFragList incorrectly in that 
patch.  I'll upload another one as time permits.

 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields
 ---

 Key: LUCENE-5285
 URL: https://issues.apache.org/jira/browse/LUCENE-5285
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5285.patch


 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields.  This is only a problem when you want to sort the 
 fragments by score. Technically BaseFragmentsBuilder (line 261 in my copy of 
 the source) does the copying.
 Rather than copying the score I _think_ it'd be more right to pull that 
 copying logic into a protected method that child classes (such as 
 ScoreOrderFragmentsBuilder) can override to do more intelligent things.  
 Exactly what that means isn't clear to me at the moment.






[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields

2013-10-22 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5285:


Attachment: LUCENE-5285.patch

This adds a boost member to FieldFragList's SubInfo, which is its contribution 
to the WeightedFragInfo's boost.  When splitting a WeightedFragInfo across 
fields, the new info's score is the sum of the scores of all SubInfos it contains.
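
A minimal sketch of that scoring rule, in plain Java rather than the FVH classes: SubInfo here is a simplified, hypothetical version that only records which value of the multi-valued field a match falls in and how much boost it contributes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SplitScoreSketch {
    /** Simplified, hypothetical SubInfo: one match and its share of the fragment score. */
    public static final class SubInfo {
        public final int valueIndex; // which value of the multi-valued field holds the match
        public final float boost;    // this match's contribution to the fragment score

        public SubInfo(int valueIndex, float boost) {
            this.valueIndex = valueIndex;
            this.boost = boost;
        }
    }

    /** Returns, per field value, the score of the fragment piece that lands in that value. */
    public static Map<Integer, Float> splitScores(List<SubInfo> subInfos) {
        Map<Integer, Float> scores = new TreeMap<>();
        for (SubInfo info : subInfos) {
            // Accumulate each contribution instead of copying the parent's total score.
            scores.merge(info.valueIndex, info.boost, Float::sum);
        }
        return scores;
    }
}
```

With contributions of 1.0 and 2.0 in the first value and 0.5 in the second, the two pieces score 3.0 and 0.5, instead of both inheriting the original fragment's 3.5.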

 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields
 ---

 Key: LUCENE-5285
 URL: https://issues.apache.org/jira/browse/LUCENE-5285
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Minor
 Attachments: LUCENE-5285.patch


 FastVectorHighlighter copies segments scores when splitting segments across 
 multi-valued fields.  This is only a problem when you want to sort the 
 fragments by score. Technically BaseFragmentsBuilder (line 261 in my copy of 
 the source) does the copying.
 Rather than copying the score I _think_ it'd be more right to pull that 
 copying logic into a protected method that child classes (such as 
 ScoreOrderFragmentsBuilder) can override to do more intelligent things.  
 Exactly what that means isn't clear to me at the moment.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-20 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.patch

Change the code style of the MergedIterator changes and TestMergedIterator to match the 
style in the rest of core.  The FVH changes still use the wide style prevalent 
in the FVH code.
Also, sort fewer numbers in TestMergedIterator to make it faster.  The only 
reason I was sorting so many the first time around was to get a good sense of 
what I was doing to the speed by adding the additional conditional.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.
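
To make the intended result concrete, here is a minimal sketch in plain Java.  It is not FastVectorHighlighter code: the class name, the highlight method and the hard-coded match offsets are illustrative assumptions.  It takes matches found on the parent field and on the child field, merges them by start offset, and wraps each one in the stored text of the parent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HighlightSketch {
    /**
     * Wraps every match in <em> tags. Each match is a {start, end} offset pair into
     * text; parentMatches and childMatches come from two differently analyzed copies
     * of the same text (both lists are hypothetical inputs for this sketch).
     */
    public static String highlight(String text, List<int[]> parentMatches,
                                   List<int[]> childMatches) {
        List<int[]> all = new ArrayList<>(parentMatches);
        all.addAll(childMatches);
        all.sort(Comparator.comparingInt((int[] m) -> m[0]));

        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int[] m : all) {
            if (m[0] < pos) {
                continue; // skip a match that overlaps one already highlighted
            }
            out.append(text, pos, m[0]);
            out.append("<em>").append(text, m[0], m[1]).append("</em>");
            pos = m[1];
        }
        out.append(text, pos, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // "scissors" matched on foo, "running" matched on the child field.
        System.out.println(highlight("running with scissors",
                List.of(new int[] {13, 21}), List.of(new int[] {0, 7})));
    }
}
```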






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-20 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-18 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.clean.patch

Attached a version of the patch that applies cleanly but doesn't clearly show 
the changes to MergedIterator.  I built it by svn rm and svn add rather than 
svn mv + edit.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.clean.patch, LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-18 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274.clean.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-18 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-18 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.patch

Finally switch text to be generated on the fly.  No other changes.  The patch 
_should_ apply cleanly but, like the last one, doesn't clearly show what I 
changed in MergedIterator.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-17 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798512#comment-13798512
 ] 

Nik Everett commented on LUCENE-5274:
-

{quote}
 (it removed MergedIterator.java)
{quote}
It was supposed to move it to the util package.  I'll figure out what happened 
there.

I agree with the other points but it is worth discussing the last one.  For the 
others I'll just make the changes you mention.

I intentionally didn't update text in WeightedPhraseInfo.merge because it is 
documented as being for debugging so it didn't seem worth the cost.  Would it 
make sense to remove the member entirely and generate it from stored terms when 
needed?

It also doesn't update seqnum mostly because I really don't know the right way 
to update it.

As for WeightedPhraseInfo's immutability - I didn't see any final members so 
setting up the state in the constructor and not having setters just looked more 
like it wanted to encapsulate logic rather than immutability.  I'll swap the 
merge method with a merging constructor.
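
For readers unfamiliar with the pattern, here is a minimal sketch of what a merging constructor on an immutable class can look like.  It is plain Java, not the real WeightedPhraseInfo: the class and its fields are hypothetical, and summing the boosts is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class PhraseInfoSketch {
    private final List<Integer> startOffsets;
    private final float boost;

    public PhraseInfoSketch(List<Integer> startOffsets, float boost) {
        // Defensive copy so later changes to the caller's list cannot leak in.
        this.startOffsets = Collections.unmodifiableList(new ArrayList<>(startOffsets));
        this.boost = boost;
    }

    /** Merging constructor: builds a new instance and leaves the inputs untouched. */
    public PhraseInfoSketch(List<PhraseInfoSketch> toMerge) {
        List<Integer> merged = new ArrayList<>();
        float total = 0f;
        for (PhraseInfoSketch info : toMerge) {
            merged.addAll(info.startOffsets);
            total += info.boost; // assumption: merged boosts are summed
        }
        Collections.sort(merged);
        this.startOffsets = Collections.unmodifiableList(merged);
        this.boost = total;
    }

    public List<Integer> startOffsets() {
        return startOffsets;
    }

    public float boost() {
        return boost;
    }
}
```

Because all state is set once in a constructor, the fields can be final, which a mutating merge method would not allow.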

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-17 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.patch

Fix all issues except the text on WeightedPhraseInfo.  If we're ok with 
building it on the fly then I'll get to that in the morning.

I can't get the patch to apply cleanly - something to do with moving a file and 
then changing its contents.  The closest I can come is:
 svn mv lucene/core/src/java/org/apache/lucene/index/MergedIterator.java lucene/core/src/java/org/apache/lucene/util/
 patch -f -p0 < ~/LUCENE-5274.patch
 svn add lucene/core/src/test/org/apache/lucene/util/TestMergedIterator.java

I'm sure there is a better way to do this.  If you get the chance please let me 
know.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-17 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-14 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794190#comment-13794190
 ] 

Nik Everett commented on LUCENE-5274:
-

I'm having a look at what I can do to pull MergedIterator into the util package 
and give it nice unit tests.  Almost done with that and I should be able to 
spin another version of this patch.  I'm not exactly sure of a good way to test 
the synonym stuff in FastVectorHighlighterTest - I don't see a mock Synonym 
filter.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-14 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-14 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.patch

Not done yet but progress:
1.  Move MergedIterator to util.
2.  Add a mode to it to not remove duplicates (one extra branch per call to 
next).
3.  Add a unit test for MergedIterator.
4.  Make FieldTermStack.TermInfo, FieldPhraseList.WeightedPhraseInfo, 
FieldPhraseList.WeightedPhraseInfo.Toffs consistent under equals, hashCode, and 
compareTo.  I don't think any of them would make good hash keys but I fixed up 
hashCode because I fixed up equals.
5.  Unit tests for point 4.
6.  Use the non-duplicate removing mode of MergedIterator in FieldPhraseList's 
merge methods.
7.  More tests in FastVectorHighlighterTest - mostly around exact equal matches 
and how they affect segment sorting.
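
The duplicate-removing versus duplicate-keeping modes from point 2 can be sketched like this.  It is plain Java and not Lucene's MergedIterator: the class name and method signature are hypothetical, and it returns a list rather than an iterator to keep the example short.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {
    /** The current head element of one sorted input plus the rest of that input. */
    private static final class Head<T extends Comparable<T>> implements Comparable<Head<T>> {
        T value;
        final Iterator<T> source;

        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }

        @Override
        public int compareTo(Head<T> other) {
            return value.compareTo(other.value);
        }
    }

    /** Merges already-sorted inputs; equal elements are dropped only if removeDuplicates is set. */
    public static <T extends Comparable<T>> List<T> merge(boolean removeDuplicates,
                                                          List<List<T>> sortedInputs) {
        PriorityQueue<Head<T>> queue = new PriorityQueue<>();
        for (List<T> input : sortedInputs) {
            Iterator<T> it = input.iterator();
            if (it.hasNext()) {
                queue.add(new Head<>(it.next(), it));
            }
        }
        List<T> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Head<T> head = queue.poll();
            boolean duplicate = !out.isEmpty()
                    && out.get(out.size() - 1).compareTo(head.value) == 0;
            // This is the one extra branch: keep duplicates unless asked to remove them.
            if (!(removeDuplicates && duplicate)) {
                out.add(head.value);
            }
            if (head.source.hasNext()) {
                head.value = head.source.next();
                queue.add(head);
            }
        }
        return out;
    }
}
```

Merging [1, 3, 5] with [1, 2, 5] gives [1, 2, 3, 5] when duplicates are removed and [1, 1, 2, 3, 5, 5] when they are kept; the second mode matters when equal matches from different fields must all survive the merge.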

At this point this is left:
1.  Unit tests for equal matches in the same FieldPhraseList.
2.  Poke around with corner cases during merges.  Test them in 
FastVectorHighlighterTest if they reflect mockable real world cases.  Expand 
FieldPhraseListTest if they don't.
3.  Remove highlighter dependency on analyzer module.  Would it make sense to 
move PerFieldAnalyzerWrapper into core?  Something else?
4.  Anything else from review.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch


 I've been messing around with the FastVectorHighlighter and it looks like I 
 can teach it to highlight matches on child fields.  Like this query:
 foo:scissors foo_exact:running
 would highlight foo like this:
 <em>running</em> with <em>scissors</em>
 Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy 
 of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.
 This would make queries that perform weighted matches against different 
 analyzers much more convenient to highlight.
 I have working code and test cases but they are hacked into Elasticsearch.  
 I'd love to Lucene-ify if you'll take them.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-14 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-14 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.patch

Removed analyzer dependency and added tests covering synonyms.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274-4.patch

Reworked to remove dependency on query parser and most of the analyzer 
dependency and to fix errors with phrases.  It'll need to lose the rest of the 
analyzer dependency and have more test cases in addition to any other concerns 
raised in the review. 

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274-4.patch, LUCENE-5274.patch





[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor




[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: (was: LUCENE-5274-4.patch)

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor




[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.patch

New version of the patch.  This one works a lot better with phrases and even 
works on fields that have the same source but different tokenizers.

It still makes highlighting depend on the analysis module to pick up 
PerFieldAnalyzerWrapper.

I think all the new code this adds to FieldPhraseList deserves a unit test on 
its own but I'm not in the frame of mind to write one at the moment so I'll 
have to come back to it later.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13792913#comment-13792913
 ] 

Nik Everett commented on LUCENE-5274:
-

Hey, forgot to mention that.  MockTokenizer seems to throw away the character 
after the end of each token even if that character is the valid start to the 
next token.  This comes up because I wanted to tokenize strings in a simplistic 
way to test that the highlighter can handle different tokenizers and it just 
wasn't working right.  So I fixed MockTokenizer but I did it in a pretty 
brutal way.  I'm happy to move the change to another bug and improve it but 
testing the highlighter change without it is a bit painful.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Created] (LUCENE-5278) MockTokenizer throws away the character right after a token even if it is a valid start to a new token

2013-10-11 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-5278:
---

 Summary: MockTokenizer throws away the character right after a 
token even if it is a valid start to a new token
 Key: LUCENE-5278
 URL: https://issues.apache.org/jira/browse/LUCENE-5278
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Trivial


MockTokenizer throws away the character right after a token even if it is a 
valid start to a new token.  You won't see this unless you build a tokenizer 
that can recognize every character like with new RegExp(.) or RegExp(...).

Changing this behaviour seems to break a number of tests.
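
A toy illustration of the symptom described above.  This is not MockTokenizer's code; ToyTokenizer, its fixed-length tokens, and the pushBackLookahead flag are all assumptions for this sketch.  Every run of tokenLength characters is a token, so the character that ends one token is always a valid start for the next.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy tokenizer; it only mimics the reported symptom. */
final class ToyTokenizer {
  /**
   * Splits text into tokens of exactly tokenLength characters, similar in
   * spirit to a pattern such as "." (length 1) or "..." (length 3).
   */
  static List<String> tokenize(String text, int tokenLength, boolean pushBackLookahead) {
    List<String> tokens = new ArrayList<>();
    StringBuilder token = new StringBuilder();
    int pos = 0;
    while (pos < text.length()) {
      char c = text.charAt(pos++);
      if (token.length() < tokenLength) {
        token.append(c); // still inside the current token
        continue;
      }
      // c is the first character that cannot extend the token, so the token ends here.
      tokens.add(token.toString());
      token.setLength(0);
      if (pushBackLookahead) {
        pos--; // re-read c as the start of the next token
      }
      // Without the push back, c is silently dropped.
    }
    if (token.length() == tokenLength) {
      tokens.add(token.toString()); // emit a complete token left at end of input
    }
    return tokens;
  }
}
```

In this toy, tokenize("abcd", 1, false) loses every second character, while passing true for pushBackLookahead returns all four.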






[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13792974#comment-13792974
 ] 

Nik Everett commented on LUCENE-5274:
-

Filed LUCENE-5278.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Updated] (LUCENE-5278) MockTokenizer throws away the character right after a token even if it is a valid start to a new token

2013-10-11 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5278:


Attachment: LUCENE-5278.patch

This patch fixes the behaviour from my perspective but breaks a bunch of 
other tests.

 MockTokenizer throws away the character right after a token even if it is a 
 valid start to a new token
 --

 Key: LUCENE-5278
 URL: https://issues.apache.org/jira/browse/LUCENE-5278
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Nik Everett
Priority: Trivial
 Attachments: LUCENE-5278.patch





[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13793018#comment-13793018
 ] 

Nik Everett commented on LUCENE-5274:
-

{quote}
I can see the possible use case here, but I think it deserves some discussion 
first (versus just making it public).
{quote}
Sure!  I'm more used to Guava's tools so I think I was lulled into a false 
sense of recognition.  No chance of updating to a modern version of Guava? :)

{quote}
This thing has limitations (its currently only used by indexwriter for 
buffereddeletes, its basically like a MultiTerms over an Iterator). For example 
each iterator it consumes should not have duplicate values according to its 
compareTo(): its not clear to me this WeightedPhraseInfo behaves this way
{quote}
Yikes!  I didn't catch that but now that you point it out it is right there in 
the docs and I should have.  WeightedPhraseInfo doesn't behave that way and 

{quote}
Furthermore the class in question (WeightedPhraseInfo) is public, and adding 
Comparable to it looks like it will create a situation where its inconsistent 
with equals()... I think this is a little dangerous.
{quote}
I agree on the inconsistency with equals.  I can either fix 
that or use a Comparator for sorting both WeightedPhraseInfo and Toffs.  That'd 
require a MergeSorter that can take one but 

{quote}
If it turns out we can reuse it: great! But i think rather than just slapping 
public on it, we should move it to .util, ensure it has good javadocs and unit 
tests, and investigate what exactly happens when these contracts are violated: 
e.g. can we make an exception happen rather than just broken behavior in a way 
that won't hurt performance and so on?
{quote}
Makes sense to me.
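
For reference, a minimal sketch of a value class whose compareTo is consistent with equals and hashCode.  OffsetRange is a hypothetical stand-in, not Lucene's Toffs or WeightedPhraseInfo, and the choice of fields is an assumption.

```java
import java.util.Objects;

/** Hypothetical offsets holder; not Lucene's Toffs. */
final class OffsetRange implements Comparable<OffsetRange> {
  final int start;
  final int end;

  OffsetRange(int start, int end) {
    this.start = start;
    this.end = end;
  }

  /** Orders by start offset, then by end offset. */
  @Override
  public int compareTo(OffsetRange other) {
    int byStart = Integer.compare(start, other.start);
    return byStart != 0 ? byStart : Integer.compare(end, other.end);
  }

  /** Uses the same fields as compareTo, so compareTo == 0 exactly when equals is true. */
  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof OffsetRange)) {
      return false;
    }
    OffsetRange other = (OffsetRange) obj;
    return start == other.start && end == other.end;
  }

  /** Derived from the same fields as equals. */
  @Override
  public int hashCode() {
    return Objects.hash(start, end);
  }
}
```

The alternative mentioned above, passing a Comparator to the merge code, would leave equals untouched at the cost of a merge API that accepts one.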

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-11 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13793038#comment-13793038
 ] 

Nik Everett commented on LUCENE-5274:
-

{quote}
There is no lucene dependency on guava. I don't think we should introduce one, 
and it wouldnt solve the issues i mentioned anyway (e.g. comparable 
inconsistent with equals and stuff). It would only add 2.1MB of bloated 
unnecessary syntactic sugar (sorry, thats just my opinion on it, i think its 
useless).

We should keep our third party dependencies minimal and necessary so that any 
app using lucene can choose for itself what version of this stuff (if any) it 
wants to use. If we rely upon unnecessary stuff it hurts the end user by 
forcing them to compatible versions.
{quote}
I figured that was the reasoning and I don't intend to argue with it.  In this 
case it would provide a method to merge sorted iterators just like 
MergedIterator only without the caveats around duplication but I'm happy to 
work around it.  Guava certainly wouldn't fix my forgetting equals and hashCode.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Created] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-10 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-5274:
---

 Summary: Teach fast FastVectorHighlighter to highlight child 
fields with parent fields
 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Priority: Minor


I've been messing around with the FastVectorHighlighter and it looks like I can 
teach it to highlight matches on child fields.  Like this query:
foo:scissors foo_exact:running
would highlight foo like this:
<em>running</em> with <em>scissors</em>

Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy of 
foo with a different analyzer and its own WITH_POSITIONS_OFFSETS.

This would make queries that perform weighted matches against different 
analyzers much more convenient to highlight.

I have working code and test cases but they are hacked into Elasticsearch.  I'd 
love to Lucene-ify if you'll take them.
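
To make the intended output concrete, here is a small sketch that merges match offsets from a parent field and a child field over one stored source string.  It is not the patch's implementation; HighlightMergeSketch, the int[] {start, end} offset pairs, and the rule for skipping overlapping matches are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of merging highlights from two fields that share one source. */
final class HighlightMergeSketch {
  /** Each match is an int[] {startOffset, endOffset} into source. */
  static String highlight(String source, List<int[]> parentMatches, List<int[]> childMatches) {
    List<int[]> all = new ArrayList<>(parentMatches);
    all.addAll(childMatches);
    all.sort(Comparator.comparingInt((int[] match) -> match[0]));
    StringBuilder out = new StringBuilder();
    int pos = 0;
    for (int[] match : all) {
      if (match[0] < pos) {
        continue; // overlaps text that is already highlighted
      }
      out.append(source, pos, match[0]);
      out.append("<em>").append(source, match[0], match[1]).append("</em>");
      pos = match[1];
    }
    return out.append(source, pos, source.length()).toString();
  }

  /** The example from the description: foo matches "scissors", foo_exact matches "running". */
  static String demo() {
    return highlight(
        "running with scissors",
        Collections.singletonList(new int[] {13, 21}),
        Collections.singletonList(new int[] {0, 7}));
  }
}
```

Here demo() produces the highlighted string shown in the description.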






[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-10 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5274:


Attachment: LUCENE-5274.patch

Patch implementing merging highlights on child fields.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-10 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13791715#comment-13791715
 ] 

Nik Everett commented on LUCENE-5274:
-

I've uploaded a patch for this.  I made the highlighter module depend on the 
query string parser and analyzer modules for testing.  I probably could have 
gotten away without the query string parser but it made the test cases simpler 
to write.  The analyzer module was required to analyze different fields with 
different analyzers which is kind of the point of this feature.  My ant-foo is 
too weak for me to be sure I didn't set up some kind of horrible circular 
dependency that hasn't hit me.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields

2013-10-10 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13791945#comment-13791945
 ] 

Nik Everett commented on LUCENE-5274:
-

> We tend to avoid doing that in order to not have cross or circular 
> dependencies between modules. This is not an issue at this stage of the patch 
> but we should try replacing the analysis components you are using with 
> MockAnalyzer at some point.

The PerFieldAnalyzerWrapper was the thing that pulled me there.  I'd appreciate 
some tips on how to work around that.  I'll have a look at removing the query 
parser dependency.  I'm also using the EnglishAnalyzer but I'm just using that 
to have a third analyzer in the mix.  I'll see about using MockAnalyzer for 
that too.

> I only had a quick look at the patch so far and I'm a bit unsure about 
> childFields. Maybe it would be better API-wise to specify the stored field 
> and the index fields separately? Or maybe to retrieve the index fields from 
> the terms of the query? What do you think?

I don't like retrieving the indexed fields from the query - what if you don't 
want them all?  How can you make sure that the ones that you take from the 
query really do share the same stored copy?

As far as calling out the stored field and the indexed field separately - I 
think I like the idea.  It'd let you load the source from a field that isn't 
actively being highlighted.  I'll have a look at that.

 Teach fast FastVectorHighlighter to highlight child fields with parent 
 fields
 ---

 Key: LUCENE-5274
 URL: https://issues.apache.org/jira/browse/LUCENE-5274
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Nik Everett
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5274.patch





[jira] [Commented] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated

2013-09-26 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778646#comment-13778646
 ] 

Nik Everett commented on LUCENE-5245:
-

Thanks for jumping on this so quickly!

 ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything 
 before query weight is calculated
 -

 Key: LUCENE-5245
 URL: https://issues.apache.org/jira/browse/LUCENE-5245
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.4
Reporter: Nik Everett
Assignee: Uwe Schindler
 Fix For: 5.0, 4.6

 Attachments: LUCENE-5245.patch, LUCENE-5245.patch, LUCENE-5245.patch


 ConstantScoreAutoRewrite rewrites prefix queries that don't match anything 
 before query weight is calculated.  This dramatically changes the resulting 
 score which is bad when comparing scores across different Lucene 
 indexes/shards/whatever.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated

2013-09-25 Thread Nik Everett (JIRA)
Nik Everett created LUCENE-5245:
---

 Summary: ConstantScoreAutoRewrite rewrites prefix queryies that 
don't match anything before query weight is calculated
 Key: LUCENE-5245
 URL: https://issues.apache.org/jira/browse/LUCENE-5245
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.4
Reporter: Nik Everett


ConstantScoreAutoRewrite rewrites prefix queries that don't match anything 
before query weight is calculated.  This dramatically changes the resulting 
score which is bad when comparing scores across different Lucene 
indexes/shards/whatever.




[jira] [Commented] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated

2013-09-25 Thread Nik Everett (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778093#comment-13778093
 ] 

Nik Everett commented on LUCENE-5245:
-

The query norm applied to the constant score query changes.  Say I had a query 
string like foo:findm*^20 bar:findm* and only foo had a result on shard 1 and 
only bar had a result on shard 2.  Both end up with the same score because on 
shard 1 the query is rewritten to foo:findm*^20 (norm = .05) and on shard 2 it 
is rewritten to bar:findm* (norm = 1).
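
The arithmetic can be sketched as follows, assuming the classic query norm of 1 / sqrt(sum of squared clause boosts) and ignoring coord and any other scoring factors.  QueryNormSketch and its method names are hypothetical and only reproduce the numbers quoted above.

```java
/** Hypothetical sketch of the norm arithmetic; not Lucene's scoring code. */
final class QueryNormSketch {
  /** Assumed classic query norm: 1 / sqrt(sum of squared clause boosts). */
  static double queryNorm(double... boosts) {
    double sumOfSquares = 0;
    for (double boost : boosts) {
      sumOfSquares += boost * boost;
    }
    return 1.0 / Math.sqrt(sumOfSquares);
  }

  /** Constant score of one clause: its boost times the norm shared by the whole query. */
  static double score(double boost, double norm) {
    return boost * norm;
  }
}
```

If only foo:findm*^20 survives the rewrite, the norm is 0.05 and the score is 20 * 0.05 = 1.  If only bar:findm* survives, the norm is 1 and the score is again 1.  With both clauses kept, the shared norm is about 0.0499, giving roughly 0.9988 for foo and 0.0499 for bar, so the boost is preserved.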

 ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything 
 before query weight is calculated
 -

 Key: LUCENE-5245
 URL: https://issues.apache.org/jira/browse/LUCENE-5245
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.4
Reporter: Nik Everett




[jira] [Updated] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated

2013-09-25 Thread Nik Everett (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nik Everett updated LUCENE-5245:


Attachment: LUCENE-5245.patch

This fixes my problem but I'm not sure how to set up unit tests in Lucene.

 ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything 
 before query weight is calculated
 -

 Key: LUCENE-5245
 URL: https://issues.apache.org/jira/browse/LUCENE-5245
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.4
Reporter: Nik Everett
 Attachments: LUCENE-5245.patch

