[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents
    [ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16188704#comment-16188704 ]

Nik Everett commented on LUCENE-7976:
-------------------------------------

I had this issue on a previous project. Our indices were smaller than what you are talking about, but we did have one or two of the max-size segments that refused to merge away their deleted documents until they got to 50%. We had a fairly high update rate and a very high query rate. The deleted documents bloated the working set size somewhat, causing more IO, which was our bottleneck at the time. I would have been happy to pay for the increased merge IO to have lower query-time IO.

We ultimately solved the problem by throwing money at it. More RAM and better SSDs make life much easier. I would have liked to have solved the problem in software, but as a very infrequent contributor I didn't feel like I'd ever get a change to TieredMergePolicy merged.

> Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents
> ------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7976
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7976
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>
> We're seeing situations "in the wild" where there are very large indexes (on
> disk) handled quite easily in a single Lucene index. This is particularly
> true as features like docValues move data into MMapDirectory space. The
> current TMP algorithm allows on the order of 50% deleted documents as per a
> dev list conversation with Mike McCandless (and his blog here:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many
> TB) solutions like "you need to distribute your collection over more shards"
> become very costly. Additionally, the tempting "optimize" button exacerbates
> the issue since once you form, say, a 100G segment (by
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the
> docs in it are deleted (current default 5G max segment size).
>
> The proposal here would be to add a new parameter to TMP, something like
> (no, that's not serious name, suggestions
> welcome) which would default to 100 (or the same behavior we have now).
>
> So if I set this parameter to, say, 20%, and the max segment size stays at
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with
> >> smaller segments to bring the resulting segment up to 5G. If no smaller
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize).
> >> It would be rewritten into a single segment removing all deleted docs no
> >> matter how big it is to start. The 100G example above would be rewritten
> >> to an 80G segment for instance.
>
> Of course this would lead to potentially much more I/O which is why the
> default would be the same behavior we see now. As it stands now, though,
> there's no way to recover from an optimize/forceMerge except to re-index from
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the
> wild" with 10s of shards replicated 3 or more times. And that doesn't even
> include having these over HDFS.
>
> Alternatives welcome! Something like the above seems minimally invasive. A
> new merge policy is certainly an alternative.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
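The selection rule proposed in the LUCENE-7976 description can be sketched roughly as follows. This is only an illustration of the two cases described there, not TieredMergePolicy code: the class, the enum, and every parameter name are hypothetical, and estimating "live" size by pro-rating the segment size by document count is an assumption made for the example.

```java
/** Hypothetical sketch of the proposed rule; none of these names are Lucene APIs. */
final class DeletePctSelector {

    /** What the proposed rule would do with one segment. */
    enum Action { LEAVE_ALONE, MERGE_WITH_SMALLER, REWRITE_ALONE }

    /** Percentage of the segment's documents that are marked deleted. */
    static double pctDeleted(int maxDoc, int delCount) {
        return maxDoc == 0 ? 0.0 : 100.0 * delCount / maxDoc;
    }

    /** Estimated size of the live documents, pro-rated by document count (an assumption). */
    static long liveBytes(long sizeBytes, int maxDoc, int delCount) {
        return maxDoc == 0 ? 0L : (long) (sizeBytes * ((double) (maxDoc - delCount) / maxDoc));
    }

    static Action decide(long sizeBytes, int maxDoc, int delCount,
                         double maxPctDeleted, long maxSegmentBytes,
                         boolean smallerSegmentsExist) {
        // At or below the threshold this rule does not apply; a threshold of 100
        // can never be exceeded, which matches "same behavior we have now".
        if (pctDeleted(maxDoc, delCount) <= maxPctDeleted) {
            return Action.LEAVE_ALONE;
        }
        // Case 1: live data fits under the max segment size, so top it up
        // with smaller segments when there are any.
        if (liveBytes(sizeBytes, maxDoc, delCount) < maxSegmentBytes && smallerSegmentsExist) {
            return Action.MERGE_WITH_SMALLER;
        }
        // Case 2 (or case 1 with nothing to merge with): rewrite the segment
        // on its own, dropping the deleted documents, however large it is.
        return Action.REWRITE_ALONE;
    }
}
```

With a 20% threshold and a 5G max segment size, a force-merged 100G segment with a quarter of its documents deleted falls into the rewrite-alone case, which is the extra I/O the description warns about.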
[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets
    [ https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6334:
--------------------------------
    Attachment: LUCENE-6334.patch

Fast Vector Highlighter does not properly span neighboring term offsets
-----------------------------------------------------------------------

                Key: LUCENE-6334
                URL: https://issues.apache.org/jira/browse/LUCENE-6334
            Project: Lucene - Core
         Issue Type: Bug
         Components: core/termvectors, modules/highlighter
           Reporter: Chris Earle
             Labels: easyfix
        Attachments: LUCENE-6334.patch, LUCENE-6334.patch, LUCENE-6334.patch

If you are using term vectors for fast vector highlighting along with a multivalue field while matching a phrase that crosses two elements, then it will not properly highlight even though it _properly_ finds the correct values to highlight.

A good example of this is when matching source code, where you might have lines like:

{code}
one two three five two three four five six five six seven eight nine eight nine eight nine eight nine eight nine eight nine eight nine ten eleven twelve thirteen
{code}

Matching the phrase "four five" will return

{code}
two three four five six five six seven eight nine eight nine eight nine eight nine eight eight nine ten eleven
{code}

However, it does not properly highlight "four" (on the first line) and "five" (on the second line) _and_ it is returning too many lines, but not all of them.

The problem lies in the [BaseFragmentsBuilder at line 269|https://github.com/apache/lucene-solr/blob/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/BaseFragmentsBuilder.java#L269] because it is not checking for cross-coverage. Here is a possible solution:

{code}
boolean started = toffs.getStartOffset() >= fieldStart;
boolean ended = toffs.getEndOffset() <= fieldEnd;

// existing behavior:
if (started && ended) {
    toffsList.add(toffs);
    toffsIterator.remove();
} else if (started) {
    // starts in this field value but runs past its end
    toffsList.add(new Toffs(toffs.getStartOffset(), fieldEnd));
    // toffsIterator.remove(); // is this necessary?
} else if (ended) {
    // starts before this field value but ends inside it
    toffsList.add(new Toffs(fieldStart, toffs.getEndOffset()));
    // toffsIterator.remove(); // is this necessary?
} else if (toffs.getEndOffset() > fieldEnd) {
    // ie the toff spans whole field
    toffsList.add(new Toffs(fieldStart, fieldEnd));
    // toffsIterator.remove(); // is this necessary?
}
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
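The case analysis in the possible solution quoted in the LUCENE-6334 description amounts to clipping a term-offset span to the boundaries of one field value. A minimal, self-contained sketch of that idea follows. `OffsetClipper` and `Span` are hypothetical names standing in for the highlighter's own types, and this is not the patch attached to the issue. Unlike the quoted snippet, the sketch also returns `null` when the span does not touch the field value at all, which is an added assumption.

```java
/** Hypothetical sketch; Span stands in for the highlighter's offset pair type. */
final class OffsetClipper {

    /** A start/end character offset pair. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the part of span that lies inside [fieldStart, fieldEnd],
     * or null when the two ranges do not overlap.
     */
    static Span clip(Span span, int fieldStart, int fieldEnd) {
        // max/min collapse the four cases: fully inside, runs past the end,
        // starts before the beginning, and covers the whole field value.
        int start = Math.max(span.start, fieldStart);
        int end = Math.min(span.end, fieldEnd);
        return start < end ? new Span(start, end) : null;
    }
}
```

A phrase match that starts in one value of a multivalued field and ends in the next would be clipped once per value, giving one highlightable span in each.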
[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets
    [ https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6334:
--------------------------------
    Attachment: LUCENE-6334.patch
[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets
    [ https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6334:
--------------------------------
    Attachment: LUCENE-6334.patch

Test case and fix based on examples and source code provided in problem description. I started with the proposed fix and modified it quite a bit to get something that should get the job done. Also expanded on the proposed test cases to include things like phrases that span entire values.
[jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets
    [ https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6334:
--------------------------------
    Attachment:     (was: LUCENE-6334.patch)
[jira] [Commented] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets
    [ https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643330#comment-14643330 ]

Nik Everett commented on LUCENE-6334:
-------------------------------------

Would anyone object to me having a look at this?
[jira] [Created] (LUCENE-6054) RegExp.toAutomaton fails on #*
Nik Everett created LUCENE-6054:
-----------------------------------

             Summary: RegExp.toAutomaton fails on #*
                 Key: LUCENE-6054
                 URL: https://issues.apache.org/jira/browse/LUCENE-6054
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Nik Everett
            Priority: Minor

This throws an assertion error:
new RegExp("#*").toAutomaton(1000);
[jira] [Updated] (LUCENE-6054) RegExp.toAutomaton fails on #*
    [ https://issues.apache.org/jira/browse/LUCENE-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6054:
--------------------------------
    Attachment: LUCENE-6054.diff
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
    [ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196070#comment-14196070 ]

Nik Everett commented on LUCENE-6046:
-------------------------------------

A couple of updates:

This affects version 4.9 as well. Probably all versions. But its impact is likely minor enough to only be worth adding to the 4.10 line.

I found a few test cases that need lots and lots of states. Any time you feed a couple hundred random Unicode words to the automata you'll end up needing more than ten thousand states. I've updated those tests to ask for a million states and they caught a few places where I hadn't been as diligent in piping maxDeterminizedStates through.

RegExp.toAutomaton high memory use
----------------------------------

                Key: LUCENE-6046
                URL: https://issues.apache.org/jira/browse/LUCENE-6046
            Project: Lucene - Core
         Issue Type: Bug
         Components: core/queryparser
  Affects Versions: 4.10.1
           Reporter: Lee Hinman
           Assignee: Michael McCandless
           Priority: Minor
        Attachments: LUCENE-6046.patch, LUCENE-6046.patch, LUCENE-6046.patch

When creating an automaton from an org.apache.lucene.util.automaton.RegExp, it's possible for the automaton to use so much memory it exceeds the maximum array size for java.

The following caused an OutOfMemoryError with a 32gb heap:

{noformat}
new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
{noformat}

When increased to a 60gb heap, the following exception is thrown:

{noformat}
1 java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
1 __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
1 org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
1 org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
1 org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
1 org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
1 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
1 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
1 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
1 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)
{noformat}
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
    [ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194484#comment-14194484 ]

Nik Everett commented on LUCENE-6046:
-------------------------------------

I'm working on a first cut of something that does that. A better regex implementation would be great, but the biggest thing to me is being able to limit the amount of work the determinize operation performs. It's such a costly operation that I don't think it should ever be really abstracted from the user. Something like having determinize throw a checked exception when it performs too much work would make you keenly aware whenever you might be straying into exponential territory.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
    [ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194592#comment-14194592 ]

Nik Everett commented on LUCENE-6046:
-------------------------------------

Oh yeah, it's totally running into 2^n territory legitimately here. This is totally something that'd be rejected by a framework to prevent explosive growth during determinization.
[jira] [Updated] (LUCENE-6046) RegExp.toAutomaton high memory use
    [ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nik Everett updated LUCENE-6046:
--------------------------------
    Attachment: LUCENE-6046.patch

First cut at a patch. Adds maxDeterminizedStates to Operations.determinize and pipes it through to tons of places. I think it's important never to hide when determinize is called because of how potentially heavy it is. Forcing callers of MinimizationOperations.minimize, Operations.reverse, Operations.minus etc. to specify maxDeterminizedStates makes it pretty clear that the automaton might be determinized during those processes.

I added an unchecked exception for when the Automaton can't be determinized within the specified number of states, but I'm really tempted to change it to a checked exception to make it super duper obvious when determinization might occur.
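To illustrate why a state budget on determinization helps, here is a small, self-contained sketch of subset construction that gives up once it has created more than a caller-supplied number of states. It is not Lucene's `Operations.determinize`: the NFA representation, `BoundedDeterminizer`, `TooManyStatesException`, and `nthFromLast` are all hypothetical names invented for this example. The test NFA accepts strings over {a, b} whose n-th character from the end is `a`; it has n + 1 states, while its determinized form has 2^n.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of budgeted subset construction; not Lucene code. */
final class BoundedDeterminizer {

    /** Unchecked, like the exception described in the comment above (the name is made up). */
    static final class TooManyStatesException extends RuntimeException {
        TooManyStatesException(int maxStates) {
            super("determinizing would need more than " + maxStates + " states");
        }
    }

    /**
     * Runs subset construction over an epsilon-free NFA given as
     * nfa.get(state) = map from input character to the set of next states.
     * Returns the number of DFA states, or throws once the budget is exceeded.
     */
    static int determinize(List<Map<Character, Set<Integer>>> nfa, int start, int maxStates) {
        Set<Set<Integer>> seen = new HashSet<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> initial = new TreeSet<>(Collections.singletonList(start));
        seen.add(initial);
        work.add(initial);
        while (!work.isEmpty()) {
            Set<Integer> current = work.poll();
            // Union the targets of every NFA state in this subset, per input character.
            Map<Character, Set<Integer>> moves = new HashMap<>();
            for (int state : current) {
                for (Map.Entry<Character, Set<Integer>> e : nfa.get(state).entrySet()) {
                    moves.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            for (Set<Integer> target : moves.values()) {
                if (!seen.contains(target)) {
                    if (seen.size() >= maxStates) {
                        throw new TooManyStatesException(maxStates); // stop before memory blows up
                    }
                    seen.add(target);
                    work.add(target);
                }
            }
        }
        return seen.size();
    }

    /** NFA for "the n-th character from the end is 'a'": states 0..n, start state 0. */
    static List<Map<Character, Set<Integer>>> nthFromLast(int n) {
        List<Map<Character, Set<Integer>>> nfa = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            nfa.add(new HashMap<>());
        }
        nfa.get(0).put('a', new TreeSet<>(Arrays.asList(0, 1))); // guess: this 'a' is the n-th from last
        nfa.get(0).put('b', new TreeSet<>(Collections.singletonList(0)));
        for (int i = 1; i < n; i++) {
            nfa.get(i).put('a', new TreeSet<>(Collections.singletonList(i + 1)));
            nfa.get(i).put('b', new TreeSet<>(Collections.singletonList(i + 1)));
        }
        return nfa;
    }
}
```

The budget is checked before each new subset is stored, so a pattern that is legitimately exponential fails fast with a clear error instead of exhausting the heap.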
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
    [ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194716#comment-14194716 ]

Nik Everett commented on LUCENE-6046:
-------------------------------------

Oh - I'm still running the Solr tests against this. I imagine they'll pass as they've been running fine for 30 minutes now, but I should throw that out there in case someone gets them to fail with this before I do.
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195033#comment-14195033 ] Nik Everett commented on LUCENE-6046:
Oh no! I wrote a very similar patch! Sorry to duplicate effort there. I found that 10,000 states wasn't quite enough to handle some of the tests, so I went with 1,000,000 as the default. It's pretty darn huge, but it does get the job done.
RegExp.toAutomaton high memory use
Key: LUCENE-6046 URL: https://issues.apache.org/jira/browse/LUCENE-6046 Project: Lucene - Core Issue Type: Bug Components: core/queryparser Affects Versions: 4.10.1 Reporter: Lee Hinman Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-6046.patch, LUCENE-6046.patch
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195056#comment-14195056 ] Nik Everett commented on LUCENE-6046:
TestDeterminizeLexicon wants to build an automaton that accepts 5000 random strings, so 10,000 states isn't enough for it. I'll drop the default limit back to 10,000 and just pass a million to that test case.
RegExp.toAutomaton high memory use
Key: LUCENE-6046 URL: https://issues.apache.org/jira/browse/LUCENE-6046 Project: Lucene - Core Issue Type: Bug Components: core/queryparser Affects Versions: 4.10.1 Reporter: Lee Hinman Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-6046.patch, LUCENE-6046.patch
[jira] [Commented] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195065#comment-14195065 ] Nik Everett commented on LUCENE-6046:
I'll certainly add the regexp string to the exception message, and I'll merge the toStringTree from your patch into mine if you'd like. Yeah, QueryParserBase should have this option too. The thing I found most useful for debugging this was to call toDot on the automaton before and after normalization. I just looked at it and went, oh, of course you have to do it that way. No wonder the states explode. And then I read https://en.wikipedia.org/wiki/Powerset_construction and remembered it from my rusty CS degree.
RegExp.toAutomaton high memory use
Key: LUCENE-6046 URL: https://issues.apache.org/jira/browse/LUCENE-6046 Project: Lucene - Core Issue Type: Bug Components: core/queryparser Affects Versions: 4.10.1 Reporter: Lee Hinman Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-6046.patch, LUCENE-6046.patch
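The state explosion discussed in this thread can be reproduced without Lucene. What follows is a minimal sketch, not the patch from the issue: it runs the powerset construction on a hand-built NFA for the textbook pattern (a|b)*a(a|b){n} and aborts once a hypothetical maxStates cap is exceeded, in the spirit of the 10,000-state default mentioned above. The class and method names are made up for illustration and are not Lucene APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class PowersetSketch {
    // NFA for (a|b)*a(a|b){n}: states 0..n+1, state 0 is initial, state n+1 accepts.
    // State 0 loops on 'a' and 'b' and also moves to 1 on 'a';
    // state i (1..n) moves to i+1 on either character.
    static Set<Integer> step(Set<Integer> from, char c, int n) {
        Set<Integer> to = new TreeSet<>();
        for (int s : from) {
            if (s == 0) {
                to.add(0);
                if (c == 'a') {
                    to.add(1);
                }
            } else if (s <= n) {
                to.add(s + 1);
            }
        }
        return to;
    }

    // Powerset construction: every reachable set of NFA states becomes one DFA state.
    // Returns the number of DFA states, or throws once more than maxStates exist.
    static int determinizedStateCount(int n, int maxStates) {
        Set<Set<Integer>> seen = new HashSet<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>();
        start.add(0);
        seen.add(start);
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            for (char c : new char[] {'a', 'b'}) {
                Set<Integer> next = step(current, c, n);
                if (seen.add(next)) {
                    if (seen.size() > maxStates) {
                        throw new IllegalStateException(
                            "determinizing needs more than " + maxStates + " states");
                    }
                    work.add(next);
                }
            }
        }
        return seen.size();
    }
}
```

With n = 10 the NFA has 12 states but the DFA has 2^11 = 2048, because it has to remember which of the last 11 characters were 'a'. With n = 20 the cap of 10,000 trips long before the roughly two million states are materialized, which is the behavior you want instead of an OutOfMemoryError.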
[jira] [Updated] (LUCENE-6046) RegExp.toAutomaton high memory use
[ https://issues.apache.org/jira/browse/LUCENE-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-6046: Attachment: LUCENE-6046.patch
Next version with fixes based on Mike's feedback.
RegExp.toAutomaton high memory use
Key: LUCENE-6046 URL: https://issues.apache.org/jira/browse/LUCENE-6046 Project: Lucene - Core Issue Type: Bug Components: core/queryparser Affects Versions: 4.10.1 Reporter: Lee Hinman Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-6046.patch, LUCENE-6046.patch, LUCENE-6046.patch
[jira] [Commented] (LUCENE-4556) FuzzyTermsEnum creates tons of objects
[ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011014#comment-14011014 ] Nik Everett commented on LUCENE-4556:
I'm having GC trouble and I'm using the DirectCandidateGenerator. It's obviously kind of hard to tell how much the automata are contributing in production, but when I try it locally, just generating the automata for two or three terms takes about 200KB of memory. Napkin math (200KB * 250 queries/second) says this makes about 50MB of garbage per second per index. Obviously it gets worse if you run this in a sharded context where each shard does the generating. Well, not really worse, but the large up-front cost and memory consumption of this process is roughly constant regardless of shard size, so this becomes a reason to use larger shards.
I'd propose that, in addition to Simon's patches, another option is to implement something like the stack-based automaton simulation that the Schulz and Mihov paper (the one that proposed the Levenshtein automaton) describes in section 6. It's not useful for things like intersecting the enums, but if you are willing to forgo that, you could probably get away with much less memory consumption.
FuzzyTermsEnum creates tons of objects
Key: LUCENE-4556 URL: https://issues.apache.org/jira/browse/LUCENE-4556 Project: Lucene - Core Issue Type: Improvement Components: core/search, modules/spellchecker Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Michael McCandless Priority: Critical Fix For: 4.9, 5.0 Attachments: LUCENE-4556.patch, LUCENE-4556.patch
I ran into this problem in production using the DirectSpellchecker. The number of objects created by the spellchecker shoots through the roof very quickly. We ran about 130 queries and ended up with 2M transitions / states. We spend 50% of the time in GC just because of transitions. Other parts of the system behave just fine here. I talked quickly to Robert and gave a POC a shot, providing a LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case and build an array-based structure converted into UTF-8 directly instead of going through the object-based APIs. This involved quite a few changes, but they are all package private at this point. I have a patch that still has a fair set of nocommits, but it shows that it's possible and IMO worth the trouble to make this really usable in production. All tests pass with the patch. It's a start.
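To make the memory trade-off in the comment above concrete, here is a rough sketch of checking candidates against a term without building any automaton. It is not the simulation from the Schulz and Mihov paper. It swaps in a plainer technique, the classic edit-distance dynamic program kept in a single reusable row with an early exit, and the class and method names are hypothetical. Like the approach described in the comment, it gives up intersection with the terms enum, since every candidate has to be tested individually, in exchange for allocating only one small int array per check.

```java
class BoundedEditDistance {
    // Returns true if candidate is within maxEdits insertions, deletions, or
    // substitutions of term. Only one int[] row is allocated per call.
    static boolean within(String term, String candidate, int maxEdits) {
        if (Math.abs(term.length() - candidate.length()) > maxEdits) {
            return false; // the length difference alone already costs too many edits
        }
        int[] row = new int[term.length() + 1];
        for (int j = 0; j <= term.length(); j++) {
            row[j] = j;
        }
        for (int i = 1; i <= candidate.length(); i++) {
            int diagonal = row[0]; // previous row's value at column j - 1
            row[0] = i;
            int rowMin = row[0];
            for (int j = 1; j <= term.length(); j++) {
                int above = row[j]; // previous row's value at column j
                int cost = candidate.charAt(i - 1) == term.charAt(j - 1) ? 0 : 1;
                row[j] = Math.min(Math.min(above + 1, row[j - 1] + 1), diagonal + cost);
                diagonal = above;
                rowMin = Math.min(rowMin, row[j]);
            }
            if (rowMin > maxEdits) {
                return false; // row minima never decrease, so no later row can recover
            }
        }
        return row[term.length()] <= maxEdits;
    }
}
```

For example, `BoundedEditDistance.within("kitten", "sitting", 3)` accepts the candidate while a budget of 2 rejects it.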
[jira] [Created] (LUCENE-5452) Combine matches from multiple fields into one with the postings highlighter
Nik Everett created LUCENE-5452:
Summary: Combine matches from multiple fields into one with the postings highlighter Key: LUCENE-5452 URL: https://issues.apache.org/jira/browse/LUCENE-5452 Project: Lucene - Core Issue Type: Improvement Components: core/search Reporter: Nik Everett Priority: Minor
Like you can do with the FVH, it'd be nice to be able to combine matches from multiple fields with the postings highlighter. Note that the postings highlighter doesn't do phrase matching and doesn't use term boosts, so some of the FVH's field-combining features won't work. It'd be nice to get some of them, though.
[jira] [Commented] (LUCENE-5452) Combine matches from multiple fields into one with the postings highlighter
[ https://issues.apache.org/jira/browse/LUCENE-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904610#comment-13904610 ] Nik Everett commented on LUCENE-5452:
I hadn't really thought of doing it a level above. I like the idea. The only thing that jumps out at me about doing it this way is that there is only a single priority queue rather than multiple queues that have to be maintained and merged. I'm not sure if that outweighs the extra API complexity this adds. I'm also pretty sure the higher-level approach is more likely to keep the careful linear reads that the PostingsHighlighter does.
Combine matches from multiple fields into one with the postings highlighter
Key: LUCENE-5452 URL: https://issues.apache.org/jira/browse/LUCENE-5452 Project: Lucene - Core Issue Type: Improvement Components: core/search Reporter: Nik Everett Priority: Minor Attachments: LUCENE-5452.patch
Like you can do with the FVH, it'd be nice to be able to combine matches from multiple fields with the postings highlighter. Note that the postings highlighter doesn't do phrase matching and doesn't use term boosts, so some of the FVH's field-combining features won't work. It'd be nice to get some of them, though.
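As a rough illustration of the "single priority queue" idea in the comment above, the sketch below merges per-field match offsets into one ordered list with one queue. It assumes each field contributes a list of {start, end} offsets into the same text, already sorted by start and then end. The types and names are invented for this example and do not reflect how the PostingsHighlighter or the attached patch is structured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class FieldMatchMerge {
    // One cursor per field: that field's sorted matches plus a read position.
    static final class Cursor {
        final List<int[]> matches;
        int index;

        Cursor(List<int[]> matches) {
            this.matches = matches;
        }

        int[] current() {
            return matches.get(index);
        }
    }

    // Merges per-field match lists into one list ordered by start (then end)
    // offset, dropping matches that exactly duplicate the previous one.
    static List<int[]> merge(List<List<int[]>> perField) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
            (a, b) -> a.current()[0] != b.current()[0]
                ? Integer.compare(a.current()[0], b.current()[0])
                : Integer.compare(a.current()[1], b.current()[1]));
        for (List<int[]> matches : perField) {
            if (!matches.isEmpty()) {
                queue.add(new Cursor(matches));
            }
        }
        List<int[]> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            int[] match = top.current();
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last == null || last[0] != match[0] || last[1] != match[1]) {
                merged.add(match);
            }
            // Advance this field's cursor and re-queue it if matches remain.
            if (++top.index < top.matches.size()) {
                queue.add(top);
            }
        }
        return merged;
    }
}
```

Each field's list is read strictly front to back, which is the kind of linear access pattern the comment hopes to preserve.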
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Attachment: (was: LUCENE-5437.patch)
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Attachment: LUCENE-5437.patch
Uploading new diff with changes Simon asked for.
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Attachment: (was: LUCENE-5437.patch)
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Commented] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894500#comment-13894500 ] Nik Everett commented on LUCENE-5437:
I thought about that, but my instinct was that duplicating with the keyword attribute would add overhead in the case where there aren't characters to fold, which is by far the more common case. I think you'd also have to make supporting the keyword attribute optional so it wouldn't break backwards compatibility. I figured optionally supporting the keyword attribute would be about the same amount of work/code as only duplicating when required, so I went that way. I went with adding the extra class and moving the real implementation to an abstract base class more out of a desire to be minimally invasive to the original than anything technical.
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
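The "only duplicate when required" behavior argued for above can be sketched on plain strings. This is an illustration under stated assumptions, not the patch: folding is approximated with java.text.Normalizer (decompose, then drop combining marks) rather than ASCIIFoldingFilter's mapping tables, tokens are plain strings rather than a TokenStream, and the output order (folded first, then the original) is an arbitrary choice for the example.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class FoldingSketch {
    // Simplified stand-in for ASCII folding: decompose, then drop combining marks.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Emits the folded token, plus the original only when folding changed it,
    // so tokens with nothing to fold cost no extra copy.
    static List<String> foldPreservingOriginal(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String folded = fold(token);
            out.add(folded);
            if (!folded.equals(token)) {
                // In a real token stream the original would presumably be
                // stacked at the same position as the folded token.
                out.add(token);
            }
        }
        return out;
    }
}
```

A token such as "bar" passes through once, while "caf\u00e9" yields both "cafe" and the original, which is the common-case saving the comment is after.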
[jira] [Commented] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894563#comment-13894563 ] Nik Everett commented on LUCENE-5437:
I suppose I'm just used to abstract classes, but you are right, the delegate would work better here. I'll make that change. Before I do, though, does my argument (more instinct, really) about only cloning the token if there is anything to fold make sense? If not, I'll just add support for the keyword attribute with a version check.
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Attachment: LUCENE-5437.patch
Patch that uses a simple boolean rather than crazy subclassing.
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch, LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Attachment: (was: LUCENE-5437.patch)
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch, LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Attachment: LUCENE-5437.patch
Minor improvement in the names of things in the tests.
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Assignee: Simon Willnauer Priority: Minor Attachments: LUCENE-5437.patch, LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Priority: Minor (was: Major)
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Priority: Minor
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Created] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
Nik Everett created LUCENE-5437:
Summary: ASCIIFoldingFilter that emits both unfolded and folded tokens Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5437) ASCIIFoldingFilter that emits both unfolded and folded tokens
[ https://issues.apache.org/jira/browse/LUCENE-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5437: Attachment: LUCENE-5437.patch
Sorry for moving so much code.
ASCIIFoldingFilter that emits both unfolded and folded tokens
Key: LUCENE-5437 URL: https://issues.apache.org/jira/browse/LUCENE-5437 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Priority: Minor Attachments: LUCENE-5437.patch
I've found myself wanting an ASCIIFoldingFilter that emits both the folded tokens and the original, unfolded tokens.
[jira] [Updated] (LUCENE-5435) CommonTermsQuery should be able to query fields other than the one used as a source of commonness
[ https://issues.apache.org/jira/browse/LUCENE-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5435: Attachment: LUCENE-5435.patch
CommonTermsQuery should be able to query fields other than the one used as a source of commonness
Key: LUCENE-5435 URL: https://issues.apache.org/jira/browse/LUCENE-5435 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Attachments: LUCENE-5435.patch
It'd be wonderful if I could use the commonness of a term in one field, say the contents of a document, to power a search across both the document and its title. Continuing the example, I'd like to be able to build a query like "the first" that is rewritten into (title:the OR body:the) +(title:first OR body:first) with the help of the CommonTermsQuery logic. Essentially, I'd like CommonTermsQuery to soften the implicit AND for "the" into an OR because it is common.
[jira] [Updated] (LUCENE-5435) CommonTermsQuery should be able to query fields other than the one used as a source of commonness
[ https://issues.apache.org/jira/browse/LUCENE-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5435: Priority: Minor (was: Major)
CommonTermsQuery should be able to query fields other than the one used as a source of commonness
Key: LUCENE-5435 URL: https://issues.apache.org/jira/browse/LUCENE-5435 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Priority: Minor Attachments: LUCENE-5435.patch
It'd be wonderful if I could use the commonness of a term in one field, say the contents of a document, to power a search across both the document and its title. Continuing the example, I'd like to be able to build a query like "the first" that is rewritten into (title:the OR body:the) +(title:first OR body:first) with the help of the CommonTermsQuery logic. Essentially, I'd like CommonTermsQuery to soften the implicit AND for "the" into an OR because it is common.
[jira] [Created] (LUCENE-5435) CommonTermsQuery should be able to query fields other than the one used as a source of commonness
Nik Everett created LUCENE-5435:
Summary: CommonTermsQuery should be able to query fields other than the one used as a source of commonness Key: LUCENE-5435 URL: https://issues.apache.org/jira/browse/LUCENE-5435 Project: Lucene - Core Issue Type: Improvement Reporter: Nik Everett Attachments: LUCENE-5435.patch
It'd be wonderful if I could use the commonness of a term in one field, say the contents of a document, to power a search across both the document and its title. Continuing the example, I'd like to be able to build a query like "the first" that is rewritten into (title:the OR body:the) +(title:first OR body:first) with the help of the CommonTermsQuery logic. Essentially, I'd like CommonTermsQuery to soften the implicit AND for "the" into an OR because it is common.
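The rewrite requested above can be sketched as string building. This is a minimal illustration, assuming document frequencies for the source field are available in a map and that a hypothetical maxDocFreq cutoff decides what counts as common. It produces query text only and does not use CommonTermsQuery or any other Lucene class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CommonTermsSketch {
    // Searches every term in every field. Terms whose document frequency in the
    // source field exceeds maxDocFreq are treated as common and left optional;
    // all other terms are required ("+").
    static String rewrite(List<String> terms, Map<String, Integer> sourceFieldDocFreq,
                          int maxDocFreq, List<String> fields) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            String group = "(" + String.join(" OR ", perField) + ")";
            boolean common = sourceFieldDocFreq.getOrDefault(term, 0) > maxDocFreq;
            clauses.add(common ? group : "+" + group);
        }
        return String.join(" ", clauses);
    }
}
```

With "the" appearing in 900 body documents, "first" in 40, and a cutoff of 100, the terms "the first" over the fields title and body come out as (title:the OR body:the) +(title:first OR body:first), matching the example in the issue.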
[jira] [Commented] (LUCENE-5361) FVH throws away some boosts
[ https://issues.apache.org/jira/browse/LUCENE-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865566#comment-13865566 ] Nik Everett commented on LUCENE-5361:
Wonderful! Thanks.
FVH throws away some boosts
Key: LUCENE-5361 URL: https://issues.apache.org/jira/browse/LUCENE-5361 Project: Lucene - Core Issue Type: Bug Reporter: Nik Everett Assignee: Adrien Grand Priority: Minor Fix For: 4.6.1 Attachments: LUCENE-5361.patch
The FVH's FieldQuery throws away some boosts when flattening queries, including DisjunctionMaxQuery and BooleanQuery queries. Fragments generated against queries containing boosted boolean queries don't end up sorted correctly.
[jira] [Created] (LUCENE-5361) FVH throws away some boosts
Nik Everett created LUCENE-5361:
Summary: FVH throws away some boosts Key: LUCENE-5361 URL: https://issues.apache.org/jira/browse/LUCENE-5361 Project: Lucene - Core Issue Type: Bug Reporter: Nik Everett Priority: Minor
The FVH's FieldQuery throws away some boosts when flattening queries, including DisjunctionMaxQuery and BooleanQuery queries. Fragments generated against queries containing boosted boolean queries don't end up sorted correctly.
[jira] [Updated] (LUCENE-5361) FVH throws away some boosts
[ https://issues.apache.org/jira/browse/LUCENE-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5361: Attachment: LUCENE-5361.patch
Fix the issue by pushing boosts from parent queries to child queries when the parent queries are flattened. I clone the child queries before setting their boost so I don't break anything that expects them unchanged. I'm not super happy that I have to clone the queries, but it seemed like the simplest solution.
FVH throws away some boosts
Key: LUCENE-5361 URL: https://issues.apache.org/jira/browse/LUCENE-5361 Project: Lucene - Core Issue Type: Bug Reporter: Nik Everett Priority: Minor Attachments: LUCENE-5361.patch
The FVH's FieldQuery throws away some boosts when flattening queries, including DisjunctionMaxQuery and BooleanQuery queries. Fragments generated against queries containing boosted boolean queries don't end up sorted correctly.
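The fix described above, pushing parent boosts down to copies of the children while flattening, can be sketched on a toy query tree. Everything here is hypothetical: Node stands in for Lucene's Query classes, and boosts are assumed to combine multiplicatively down the tree. The point is only that the flattened leaves carry the full boost and the input tree is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

class BoostFlattenSketch {
    // Minimal query tree: a leaf has a term, a parent has children; both carry a boost.
    static final class Node {
        final String term; // null for a parent
        final float boost;
        final List<Node> children = new ArrayList<>();

        Node(String term, float boost) {
            this.term = term;
            this.boost = boost;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Flattens the tree into leaves, multiplying each leaf's boost by the boosts
    // of all its ancestors. Leaves are copied, so the input is never modified.
    static List<Node> flatten(Node node, float inheritedBoost, List<Node> out) {
        float boost = inheritedBoost * node.boost;
        if (node.term != null) {
            out.add(new Node(node.term, boost));
        } else {
            for (Node child : node.children) {
                flatten(child, boost, out);
            }
        }
        return out;
    }
}
```

Flattening a parent with boost 2 over leaves "a" (boost 1) and "b" (boost 3) yields copies with boosts 2 and 6, while the original "b" still reports 3. Without the push-down step both leaves would be scored as if the parent boost did not exist, which is the mis-sorted fragment symptom in the issue.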
[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields
[ https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5285: Attachment: LUCENE-5285.patch
Ah! += yeah. This fixes it and improves the test so it would notice the difference.
FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields
Key: LUCENE-5285 URL: https://issues.apache.org/jira/browse/LUCENE-5285 Project: Lucene - Core Issue Type: Bug Reporter: Nik Everett Priority: Minor Attachments: LUCENE-5285.patch, LUCENE-5285.patch
FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields. This is only a problem when you want to sort the fragments by score. Technically BaseFragmentsBuilder (line 261 in my copy of the source) does the copying. Rather than copying the score I _think_ it'd be more right to pull that copying logic into a protected method that child classes (such as ScoreOrderFragmentsBuilder) can override to do more intelligent things. Exactly what that means isn't clear to me at the moment.
[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields
[ https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5285: Attachment: (was: LUCENE-5285.patch) Attachments: LUCENE-5285.patch
[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields
[ https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5285: Attachment: LUCENE-5285.patch New patch fixes my broken WeightedFragList change and expands WeightedFragListBuilderTest to catch the broken implementation. Attachments: LUCENE-5285.patch, LUCENE-5285.patch
[jira] [Commented] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields
[ https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802776#comment-13802776 ] Nik Everett commented on LUCENE-5285: I realized last night that I did the WeightedFragList incorrectly in that patch. I'll upload another one as time permits. Attachments: LUCENE-5285.patch
[jira] [Updated] (LUCENE-5285) FastVectorHighlighter copies segments scores when splitting segments across multi-valued fields
[ https://issues.apache.org/jira/browse/LUCENE-5285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5285: Attachment: LUCENE-5285.patch This adds a boost member to FieldFragList's SubInfo, which is its contribution to the WeightedFragInfo's boost. When splitting a WeightedFragInfo across fields, the new info's score is the sum of the boosts of all the SubInfos it contains. Attachments: LUCENE-5285.patch
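A minimal sketch of the idea in this patch comment, assuming each sub-match records its own boost and a fragment's score is recomputed as a sum after a split instead of being copied from the parent fragment. `SketchSubInfo`, `SketchFragInfo` and `splitAt` are hypothetical names for illustration only; the real SubInfo and WeightedFragInfo classes carry more state than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a highlighted sub-match; not Lucene's real SubInfo.
final class SketchSubInfo {
  final int startOffset;
  final float boost; // this sub-match's contribution to the fragment score

  SketchSubInfo(int startOffset, float boost) {
    this.startOffset = startOffset;
    this.boost = boost;
  }
}

// Hypothetical stand-in for a scored fragment.
final class SketchFragInfo {
  final List<SketchSubInfo> subInfos;
  final float totalBoost;

  SketchFragInfo(List<SketchSubInfo> subInfos) {
    this.subInfos = subInfos;
    float sum = 0f;
    for (SketchSubInfo sub : subInfos) {
      sum += sub.boost; // accumulate rather than copy the parent's score
    }
    this.totalBoost = sum;
  }

  // Splits this fragment at a field-value boundary given as a character offset.
  List<SketchFragInfo> splitAt(int boundary) {
    List<SketchSubInfo> left = new ArrayList<>();
    List<SketchSubInfo> right = new ArrayList<>();
    for (SketchSubInfo sub : subInfos) {
      (sub.startOffset < boundary ? left : right).add(sub);
    }
    List<SketchFragInfo> result = new ArrayList<>();
    if (!left.isEmpty()) result.add(new SketchFragInfo(left));
    if (!right.isEmpty()) result.add(new SketchFragInfo(right));
    return result;
  }
}
```

With this shape, each piece of a split fragment only gets credit for the matches it actually contains, so sorting fragments by score no longer ranks both halves as if each held every match.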
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.patch Change codestyle on MergedIterator changes and TestMergedIterator to match the style in the rest of core. The FVH changes still use the wide style prevalent in the FVH code. Also, sort fewer numbers in TestMergedIterator to make it faster. The only reason I was sorting so many the first time around was to get a good sense of what I was doing to the speed by adding the additional conditional. Teach fast FastVectorHighlighter to highlight child fields with parent fields --- Key: LUCENE-5274 URL: https://issues.apache.org/jira/browse/LUCENE-5274 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Nik Everett Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-5274.patch I've been messing around with the FastVectorHighlighter and it looks like I can teach it to highlight matches on child fields. Like this query: foo:scissors foo_exact:running would highlight foo like this: <em>running</em> with <em>scissors</em> Where foo is stored WITH_POSITIONS_OFFSETS and foo_exact is an unstored copy of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS. This would make queries that perform weighted matches against different analyzers much more convenient to highlight. I have working code and test cases but they are hacked into Elasticsearch. I'd love to Lucene-ify if you'll take them.
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274.patch) Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.clean.patch Attached a version of the patch that applies cleanly but doesn't clearly show the changes to MergedIterator. I built it by svn rm and svn add rather than svn mv + edit. Attachments: LUCENE-5274.clean.patch, LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274.clean.patch) Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274.patch) Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.patch Finally switched text to being generated on the fly. No other changes. The patch _should_ apply cleanly but, like the last one, doesn't clearly show what I changed in MergedIterator. Attachments: LUCENE-5274.patch
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798512#comment-13798512 ] Nik Everett commented on LUCENE-5274:
> (it removed MergedIterator.java)
It was supposed to move it to the util package. I'll figure out what happened there. I agree with the other points but the last one is worth discussing. For the others I'll just make the changes you mention. I intentionally didn't update text in WeightedPhraseInfo.merge because it is documented as being for debugging, so it didn't seem worth the cost. Would it make sense to remove the member entirely and generate it from stored terms when needed? It also doesn't update seqnum, mostly because I really don't know the right way to update it. As for WeightedPhraseInfo's immutability - I didn't see any final members, so setting up the state in the constructor and not having setters looked more like a way to encapsulate logic than a commitment to immutability. I'll swap the merge method for a merging constructor. Attachments: LUCENE-5274.patch
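The two design points raised in this comment, a merging constructor in place of a mutating merge method and a debugging text that is built from stored terms on demand, might look roughly like the sketch below. `SketchPhraseInfo` is a hypothetical class, not WeightedPhraseInfo itself; summing boosts on merge and joining terms with a space are both assumptions made for the illustration, and seqnum is left out because the comment itself says the right way to merge it is unclear.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable phrase info; all state is final and set in a constructor.
final class SketchPhraseInfo {
  private final List<String> terms;
  private final int startOffset;
  private final int endOffset;
  private final float boost;

  SketchPhraseInfo(List<String> terms, int startOffset, int endOffset, float boost) {
    this.terms = Collections.unmodifiableList(new ArrayList<>(terms));
    this.startOffset = startOffset;
    this.endOffset = endOffset;
    this.boost = boost;
  }

  // Merging constructor: builds a new instance instead of mutating an old one.
  SketchPhraseInfo(List<SketchPhraseInfo> toMerge) {
    if (toMerge.isEmpty()) {
      throw new IllegalArgumentException("nothing to merge");
    }
    List<String> mergedTerms = new ArrayList<>();
    int start = Integer.MAX_VALUE;
    int end = Integer.MIN_VALUE;
    float sum = 0f;
    for (SketchPhraseInfo info : toMerge) {
      mergedTerms.addAll(info.terms);
      start = Math.min(start, info.startOffset);
      end = Math.max(end, info.endOffset);
      sum += info.boost; // assumption: merged boost is the sum of the parts
    }
    this.terms = Collections.unmodifiableList(mergedTerms);
    this.startOffset = start;
    this.endOffset = end;
    this.boost = sum;
  }

  // Debugging text is derived from the stored terms only when asked for.
  String getText() {
    return String.join(" ", terms);
  }

  int getStartOffset() { return startOffset; }

  int getEndOffset() { return endOffset; }

  float getBoost() { return boost; }
}
```

Building the text lazily means a merge never pays for string concatenation unless someone actually prints the object, which is the cost concern the comment raises.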
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.patch Fix all issues except the text on WeightedPhraseInfo. If we're ok with building it on the fly then I'll get to that in the morning. I can't get the patch to apply cleanly - something to do with moving a file and then changing its contents. The closest I can come is:
svn mv lucene/core/src/java/org/apache/lucene/index/MergedIterator.java lucene/core/src/java/org/apache/lucene/util/
patch -f -p0 < ~/LUCENE-5274.patch
svn add lucene/core/src/test/org/apache/lucene/util/TestMergedIterator.java
I'm sure there is a better way to do this. If you get the chance please let me know. Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274.patch) Attachments: LUCENE-5274.patch
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794190#comment-13794190 ] Nik Everett commented on LUCENE-5274: I'm having a look at what I can do to pull MergedIterator into the util package and give it nice unit tests. Almost done with that and I should be able to spin another version of this patch. I'm not exactly sure of a good way to test the synonym stuff in FastVectorHighlighterTest - I don't see a mock Synonym filter. Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274.patch) Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.patch Not done yet but progress:
1. Move MergedIterator to util.
2. Add a mode to it to not remove duplicates (one extra branch per call to next).
3. Add a unit test for MergedIterator.
4. Make FieldTermStack.TermInfo, FieldPhraseList.WeightedPhraseInfo, and FieldPhraseList.WeightedPhraseInfo.Toffs consistent under equals, hashCode, and compareTo. I don't think any of them would make good hash keys but I fixed up hashCode because I fixed up equals.
5. Unit tests for point 4.
6. Use the non-duplicate-removing mode of MergedIterator in FieldPhraseList's merge methods.
7. More tests in FastVectorHighlighterTest - mostly around exact equal matches and how they affect segment sorting.
At this point this is left:
1. Unit tests for equal matches in the same FieldPhraseList.
2. Poke around with corner cases during merges. Test them in FastVectorHighlighterTest if they reflect mockable real world cases. Expand FieldPhraseListTest if they don't.
3. Remove highlighter dependency on analyzer module. Would it make sense to move PerFieldAnalyzerWrapper into core? Something else?
4. Anything else from review.
Attachments: LUCENE-5274.patch
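The MergedIterator change in the progress list above (a mode that keeps duplicates, costing one extra branch per call to next) can be illustrated with a small heap-based merge. This is a sketch under assumptions, not the Lucene class: `SketchMergedIterator` and its constructor arguments are hypothetical, and the inputs are assumed to be sorted already.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical merge of already-sorted iterators with optional duplicate removal.
final class SketchMergedIterator<T extends Comparable<T>> implements Iterator<T> {

  // One entry per input: its current head value plus the rest of the input.
  private static final class Head<E extends Comparable<E>> implements Comparable<Head<E>> {
    E value;
    final Iterator<E> rest;
    final int index; // tie-breaker so equal values come out in input order

    Head(E value, Iterator<E> rest, int index) {
      this.value = value;
      this.rest = rest;
      this.index = index;
    }

    @Override
    public int compareTo(Head<E> other) {
      int c = value.compareTo(other.value);
      return c != 0 ? c : Integer.compare(index, other.index);
    }
  }

  private final PriorityQueue<Head<T>> queue = new PriorityQueue<>();
  private final boolean removeDuplicates;

  SketchMergedIterator(boolean removeDuplicates, List<Iterator<T>> inputs) {
    this.removeDuplicates = removeDuplicates;
    int index = 0;
    for (Iterator<T> input : inputs) {
      if (input.hasNext()) {
        queue.add(new Head<>(input.next(), input, index));
      }
      index++;
    }
  }

  @Override
  public boolean hasNext() {
    return !queue.isEmpty();
  }

  @Override
  public T next() {
    if (queue.isEmpty()) {
      throw new NoSuchElementException();
    }
    Head<T> top = queue.poll();
    T result = top.value;
    advance(top);
    // The one extra branch: only in duplicate-removing mode skip equal heads.
    if (removeDuplicates) {
      while (!queue.isEmpty() && queue.peek().value.compareTo(result) == 0) {
        advance(queue.poll());
      }
    }
    return result;
  }

  // Moves an input to its next value and puts it back on the heap if any remain.
  private void advance(Head<T> head) {
    if (head.rest.hasNext()) {
      head.value = head.rest.next();
      queue.add(head);
    }
  }
}
```

Duplicate removal here relies on compareTo returning 0 for equal elements, which is one reason point 4 in the list matters: if equals, hashCode and compareTo disagree, the two modes can behave in surprising ways.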
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274.patch) Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.patch Removed analyzer dependency and added tests covering synonyms. Attachments: LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274-4.patch Reworked to remove dependency on query parser and most of the analyzer dependency and to fix errors with phrases. It'll need to lose the rest of the analyzer dependency and have more test cases in addition to any other concerns raised in the review. Attachments: LUCENE-5274-4.patch, LUCENE-5274.patch
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274.patch)
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: (was: LUCENE-5274-4.patch)
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.patch New version of the patch. This one works a lot better with phrases and even works on fields that have the same source but different tokenizers. It still makes highlighting depend on the analysis module to pick up PerFieldAnalyzerWrapper. I think all the new code this adds to FieldPhraseList deserves a unit test on its own but I'm not in the frame of mind to write one at the moment so I'll have to come back to it later.
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792913#comment-13792913 ] Nik Everett commented on LUCENE-5274: Hey, forgot to mention that. MockTokenizer seems to throw away the character after the end of each token even if that character is the valid start to the next token. This comes up because I wanted to tokenize strings in a simplistic way to test that the highlighter can handle different tokenizers and it just wasn't working right. So I fixed MockTokenizer but I did it in a pretty brutal way. I'm happy to move the change to another bug and improve it but testing the highlighter change without it is a bit painful.
[jira] [Created] (LUCENE-5278) MockTokenizer throws away the character right after a token even if it is a valid start to a new token
Nik Everett created LUCENE-5278: Summary: MockTokenizer throws away the character right after a token even if it is a valid start to a new token Key: LUCENE-5278 URL: https://issues.apache.org/jira/browse/LUCENE-5278 Project: Lucene - Core Issue Type: Bug Reporter: Nik Everett Priority: Trivial MockTokenizer throws away the character right after a token even if it is a valid start to a new token. You won't see this unless you build a tokenizer that can recognize every character like with new RegExp(.) or RegExp(...). Changing this behaviour seems to break a number of tests.
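The behaviour reported in LUCENE-5278 can be illustrated with a small stand-alone model. This is only a sketch and not MockTokenizer's code: the class name, the fixed token length standing in for a pattern such as RegExp(...), and the pushBack flag are all assumptions made for the illustration. It shows what happens when the character that ends a token is discarded rather than reconsidered as the start of the next token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a tokenizer whose tokens are exactly tokenLength
// characters long. NOT Lucene's MockTokenizer; names are illustrative only.
class TokenBoundarySketch {

    /**
     * Splits text into tokens of exactly tokenLength characters.
     * pushBack=false models the reported bug: the character that ends a token is dropped.
     * pushBack=true models the desired behaviour: that character starts the next token.
     */
    public static List<String> tokenize(String text, int tokenLength, boolean pushBack) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (current.length() < tokenLength) {
                // c still extends the current token
                current.append(c);
                i++;
            } else {
                // c cannot extend the token, so the token ends here
                tokens.add(current.toString());
                current.setLength(0);
                if (!pushBack) {
                    i++; // buggy path: c is consumed and lost
                }
                // fixed path: i is unchanged, so c is read again as a token start
            }
        }
        if (current.length() == tokenLength) {
            tokens.add(current.toString()); // emit a complete trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abcdefg", 3, false)); // [abc, efg]
        System.out.println(tokenize("abcdefg", 3, true));  // [abc, def]
        System.out.println(tokenize("abcd", 1, false));    // [a, c]
        System.out.println(tokenize("abcd", 1, true));     // [a, b, c, d]
    }
}
```

With ordinary word-style patterns the dropped character is a separator anyway, which would explain why the problem only shows up with patterns that accept every character.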
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792974#comment-13792974 ] Nik Everett commented on LUCENE-5274: Filed LUCENE-5278.
[jira] [Updated] (LUCENE-5278) MockTokenizer throws away the character right after a token even if it is a valid start to a new token
[ https://issues.apache.org/jira/browse/LUCENE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5278: Attachment: LUCENE-5278.patch This patch fixes the behaviour from my perspective but breaks a bunch of other tests.
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793018#comment-13793018 ] Nik Everett commented on LUCENE-5274:
{quote} I can see the possible use case here, but I think it deserves some discussion first (versus just making it public). {quote}
Sure! I'm more used to Guava's tools so I think I was lulled into a false sense of recognition. No chance of updating to a modern version of Guava? :)
{quote} This thing has limitations (its currently only used by indexwriter for buffereddeletes, its basically like a MultiTerms over an Iterator). For example each iterator it consumes should not have duplicate values according to its compareTo(): its not clear to me this WeightedPhraseInfo behaves this way {quote}
Yikes! I didn't catch that, but now that you point it out, it is right there in the docs and I should have. WeightedPhraseInfo doesn't behave that way and
{quote} Furthermore the class in question (WeightedPhraseInfo) is public, and adding Comparable to it looks like it will create a situation where its inconsistent with equals()... I think this is a little dangerous. {quote}
I agree on the inconsistency with equals. I can either fix that or use a Comparator for sorting both WeightedPhraseInfo and Toffs. That'd require a MergeSorter that can take one but
{quote} If it turns out we can reuse it: great! But i think rather than just slapping public on it, we should move it to .util, ensure it has good javadocs and unit tests, and investigate what exactly happens when these contracts are violated: e.g. can we make an exception happen rather than just broken behavior in a way that won't hurt performance and so on? {quote}
Makes sense to me.
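The concern about a Comparable that is inconsistent with equals() can be shown with plain JDK collections. The sketch below is an assumption-laden illustration: Match is a hypothetical stand-in, not Lucene's WeightedPhraseInfo, and ordering by start offset alone is chosen only to make the effect visible. Two distinct matches that compare as equal collapse into one entry in a sorted set, which is the kind of silently broken behaviour the review comment warns about.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; Match is NOT Lucene's WeightedPhraseInfo.
class InconsistentCompareSketch {

    static final class Match implements Comparable<Match> {
        final int startOffset;
        final String text;

        Match(int startOffset, String text) {
            this.startOffset = startOffset;
            this.text = text;
        }

        // Orders by start offset only. equals()/hashCode() are deliberately
        // left as identity, so compareTo() == 0 does not imply equals().
        @Override
        public int compareTo(Match other) {
            return Integer.compare(startOffset, other.startOffset);
        }
    }

    /** Number of elements a compareTo-based set retains. */
    public static int sortedSetSize(Match... matches) {
        Set<Match> set = new TreeSet<>();
        for (Match m : matches) {
            set.add(m);
        }
        return set.size();
    }

    /** Number of elements an equals/hashCode-based set retains. */
    public static int hashSetSize(Match... matches) {
        Set<Match> set = new HashSet<>();
        for (Match m : matches) {
            set.add(m);
        }
        return set.size();
    }

    public static void main(String[] args) {
        Match a = new Match(0, "running");
        Match b = new Match(0, "run"); // distinct match, same start offset
        System.out.println(sortedSetSize(a, b)); // 1: b is treated as a duplicate of a
        System.out.println(hashSetSize(a, b));   // 2: equals() sees two different objects
    }
}
```

Passing an explicit Comparator to the sorting or merging code, as suggested in the comment, keeps the ordering out of the public class's contract.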
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793038#comment-13793038 ] Nik Everett commented on LUCENE-5274:
{quote} There is no lucene dependency on guava. I don't think we should introduce one, and it wouldnt solve the issues i mentioned anyway (e.g. comparable inconsistent with equals and stuff). It would only add 2.1MB of bloated unnecessary syntactic sugar (sorry, thats just my opinion on it, i think its useless). We should keep our third party dependencies minimal and necessary so that any app using lucene can choose for itself what version of this stuff (if any) it wants to use. If we rely upon unnecessary stuff it hurts the end user by forcing them to compatible versions. {quote}
I figured that was the reasoning and I don't intend to argue with it. In this case it would provide a method to merge sorted iterators just like MergedIterator, only without the caveats around duplication, but I'm happy to work around it. Guava certainly wouldn't fix my forgetting equals and hashCode.
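A merge of sorted inputs that takes a Comparator and keeps duplicates, which is the workaround discussed above, needs nothing beyond the JDK. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and it is neither Lucene's MergedIterator nor the code in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a k-way merge driven by a Comparator.
class KeepDuplicatesMergeSketch {

    /** One sorted input plus the element currently at its front. */
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(Iterator<T> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /**
     * Merges already-sorted lists into one sorted list. Elements that compare
     * as equal are all kept, so the inputs may contain duplicates.
     */
    public static <T> List<T> mergeSorted(Comparator<? super T> order, List<List<T>> inputs) {
        Comparator<Cursor<T>> byHead = (a, b) -> order.compare(a.head, b.head);
        PriorityQueue<Cursor<T>> queue = new PriorityQueue<>(Math.max(1, inputs.size()), byHead);
        for (List<T> input : inputs) {
            Iterator<T> it = input.iterator();
            if (it.hasNext()) {
                queue.add(new Cursor<>(it));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor<T> smallest = queue.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // advance this input and put it back in the queue
                smallest.head = smallest.rest.next();
                queue.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> inputs = Arrays.asList(Arrays.asList(1, 3, 5), Arrays.asList(1, 2, 5));
        System.out.println(mergeSorted(Comparator.<Integer>naturalOrder(), inputs)); // [1, 1, 2, 3, 5, 5]
    }
}
```

Because the ordering is supplied from outside, the merged element type does not have to implement Comparable at all, which sidesteps the equals() consistency question for a public class.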
[jira] [Created] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
Nik Everett created LUCENE-5274: Summary: Teach fast FastVectorHighlighter to highlight child fields with parent fields Key: LUCENE-5274 URL: https://issues.apache.org/jira/browse/LUCENE-5274 Project: Lucene - Core Issue Type: Improvement Components: core/other Reporter: Nik Everett Priority: Minor I've been messing around with the FastVectorHighlighter and it looks like I can teach it to highlight matches on child fields. Like this query: foo:scissors foo_exact:running would highlight foo like this: <em>running</em> with <em>scissors</em> Where foo is stored WITH_POSITIONS_OFFSETS and foo_plain is an unstored copy of foo with a different analyzer and its own WITH_POSITIONS_OFFSETS. This would make queries that perform weighted matches against different analyzers much more convenient to highlight. I have working code and test cases but they are hacked into Elasticsearch. I'd love to Lucene-ify if you'll take them.
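The idea in the description, matches found in a differently analyzed copy of a field being marked up in the one stored text, can be sketched without any Lucene classes. Everything below is hypothetical: Hit, highlight and the hard-coded offsets are illustrative assumptions, not the patch or the FastVectorHighlighter API. It only shows character offsets from two fields being combined and wrapped in <em> tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not the LUCENE-5274 patch or any Lucene API.
class ChildFieldHighlightSketch {

    /** A matched span in the stored text: [start, end) character offsets. */
    static final class Hit {
        final int start;
        final int end;

        Hit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Wraps every hit in em tags. parentHits and childHits are assumed to be
     * offsets into the same stored text, produced by two different analyzers.
     */
    public static String highlight(String stored, List<Hit> parentHits, List<Hit> childHits) {
        List<Hit> all = new ArrayList<>(parentHits);
        all.addAll(childHits);
        all.sort(Comparator.comparingInt((Hit h) -> h.start));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Hit h : all) {
            if (h.start < cursor) {
                continue; // overlaps an earlier hit; this sketch simply skips it
            }
            out.append(stored, cursor, h.start)
               .append("<em>").append(stored, h.start, h.end).append("</em>");
            cursor = h.end;
        }
        out.append(stored, cursor, stored.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String stored = "running with scissors";
        List<Hit> fooHits = Collections.singletonList(new Hit(13, 21));    // foo:scissors
        List<Hit> fooExactHits = Collections.singletonList(new Hit(0, 7)); // foo_exact:running
        System.out.println(highlight(stored, fooHits, fooExactHits));
        // <em>running</em> with <em>scissors</em>
    }
}
```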
[jira] [Updated] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5274: Attachment: LUCENE-5274.patch Patch implementing merging highlights on child fields.
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791715#comment-13791715 ] Nik Everett commented on LUCENE-5274: I've uploaded a patch for this. I made the highlighter module depend on the query string parser and analyzer modules for testing. I probably could have gotten away without the query string parser but it made the test cases simpler to write. The analyzer module was required to analyze different fields with different analyzers which is kind of the point of this feature. My ant-foo is too weak for me to be sure I didn't set up some kind of horrible circular dependency that hasn't hit me.
[jira] [Commented] (LUCENE-5274) Teach fast FastVectorHighlighter to highlight child fields with parent fields
[ https://issues.apache.org/jira/browse/LUCENE-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791945#comment-13791945 ] Nik Everett commented on LUCENE-5274:
> We tend to avoid doing that in order to not have cross or circular dependencies between modules. This is not an issue at this stage of the patch but we should try replacing the analysis components you are using with MockAnalyzer at some point.
The PerFieldAnalyzerWrapper was the thing that pulled me there. I'd appreciate some tips on how to work around that. I'll have a look at removing the query parser dependency. I'm also using the EnglishAnalyzer but I'm just using that to have a third analyzer in the mix. I'll see about using MockAnalyzer for that too.
> I only had a quick look at the patch so far and I'm a bit unsure about childFields. Maybe it would be better API-wise to specify the stored field and the index fields separately? Or maybe to retrieve the index fields from the terms of the query? What do you think?
I don't like retrieving the indexed fields from the query. What if you don't want them all? How can you make sure that the ones you take from the query really do share the same stored copy? As far as calling out the stored field and the indexed field separately, I think I like the idea. It'd let you load the source from a field that isn't actively being highlighted. I'll have a look at that.
[jira] [Commented] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated
[ https://issues.apache.org/jira/browse/LUCENE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778646#comment-13778646 ] Nik Everett commented on LUCENE-5245: Thanks for jumping on this so quickly! ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated - Key: LUCENE-5245 URL: https://issues.apache.org/jira/browse/LUCENE-5245 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.4 Reporter: Nik Everett Assignee: Uwe Schindler Fix For: 5.0, 4.6 Attachments: LUCENE-5245.patch, LUCENE-5245.patch, LUCENE-5245.patch ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated. This dramatically changes the resulting score which is bad when comparing scores across different Lucene indexes/shards/whatever. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated
Nik Everett created LUCENE-5245: Summary: ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated Key: LUCENE-5245 URL: https://issues.apache.org/jira/browse/LUCENE-5245 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.4 Reporter: Nik Everett ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated. This dramatically changes the resulting score which is bad when comparing scores across different Lucene indexes/shards/whatever.
[jira] [Commented] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated
[ https://issues.apache.org/jira/browse/LUCENE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778093#comment-13778093 ] Nik Everett commented on LUCENE-5245: The query norm applied to the constant score query changes. Say I had a query string like foo:findm*^20 bar:findm* and only foo had a result on shard 1 and only bar had a result on shard 2. Both end up with the same score because on shard one the query is rewritten to foo:findm*^20 (norm = .05) and on shard two to bar:findm* (norm = 1).
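The numbers in this comment can be reproduced with a few lines of arithmetic. The sketch below assumes the classic TF-IDF style query normalization, norm = 1 / sqrt(sum of squared clause boosts), with a constant-score clause scoring boost * norm, and it ignores coord and every other scoring factor. The class and method names are hypothetical and no Lucene code is involved.

```java
// Hypothetical arithmetic sketch; assumes norm = 1 / sqrt(sum of squared boosts)
// and score = boost * norm for a constant-score clause, ignoring coord.
class QueryNormSketch {

    /** Query norm computed over the boosts of the clauses that survive rewriting. */
    public static double queryNorm(double... boosts) {
        double sumOfSquares = 0.0;
        for (double boost : boosts) {
            sumOfSquares += boost * boost;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    /** Score contributed by one constant-score clause. */
    public static double constantScore(double boost, double norm) {
        return boost * norm;
    }

    public static void main(String[] args) {
        // Shard 1: only foo:findm*^20 survives the early rewrite.
        double shard1Norm = queryNorm(20.0);                   // 0.05
        System.out.println(constantScore(20.0, shard1Norm));   // about 1.0

        // Shard 2: only bar:findm* survives the early rewrite.
        double shard2Norm = queryNorm(1.0);                    // 1.0
        System.out.println(constantScore(1.0, shard2Norm));    // 1.0

        // If both clauses were kept when the norm is computed, the boost would show:
        double sharedNorm = queryNorm(20.0, 1.0);              // 1 / sqrt(401)
        System.out.println(constantScore(20.0, sharedNorm));   // about 0.9988
        System.out.println(constantScore(1.0, sharedNorm));    // about 0.0499
    }
}
```

Under these assumptions the boost of 20 cancels out whenever its clause is the only one left when the norm is computed, which is why the two shards report the same score.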
[jira] [Updated] (LUCENE-5245) ConstantScoreAutoRewrite rewrites prefix queryies that don't match anything before query weight is calculated
[ https://issues.apache.org/jira/browse/LUCENE-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nik Everett updated LUCENE-5245: Attachment: LUCENE-5245.patch This fixes my problem but I'm not sure how to set up unit tests in Lucene.