[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2017-04-09 Thread Timothy055
Github user Timothy055 closed the pull request at:

https://github.com/apache/lucene-solr/pull/105


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85619786
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java
 ---
@@ -65,58 +65,88 @@ public String getField() {
*/
   public abstract List getOffsetsEnums(IndexReader reader, 
int docId, String content) throws IOException;
 
-  protected List createOffsetsEnums(LeafReader leafReader, 
int doc, TokenStream tokenStream) throws IOException {
-List offsetsEnums = 
createOffsetsEnumsFromReader(leafReader, doc);
-if (automata.length > 0) {
-  offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream));
+  protected List createOffsetsEnumsFromReader(LeafReader 
leafReader, int doc) throws IOException {
+final Terms termsIndex = leafReader.terms(field);
+if (termsIndex == null) {
+  return Collections.emptyList();
 }
-return offsetsEnums;
-  }
 
-  protected List createOffsetsEnumsFromReader(LeafReader 
atomicReader, int doc) throws IOException {
 // For strict positions, get a Map of term to Spans:
 //note: ScriptPhraseHelper.NONE does the right thing for these 
method calls
 final Map strictPhrasesTermToSpans =
-strictPhrases.getTermToSpans(atomicReader, doc);
+phraseHelper.getTermToSpans(leafReader, doc);
 // Usually simply wraps terms in a List; but if willRewrite() then can 
be expanded
 final List sourceTerms =
-strictPhrases.expandTermsIfRewrite(terms, 
strictPhrasesTermToSpans);
+phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans);
 
-final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + 1);
+final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + automata.length);
 
-Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? 
null : atomicReader.terms(field);
-if (termsIndex != null) {
+// Handle sourceTerms:
+if (!sourceTerms.isEmpty()) {
   TermsEnum termsEnum = termsIndex.iterator();//does not return null
   for (BytesRef term : sourceTerms) {
-if (!termsEnum.seekExact(term)) {
-  continue; // term not found
-}
-PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
-if (postingsEnum == null) {
-  // no offsets or positions available
-  throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
-}
-if (doc != postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
-  continue;
+if (termsEnum.seekExact(term)) {
+  PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
+
+  if (postingsEnum == null) {
+// no offsets or positions available
+throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
+  }
+
+  if (doc == postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
+postingsEnum = phraseHelper.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
+if (postingsEnum != null) {
+  offsetsEnums.add(new OffsetsEnum(term, postingsEnum));
+}
+  }
 }
-postingsEnum = strictPhrases.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
-if (postingsEnum == null) {
-  continue;// completely filtered out
+  }
+}
+
+// Handle automata
+if (automata.length > 0) {
+  offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc));
+}
+
+return offsetsEnums;
+  }
+
+  protected List createAutomataOffsetsFromTerms(Terms 
termsIndex, int doc) throws IOException {
+Map automataPostings = new 
IdentityHashMap<>(automata.length);
--- End diff --

Pushed that for now to see what you think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85618489
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
+
+  private static final int NO_MORE_POSITIONS = -2;
+  private final BytesRef term;
+  private final int freq;
+  private final PriorityQueue queue;
+
+
+  /**
+   * This class is used to ensure we don't over iterate the underlying
+   * postings enum by keeping track of the position relative to the
+   * frequency.
+   * Ideally this would've been an implementation of a PostingsEnum
+   * but it would have to delegate most methods and it seemed easier
+   * to just wrap the tweaked method.
+   */
+  private static final class BoundsCheckingPostingsEnum {
+
+
+private final PostingsEnum postingsEnum;
+private final int freq;
--- End diff --

Hmm, did you mean to calculate freq dynamically in the method and just use 
a remainingPositions count as walking over the postings? Instead of 
re-computing it's cached in the constructor, but if we modified it, the freq 
count would change as the postings were iterated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85616894
  
--- Diff: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java
 ---
@@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() {
   @Test
   public void testUnifiedHighlighterExtensibility() {
 final int maxLength = 1000;
-UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(random())){
+UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(new Random())){
--- End diff --

I'm a bit confused.  What I should I change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85616651
  
--- Diff: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java
 ---
@@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() {
   @Test
   public void testUnifiedHighlighterExtensibility() {
 final int maxLength = 1000;
-UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(random())){
+UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(new Random())){
--- End diff --

I recall now why I changed it.  I started getting this error and figured it 
was a change elsewhere: java.lang.IllegalStateException: No context information 
for thread: Thread[id=1, name=main, state=RUNNABLE, group=main]. Is this thread 
running under a class com.carrotsearch.randomizedtesting.RandomizedRunner 
runner context? Add @RunWith(class 
com.carrotsearch.randomizedtesting.RandomizedRunner.class) to your test class. 
Make sure your code accesses random contexts within @BeforeClass and 
@AfterClass boundary (for example, static test class initializers are not 
permitted to access random contexts).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85614277
  
--- Diff: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java
 ---
@@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() {
   @Test
   public void testUnifiedHighlighterExtensibility() {
 final int maxLength = 1000;
-UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(random())){
+UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(new Random())){
--- End diff --

fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85613814
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java
 ---
@@ -65,58 +65,88 @@ public String getField() {
*/
   public abstract List getOffsetsEnums(IndexReader reader, 
int docId, String content) throws IOException;
 
-  protected List createOffsetsEnums(LeafReader leafReader, 
int doc, TokenStream tokenStream) throws IOException {
-List offsetsEnums = 
createOffsetsEnumsFromReader(leafReader, doc);
-if (automata.length > 0) {
-  offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream));
+  protected List createOffsetsEnumsFromReader(LeafReader 
leafReader, int doc) throws IOException {
+final Terms termsIndex = leafReader.terms(field);
+if (termsIndex == null) {
+  return Collections.emptyList();
 }
-return offsetsEnums;
-  }
 
-  protected List createOffsetsEnumsFromReader(LeafReader 
atomicReader, int doc) throws IOException {
 // For strict positions, get a Map of term to Spans:
 //note: ScriptPhraseHelper.NONE does the right thing for these 
method calls
 final Map strictPhrasesTermToSpans =
-strictPhrases.getTermToSpans(atomicReader, doc);
+phraseHelper.getTermToSpans(leafReader, doc);
 // Usually simply wraps terms in a List; but if willRewrite() then can 
be expanded
 final List sourceTerms =
-strictPhrases.expandTermsIfRewrite(terms, 
strictPhrasesTermToSpans);
+phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans);
 
-final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + 1);
+final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + automata.length);
 
-Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? 
null : atomicReader.terms(field);
-if (termsIndex != null) {
+// Handle sourceTerms:
+if (!sourceTerms.isEmpty()) {
   TermsEnum termsEnum = termsIndex.iterator();//does not return null
   for (BytesRef term : sourceTerms) {
-if (!termsEnum.seekExact(term)) {
-  continue; // term not found
-}
-PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
-if (postingsEnum == null) {
-  // no offsets or positions available
-  throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
-}
-if (doc != postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
-  continue;
+if (termsEnum.seekExact(term)) {
+  PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
+
+  if (postingsEnum == null) {
+// no offsets or positions available
+throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
+  }
+
+  if (doc == postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
+postingsEnum = phraseHelper.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
+if (postingsEnum != null) {
+  offsetsEnums.add(new OffsetsEnum(term, postingsEnum));
+}
+  }
 }
-postingsEnum = strictPhrases.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
-if (postingsEnum == null) {
-  continue;// completely filtered out
+  }
+}
+
+// Handle automata
+if (automata.length > 0) {
+  offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc));
+}
+
+return offsetsEnums;
+  }
+
+  protected List createAutomataOffsetsFromTerms(Terms 
termsIndex, int doc) throws IOException {
+Map automataPostings = new 
IdentityHashMap<>(automata.length);
--- End diff --

How about List that should give better locality without 
lose of type safety?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85613130
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java
 ---
@@ -65,58 +65,88 @@ public String getField() {
*/
   public abstract List getOffsetsEnums(IndexReader reader, 
int docId, String content) throws IOException;
 
-  protected List createOffsetsEnums(LeafReader leafReader, 
int doc, TokenStream tokenStream) throws IOException {
-List offsetsEnums = 
createOffsetsEnumsFromReader(leafReader, doc);
-if (automata.length > 0) {
-  offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream));
+  protected List createOffsetsEnumsFromReader(LeafReader 
leafReader, int doc) throws IOException {
+final Terms termsIndex = leafReader.terms(field);
+if (termsIndex == null) {
+  return Collections.emptyList();
 }
-return offsetsEnums;
-  }
 
-  protected List createOffsetsEnumsFromReader(LeafReader 
atomicReader, int doc) throws IOException {
 // For strict positions, get a Map of term to Spans:
 //note: ScriptPhraseHelper.NONE does the right thing for these 
method calls
 final Map strictPhrasesTermToSpans =
-strictPhrases.getTermToSpans(atomicReader, doc);
+phraseHelper.getTermToSpans(leafReader, doc);
 // Usually simply wraps terms in a List; but if willRewrite() then can 
be expanded
 final List sourceTerms =
-strictPhrases.expandTermsIfRewrite(terms, 
strictPhrasesTermToSpans);
+phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans);
 
-final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + 1);
+final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + automata.length);
 
-Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? 
null : atomicReader.terms(field);
-if (termsIndex != null) {
+// Handle sourceTerms:
+if (!sourceTerms.isEmpty()) {
   TermsEnum termsEnum = termsIndex.iterator();//does not return null
   for (BytesRef term : sourceTerms) {
-if (!termsEnum.seekExact(term)) {
-  continue; // term not found
-}
-PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
-if (postingsEnum == null) {
-  // no offsets or positions available
-  throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
-}
-if (doc != postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
-  continue;
+if (termsEnum.seekExact(term)) {
+  PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
+
+  if (postingsEnum == null) {
+// no offsets or positions available
+throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
+  }
+
+  if (doc == postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
+postingsEnum = phraseHelper.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
+if (postingsEnum != null) {
+  offsetsEnums.add(new OffsetsEnum(term, postingsEnum));
+}
+  }
 }
-postingsEnum = strictPhrases.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
-if (postingsEnum == null) {
-  continue;// completely filtered out
+  }
+}
+
+// Handle automata
+if (automata.length > 0) {
+  offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc));
+}
+
+return offsetsEnums;
+  }
+
+  protected List createAutomataOffsetsFromTerms(Terms 
termsIndex, int doc) throws IOException {
+Map automataPostings = new 
IdentityHashMap<>(automata.length);
--- End diff --

One minor problem with that was that the code would not longer by type-safe 
because of the lack of generic arrays in java.  I wouldn't be able to do 
`List[]` = new ArrayList[automata.length];` but 
could do `List[]` = new ArrayList[automata.length];` with 
unchecked casts.  Seem worth it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-

[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85612011
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/AnalysisOffsetStrategy.java
 ---
@@ -17,174 +17,28 @@
 package org.apache.lucene.search.uhighlight;
 
 import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.List;
 
 import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.FilteringTokenFilter;
 import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.LeafReader;
-import org.apache.lucene.index.Terms;
-import org.apache.lucene.index.memory.MemoryIndex;
-import org.apache.lucene.search.spans.SpanQuery;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.automaton.Automata;
 import org.apache.lucene.util.automaton.CharacterRunAutomaton;
 
+public abstract class AnalysisOffsetStrategy extends FieldOffsetStrategy {
--- End diff --

Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
Github user Timothy055 commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85611978
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
+
+  private static final int NO_MORE_POSITIONS = -2;
+  private final BytesRef term;
+  private final int freq;
+  private final PriorityQueue queue;
+
+
+  /**
+   * This class is used to ensure we don't over iterate the underlying
+   * postings enum by keeping track of the position relative to the
+   * frequency.
+   * Ideally this would've been an implementation of a PostingsEnum
+   * but it would have to delegate most methods and it seemed easier
+   * to just wrap the tweaked method.
+   */
+  private static final class BoundsCheckingPostingsEnum {
+
+
+private final PostingsEnum postingsEnum;
+private final int freq;
+private int position;
+private int nextPosition;
+private int positionInc = 1;
+
+private int startOffset;
+private int endOffset;
+
+BoundsCheckingPostingsEnum(PostingsEnum postingsEnum) throws 
IOException {
+  this.postingsEnum = postingsEnum;
+  this.freq = postingsEnum.freq();
+  nextPosition = postingsEnum.nextPosition();
+  position = nextPosition;
+  startOffset = postingsEnum.startOffset();
+  endOffset = postingsEnum.endOffset();
+}
+
+private boolean hasMorePositions() throws IOException {
+  return positionInc < freq;
+}
+
+/**
+ * Returns the next position of the underlying postings enum unless
+ * it cannot iterate further and returns NO_MORE_POSITIONS;
+ * @return
+ * @throws IOException
+ */
+private int nextPosition() throws IOException {
+  position = nextPosition;
+  startOffset = postingsEnum.startOffset();
+  endOffset = postingsEnum.endOffset();
+  if (hasMorePositions()) {
+positionInc++;
+nextPosition = postingsEnum.nextPosition();
+  } else {
+nextPosition = NO_MORE_POSITIONS;
+  }
+  return position;
+}
+
+  }
+
+  CompositePostingsEnum(BytesRef term, List postingsEnums) 
throws IOException {
+this.term = term;
+queue = new 
PriorityQueue(postingsEnums.size()) {
+  @Override
+  protected boolean lessThan(BoundsCheckingPostingsEnum a, 
BoundsCheckingPostingsEnum b) {
+return a.position < b.position;
+  }
+};
+
+int freqAdd = 0;
+for (PostingsEnum postingsEnum : postingsEnums) {
+  queue.add(new BoundsCheckingPostingsEnum(postingsEnum));
+  freqAdd += postingsEnum.freq();
+}
+freq = freqAdd;
+  }
+
+  @Override
+  public int freq() throws IOException {
+return freq;
+  }
+
+  @Override
+  public int nextPosition() throws IOException {
+int position = NO_MORE_POSITIONS;
+while (queue.size() >= 1) {
+  queue.top().nextPosition();
+  queue.updateTop(); //the new position may be behind another 
postingsEnum in the queue
+  position = queue.top().position;
+
+  if (position == NO_MORE_POSITIONS) {
+queue.pop(); //this postingsEnum is consumed, let's get rid of it
+  } else {
+break; //we got a new position
+  }
+
+}
+return position;
+  }
+
+  @Override
+  public int startOffset() throws IOException {
+return 

[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85606885
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java ---
@@ -40,7 +40,7 @@
 BytesRef matchTerms[] = new BytesRef[8];
 int numMatches = 0;
 
-void addMatch(int startOffset, int endOffset, BytesRef term) {
+public void addMatch(int startOffset, int endOffset, BytesRef term) {
--- End diff --

Excellent; now it's possible for someone to override 
`FieldHighlighter.highlightOffsetsEnums()` to make Passages.  But do add 
@lucene.experimental.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85607859
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java
 ---
@@ -116,6 +116,8 @@
 
   private boolean defaultHighlightPhrasesStrictly = true; // AKA 
"accuracy" or "query debugging"
 
+  private boolean defaultPassageRelevancyOverSpeed = true; //Prefer using 
a memory index
--- End diff --

Suggest the comment be: For analysis, prefer MemoryIndex approach


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85603812
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
+
+  private static final int NO_MORE_POSITIONS = -2;
+  private final BytesRef term;
+  private final int freq;
+  private final PriorityQueue queue;
+
+
+  /**
+   * This class is used to ensure we don't over iterate the underlying
+   * postings enum by keeping track of the position relative to the
+   * frequency.
+   * Ideally this would've been an implementation of a PostingsEnum
+   * but it would have to delegate most methods and it seemed easier
+   * to just wrap the tweaked method.
+   */
+  private static final class BoundsCheckingPostingsEnum {
+
+
+private final PostingsEnum postingsEnum;
+private final int freq;
+private int position;
+private int nextPosition;
+private int positionInc = 1;
+
+private int startOffset;
+private int endOffset;
+
+BoundsCheckingPostingsEnum(PostingsEnum postingsEnum) throws 
IOException {
+  this.postingsEnum = postingsEnum;
+  this.freq = postingsEnum.freq();
+  nextPosition = postingsEnum.nextPosition();
+  position = nextPosition;
+  startOffset = postingsEnum.startOffset();
+  endOffset = postingsEnum.endOffset();
+}
+
+private boolean hasMorePositions() throws IOException {
+  return positionInc < freq;
+}
+
+/**
+ * Returns the next position of the underlying postings enum unless
+ * it cannot iterate further and returns NO_MORE_POSITIONS;
+ * @return
+ * @throws IOException
+ */
+private int nextPosition() throws IOException {
+  position = nextPosition;
+  startOffset = postingsEnum.startOffset();
+  endOffset = postingsEnum.endOffset();
+  if (hasMorePositions()) {
+positionInc++;
+nextPosition = postingsEnum.nextPosition();
+  } else {
+nextPosition = NO_MORE_POSITIONS;
+  }
+  return position;
+}
+
+  }
+
+  CompositePostingsEnum(BytesRef term, List postingsEnums) 
throws IOException {
+this.term = term;
+queue = new 
PriorityQueue(postingsEnums.size()) {
+  @Override
+  protected boolean lessThan(BoundsCheckingPostingsEnum a, 
BoundsCheckingPostingsEnum b) {
+return a.position < b.position;
--- End diff --

In the event the positions are equal (e.g. two terms a the same position in 
which the wildcard matches both), we might want to fall-back on startOffset 
then endOffset?  Or maybe simply ignore position altogether and just do 
offsets, so then you needn't even track the position?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85608645
  
--- Diff: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java
 ---
@@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() {
   @Test
   public void testUnifiedHighlighterExtensibility() {
 final int maxLength = 1000;
-UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(random())){
+UnifiedHighlighter uh = new UnifiedHighlighter(null, new 
MockAnalyzer(new Random())){
--- End diff --

? why not `random()` ?  This will likely fail precommit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85607262
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/TokenStreamOffsetStrategy.java
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.automaton.CharacterRunAutomaton;
+
+public class TokenStreamOffsetStrategy extends AnalysisOffsetStrategy {
+
+  private static final BytesRef[] ZERO_LEN_BYTES_REF_ARRAY = new 
BytesRef[0];
+
+  public TokenStreamOffsetStrategy(String field, BytesRef[] terms, 
PhraseHelper phraseHelper, CharacterRunAutomaton[] automata, Analyzer 
indexAnalyzer) {
+super(field, terms, phraseHelper, automata, indexAnalyzer);
+this.automata = convertTermsToAutomata(terms, automata);
+this.terms = ZERO_LEN_BYTES_REF_ARRAY;
+  }
+
+  @Override
+  public List getOffsetsEnums(IndexReader reader, int docId, 
String content) throws IOException {
+TokenStream tokenStream = tokenStream(content);
+PostingsEnum mtqPostingsEnum = 
MultiTermHighlighting.getDocsEnum(tokenStream, automata);
--- End diff --

I think there's a case to be made in moving ` 
MultiTermHighlighting.getDocsEnum` into this class, to thus keep the 
TokenStream aspect more isolated?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85602404
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
--- End diff --

A comment would be helpful to explain the scope/purpose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85602867
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
+
+  private static final int NO_MORE_POSITIONS = -2;
+  private final BytesRef term;
+  private final int freq;
+  private final PriorityQueue queue;
+
+
+  /**
+   * This class is used to ensure we don't over iterate the underlying
+   * postings enum by keeping track of the position relative to the
+   * frequency.
+   * Ideally this would've been an implementation of a PostingsEnum
+   * but it would have to delegate most methods and it seemed easier
+   * to just wrap the tweaked method.
+   */
+  private static final class BoundsCheckingPostingsEnum {
+
+
+private final PostingsEnum postingsEnum;
+private final int freq;
--- End diff --

Instead of holding `freq` and `nextPosition`, why not just 
`remainingPositions`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85606333
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java
 ---
@@ -65,58 +65,88 @@ public String getField() {
*/
   public abstract List getOffsetsEnums(IndexReader reader, 
int docId, String content) throws IOException;
 
-  protected List createOffsetsEnums(LeafReader leafReader, 
int doc, TokenStream tokenStream) throws IOException {
-List offsetsEnums = 
createOffsetsEnumsFromReader(leafReader, doc);
-if (automata.length > 0) {
-  offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream));
+  protected List createOffsetsEnumsFromReader(LeafReader 
leafReader, int doc) throws IOException {
+final Terms termsIndex = leafReader.terms(field);
+if (termsIndex == null) {
+  return Collections.emptyList();
 }
-return offsetsEnums;
-  }
 
-  protected List createOffsetsEnumsFromReader(LeafReader 
atomicReader, int doc) throws IOException {
 // For strict positions, get a Map of term to Spans:
 //note: ScriptPhraseHelper.NONE does the right thing for these 
method calls
 final Map strictPhrasesTermToSpans =
-strictPhrases.getTermToSpans(atomicReader, doc);
+phraseHelper.getTermToSpans(leafReader, doc);
 // Usually simply wraps terms in a List; but if willRewrite() then can 
be expanded
 final List sourceTerms =
-strictPhrases.expandTermsIfRewrite(terms, 
strictPhrasesTermToSpans);
+phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans);
 
-final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + 1);
+final List offsetsEnums = new 
ArrayList<>(sourceTerms.size() + automata.length);
 
-Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? 
null : atomicReader.terms(field);
-if (termsIndex != null) {
+// Handle sourceTerms:
+if (!sourceTerms.isEmpty()) {
   TermsEnum termsEnum = termsIndex.iterator();//does not return null
   for (BytesRef term : sourceTerms) {
-if (!termsEnum.seekExact(term)) {
-  continue; // term not found
-}
-PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
-if (postingsEnum == null) {
-  // no offsets or positions available
-  throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
-}
-if (doc != postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
-  continue;
+if (termsEnum.seekExact(term)) {
+  PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
+
+  if (postingsEnum == null) {
+// no offsets or positions available
+throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
+  }
+
+  if (doc == postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
+postingsEnum = phraseHelper.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
+if (postingsEnum != null) {
+  offsetsEnums.add(new OffsetsEnum(term, postingsEnum));
+}
+  }
 }
-postingsEnum = strictPhrases.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
-if (postingsEnum == null) {
-  continue;// completely filtered out
+  }
+}
+
+// Handle automata
+if (automata.length > 0) {
+  offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc));
+}
+
+return offsetsEnums;
+  }
+
+  protected List createAutomataOffsetsFromTerms(Terms 
termsIndex, int doc) throws IOException {
+Map automataPostings = new 
IdentityHashMap<>(automata.length);
--- End diff --

I suggest a parallel array to automata, so that later you can avoid a map 
lookup on each matching term.  Also, I suggest lazy-initializing the array 
later... perhaps some wildcards in a disjunction might never match.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional 

[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85602210
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/AnalysisOffsetStrategy.java
 ---
@@ -17,174 +17,28 @@
 package org.apache.lucene.search.uhighlight;
 
 import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.List;
 
 import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.FilteringTokenFilter;
 import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.LeafReader;
-import org.apache.lucene.index.Terms;
-import org.apache.lucene.index.memory.MemoryIndex;
-import org.apache.lucene.search.spans.SpanQuery;
 import org.apache.lucene.util.BytesRef;
-import org.apache.lucene.util.automaton.Automata;
 import org.apache.lucene.util.automaton.CharacterRunAutomaton;
 
+public abstract class AnalysisOffsetStrategy extends FieldOffsetStrategy {
--- End diff --

All public classes need a javadoc comment.  Remember lucene.internal for 
this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85603003
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
+
+  private static final int NO_MORE_POSITIONS = -2;
+  private final BytesRef term;
+  private final int freq;
+  private final PriorityQueue queue;
+
+
+  /**
+   * This class is used to ensure we don't over iterate the underlying
+   * postings enum by keeping track of the position relative to the
+   * frequency.
+   * Ideally this would've been an implementation of a PostingsEnum
+   * but it would have to delegate most methods and it seemed easier
+   * to just wrap the tweaked method.
+   */
+  private static final class BoundsCheckingPostingsEnum {
+
+
+private final PostingsEnum postingsEnum;
+private final int freq;
+private int position;
+private int nextPosition;
+private int positionInc = 1;
+
+private int startOffset;
--- End diff --

Don't need these.  They can be fetched on-demand from the head of the 
queue, easily & cheaply enough.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread dsmiley
Github user dsmiley commented on a diff in the pull request:

https://github.com/apache/lucene-solr/pull/105#discussion_r85604667
  
--- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
+
+  private static final int NO_MORE_POSITIONS = -2;
+  private final BytesRef term;
+  private final int freq;
+  private final PriorityQueue queue;
+
+
+  /**
+   * This class is used to ensure we don't over iterate the underlying
+   * postings enum by keeping track of the position relative to the
+   * frequency.
+   * Ideally this would've been an implementation of a PostingsEnum
+   * but it would have to delegate most methods and it seemed easier
+   * to just wrap the tweaked method.
+   */
+  private static final class BoundsCheckingPostingsEnum {
+
+
+private final PostingsEnum postingsEnum;
+private final int freq;
+private int position;
+private int nextPosition;
+private int positionInc = 1;
+
+private int startOffset;
+private int endOffset;
+
+BoundsCheckingPostingsEnum(PostingsEnum postingsEnum) throws 
IOException {
+  this.postingsEnum = postingsEnum;
+  this.freq = postingsEnum.freq();
+  nextPosition = postingsEnum.nextPosition();
+  position = nextPosition;
+  startOffset = postingsEnum.startOffset();
+  endOffset = postingsEnum.endOffset();
+}
+
+private boolean hasMorePositions() throws IOException {
+  return positionInc < freq;
+}
+
+/**
+ * Returns the next position of the underlying postings enum unless
+ * it cannot iterate further and returns NO_MORE_POSITIONS;
+ * @return
+ * @throws IOException
+ */
+private int nextPosition() throws IOException {
+  position = nextPosition;
+  startOffset = postingsEnum.startOffset();
+  endOffset = postingsEnum.endOffset();
+  if (hasMorePositions()) {
+positionInc++;
+nextPosition = postingsEnum.nextPosition();
+  } else {
+nextPosition = NO_MORE_POSITIONS;
+  }
+  return position;
+}
+
+  }
+
+  CompositePostingsEnum(BytesRef term, List postingsEnums) 
throws IOException {
+this.term = term;
+queue = new 
PriorityQueue(postingsEnums.size()) {
+  @Override
+  protected boolean lessThan(BoundsCheckingPostingsEnum a, 
BoundsCheckingPostingsEnum b) {
+return a.position < b.position;
+  }
+};
+
+int freqAdd = 0;
+for (PostingsEnum postingsEnum : postingsEnums) {
+  queue.add(new BoundsCheckingPostingsEnum(postingsEnum));
+  freqAdd += postingsEnum.freq();
+}
+freq = freqAdd;
+  }
+
+  @Override
+  public int freq() throws IOException {
+return freq;
+  }
+
+  @Override
+  public int nextPosition() throws IOException {
+int position = NO_MORE_POSITIONS;
+while (queue.size() >= 1) {
+  queue.top().nextPosition();
+  queue.updateTop(); //the new position may be behind another 
postingsEnum in the queue
+  position = queue.top().position;
+
+  if (position == NO_MORE_POSITIONS) {
+queue.pop(); //this postingsEnum is consumed, let's get rid of it
+  } else {
+break; //we got a new position
+  }
+
+}
+return position;
+  }
+
+  @Override
+  public int startOffset() throws IOException {
+return 

[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

2016-10-28 Thread Timothy055
GitHub user Timothy055 opened a pull request:

https://github.com/apache/lucene-solr/pull/105

LUCENE-7526 Improvements to UnifiedHighlighter OffsetStrategies

Pull request for LUCENE-7526

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Timothy055/lucene-solr master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #105


commit 02e932c4a6146363680b88f4947a693c6697c955
Author: Timothy Rodriguez 
Date:   2016-09-01T19:23:50Z

Initial fork of PostingsHighlighter for UnifiedHighlighter

commit 9d88411b3985a98851384d78d681431dba710e89
Author: Timothy Rodriguez 
Date:   2016-09-01T23:17:06Z

Initial commit of the UnifiedHighlighter for OSS contribution

commit e45e39bc4b07ea33e4423b264c2fefb9aa08777a
Author: David Smiley 
Date:   2016-09-02T12:45:49Z

Fix misc issues; "ant test" now works. (#1)

commit 046a28ef31acf4cea7d2554b827e6a714e3d
Author: Timothy Rodriguez 
Date:   2016-09-02T20:58:31Z

Minor refactoring of the AnalysisFieldHighlighter

commit ccd1a2280abd4b48cfef8122696e5d9cfd12920f
Author: David Smiley 
Date:   2016-09-03T12:55:20Z

AbstractFieldHighlighter: order methods more sensibly; renamed a couple.

commit d4714a04a3e41d5e95bbe942b275c32ed69b9c2e
Author: David Smiley 
Date:   2016-09-04T01:03:29Z

Improve javadocs and @lucene.external/internal labeling & scope.
"ant precommit" now passes.

commit e0659f18a59bf2893076da6d7643ff30f2fa5a52
Author: David Smiley 
Date:   2016-09-04T01:25:55Z

Analysis: remove dubious filter() method

commit ccd7ce707bff2c06da89b31853cca9aecea72008
Author: David Smiley 
Date:   2016-09-04T01:44:01Z

getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and 
only call filterExtractedTerms once.

commit ffc2a22c700b8abcbf87673d5d05bb3659d177c9
Author: David Smiley 
Date:   2016-09-04T15:21:08Z

UnifiedHighlighter round 2 (#2)

* AbstractFieldHighlighter: order methods more sensibly; renamed a couple.

* Improve javadocs and @lucene.external/internal labeling & scope.
"ant precommit" now passes.

* Analysis: remove dubious filter() method

* getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, 
and only call filterExtractedTerms once.

commit 5f95e05595db462d3ab5bffc68c2c92f70875072
Author: David Smiley 
Date:   2016-09-04T16:12:33Z

Refactor: FieldOffsetStrategy

commit 86fb6265fbbdb955ead6d4baf944bf708175715e
Author: David Smiley 
Date:   2016-09-04T16:21:32Z

stop passing maxPassages into highlightFieldForDoc()

commit f6fd80544eae9fab953b94b1e9346c0883f956eb
Author: David Smiley 
Date:   2016-09-04T16:12:33Z

Refactor: FieldOffsetStrategy

commit b335a673c2ce45904890c1e9af7cbfda2bd27b0f
Author: David Smiley 
Date:   2016-09-04T16:21:32Z

stop passing maxPassages into highlightFieldForDoc()

commit 478db9437b92214cbf459f82ba2e3a67c966a150
Author: David Smiley 
Date:   2016-09-04T18:29:44Z

Rename subclasses of FieldOffsetStrategy.

commit dbf4280755c11420a5032445cd618fadb7444b61
Author: David Smiley 
Date:   2016-09-04T18:31:34Z

Re-order and harmonize params on methods called by UH.getFieldHighlighter()

commit f0340e27e61dcda2e11992f08ec07a72fad6c24c
Author: David Smiley 
Date:   2016-09-04T18:53:51Z

FieldHighlighter: harmonize field/param order. And don't apply 
maxNoHighlightPasses twice.

commit 817f63c1d48fd523c13b9c40a2ae9b8a4047209a
Author: Timothy Rodriguez 
Date:   2016-09-06T20:43:20Z

Merge of renaming changes

commit 0f644a4f53c1ed4d41d562848f6fe51a87442a75
Author: Timothy Rodriguez 
Date:   2016-09-06T20:54:13Z

add visibility tests

commit 9171f49e117085e7d086267bb73836831ff07f8e
Author: Timothy Rodriguez 
Date:   2016-09-07T14:26:59Z

ADd additional extensibility test

commit 7ce488147cb811e15cb6e9125a835171157746f2
Author: Timothy Rodriguez 
Date:   2016-09-28T22:04:15Z

Reduce visibility of MultiTermHighlighting to package protected

commit 2f08465020448592b0e8750db568ade5a9218267
Author: Timothy M. Rodriguez 
Date:   2016-10-11T16:44:29Z

Initial commit that will use memory index to generate offsets enum if the 
tokenstream is null

commit 357f3dfb9ace4deef20787af19bc2e5a6b4ff61e
Author: Timothy M. Rodriguez