[GitHub] lucene-solr pull request #129: LUCENE-7603: Support Graph Token Streams in Q...

dsmiley Fri, 30 Dec 2016 09:27:14 -0800

Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/129#discussion_r94243010
  
    --- Diff: lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java ---
    @@ -210,85 +199,41 @@ private void finish() {
        */
       private void finish(int maxDeterminizedStates) {
         Automaton automaton = builder.finish();
    -
    -    // System.out.println("before det:\n" + automaton.toDot());
    -
    -    Transition t = new Transition();
    -
    -    // TODO: should we add "eps back to initial node" for all states,
    -    // and det that?  then we don't need to revisit initial node at
    -    // every position?  but automaton could blow up?  And, this makes it
    -    // harder to skip useless positions at search time?
    -
    -    if (anyTermID != -1) {
    -
    -      // Make sure there are no leading or trailing ANY:
    -      int count = automaton.initTransition(0, t);
    -      for (int i = 0; i < count; i++) {
    -        automaton.getNextTransition(t);
    -        if (anyTermID >= t.min && anyTermID <= t.max) {
    -          throw new IllegalStateException("automaton cannot lead with an ANY transition");
    -        }
    -      }
    -
    -      int numStates = automaton.getNumStates();
    -      for (int i = 0; i < numStates; i++) {
    -        count = automaton.initTransition(i, t);
    -        for (int j = 0; j < count; j++) {
    -          automaton.getNextTransition(t);
    -          if (automaton.isAccept(t.dest) && anyTermID >= t.min && anyTermID <= t.max) {
    -            throw new IllegalStateException("automaton cannot end with an ANY transition");
    -          }
    -        }
    -      }
    -
    -      int termCount = termToID.size();
    -
    -      // We have to carefully translate these transitions so automaton
    -      // realizes they also match all other terms:
    -      Automaton newAutomaton = new Automaton();
    -      for (int i = 0; i < numStates; i++) {
    -        newAutomaton.createState();
    -        newAutomaton.setAccept(i, automaton.isAccept(i));
    -      }
    -
    -      for (int i = 0; i < numStates; i++) {
    -        count = automaton.initTransition(i, t);
    -        for (int j = 0; j < count; j++) {
    -          automaton.getNextTransition(t);
    -          int min, max;
    -          if (t.min <= anyTermID && anyTermID <= t.max) {
    -            // Match any term
    -            min = 0;
    -            max = termCount - 1;
    -          } else {
    -            min = t.min;
    -            max = t.max;
    -          }
    -          newAutomaton.addTransition(t.source, t.dest, min, max);
    -        }
    -      }
    -      newAutomaton.finishState();
    -      automaton = newAutomaton;
    -    }
    -
         det = Operations.removeDeadStates(Operations.determinize(automaton, maxDeterminizedStates));
       }
     
    -  private int getTermID(BytesRef term) {
    -    Integer id = termToID.get(term);
    -    if (id == null) {
    -      id = termToID.size();
    -      if (term != null) {
    -        term = BytesRef.deepCopyOf(term);
    -      }
    -      termToID.put(term, id);
    +  /**
    +   * Gets an integer id for a given term.
    +   *
    +   * If there is no position gaps for this token then we can reuse the id for the same term if it appeared at another
    +   * position without a gap.  If we have a position gap generate a new id so we can keep track of the position
    +   * increment.
    +   */
    +  private int getTermID(int incr, int prevIncr, BytesRef term) {
    +    assert term != null;
    +    boolean isStackedGap = incr == 0 && prevIncr > 1;
    +    boolean hasGap = incr > 1;
    +    term = BytesRef.deepCopyOf(term);
    --- End diff --
    
    The deepCopyOf is only needed when you generate a new ID, not when the term already has one.  As written, the copy happens on every call, before the lookup.
    
    BTW... have you seen BytesRefHash?  I think re-using it could cut down the code needed here to deal with this stuff.
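
A minimal sketch of the copy-on-miss idea, under stated assumptions: `TermIdSketch`, its `copies` counter, and the `getTermId(byte[], int, int)` signature are hypothetical and are not code from the pull request. `java.nio.ByteBuffer` stands in for `BytesRef` only so the example runs without Lucene on the classpath, and the position-increment handling from the diff is left out. The lookup wraps the caller's bytes without copying; a private copy is made only when a new ID has to be stored.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the termToID map in the diff. ByteBuffer plays the
// role of BytesRef here because its equals/hashCode are content-based.
final class TermIdSketch {
  private final Map<ByteBuffer, Integer> termToId = new HashMap<>();
  int copies = 0; // counts defensive copies, for illustration only

  int getTermId(byte[] buf, int offset, int length) {
    // Wrap without copying: safe for a probe key that is never stored.
    ByteBuffer probe = ByteBuffer.wrap(buf, offset, length);
    Integer id = termToId.get(probe);
    if (id == null) {
      // New term: only now pay for a private copy, because the caller
      // may reuse buf for the next token.
      byte[] copy = new byte[length];
      System.arraycopy(buf, offset, copy, 0, length);
      copies++;
      id = termToId.size();
      termToId.put(ByteBuffer.wrap(copy), id);
    }
    return id;
  }

  public static void main(String[] args) {
    TermIdSketch ids = new TermIdSketch();
    byte[] scratch = "foo".getBytes(StandardCharsets.UTF_8);
    int a = ids.getTermId(scratch, 0, 3); // miss: copies and assigns 0
    int b = ids.getTermId(scratch, 0, 3); // hit: no copy
    // Reuse the same buffer for a different term, as a token stream would.
    System.arraycopy("bar".getBytes(StandardCharsets.UTF_8), 0, scratch, 0, 3);
    int c = ids.getTermId(scratch, 0, 3); // miss: copies and assigns 1
    System.out.println(a + " " + b + " " + c + " copies=" + ids.copies);
    // prints "0 0 1 copies=2"
  }
}
```

Three lookups cost two copies here instead of three. A hash that owns its byte storage, which is what the BytesRefHash suggestion points at, would likely make the explicit copy unnecessary altogether.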



