[
https://issues.apache.org/jira/browse/LUCENE-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15788036#comment-15788036
]
ASF GitHub Bot commented on LUCENE-7603:
----------------------------------------
Github user dsmiley commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/129#discussion_r94243010
--- Diff:
lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java
---
@@ -210,85 +199,41 @@ private void finish() {
*/
private void finish(int maxDeterminizedStates) {
Automaton automaton = builder.finish();
-
- // System.out.println("before det:\n" + automaton.toDot());
-
- Transition t = new Transition();
-
- // TODO: should we add "eps back to initial node" for all states,
- // and det that? then we don't need to revisit initial node at
- // every position? but automaton could blow up? And, this makes it
- // harder to skip useless positions at search time?
-
- if (anyTermID != -1) {
-
- // Make sure there are no leading or trailing ANY:
- int count = automaton.initTransition(0, t);
- for (int i = 0; i < count; i++) {
- automaton.getNextTransition(t);
- if (anyTermID >= t.min && anyTermID <= t.max) {
- throw new IllegalStateException("automaton cannot lead with an ANY transition");
- }
- }
-
- int numStates = automaton.getNumStates();
- for (int i = 0; i < numStates; i++) {
- count = automaton.initTransition(i, t);
- for (int j = 0; j < count; j++) {
- automaton.getNextTransition(t);
- if (automaton.isAccept(t.dest) && anyTermID >= t.min && anyTermID <= t.max) {
- throw new IllegalStateException("automaton cannot end with an ANY transition");
- }
- }
- }
-
- int termCount = termToID.size();
-
- // We have to carefully translate these transitions so automaton
- // realizes they also match all other terms:
- Automaton newAutomaton = new Automaton();
- for (int i = 0; i < numStates; i++) {
- newAutomaton.createState();
- newAutomaton.setAccept(i, automaton.isAccept(i));
- }
-
- for (int i = 0; i < numStates; i++) {
- count = automaton.initTransition(i, t);
- for (int j = 0; j < count; j++) {
- automaton.getNextTransition(t);
- int min, max;
- if (t.min <= anyTermID && anyTermID <= t.max) {
- // Match any term
- min = 0;
- max = termCount - 1;
- } else {
- min = t.min;
- max = t.max;
- }
- newAutomaton.addTransition(t.source, t.dest, min, max);
- }
- }
- newAutomaton.finishState();
- automaton = newAutomaton;
- }
-
det = Operations.removeDeadStates(Operations.determinize(automaton, maxDeterminizedStates));
}
- private int getTermID(BytesRef term) {
- Integer id = termToID.get(term);
- if (id == null) {
- id = termToID.size();
- if (term != null) {
- term = BytesRef.deepCopyOf(term);
- }
- termToID.put(term, id);
+ /**
+ * Gets an integer id for a given term.
+ *
+ * If there is no position gaps for this token then we can reuse the id for the same term if it appeared at another
+ * position without a gap. If we have a position gap generate a new id so we can keep track of the position
+ * increment.
+ */
+ private int getTermID(int incr, int prevIncr, BytesRef term) {
+ assert term != null;
+ boolean isStackedGap = incr == 0 && prevIncr > 1;
+ boolean hasGap = incr > 1;
+ term = BytesRef.deepCopyOf(term);
--- End diff --
The deepCopyOf is only needed if you generate a new ID, not for an existing
one.
BTW... have you seen BytesRefHash? I think re-using that could minimize
the code here to deal with this stuff.
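To illustrate the first point, here is a minimal, dependency-free sketch of copying the term only when a new ID is assigned. The `Bytes` class is a hypothetical stand-in for `BytesRef`, and `TermIds`, `copies`, and the simplified one-argument `getTermID` are illustrative names, not the code in the pull request. The position-gap handling from the diff is deliberately left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class TermIds {
  /** Hypothetical stand-in for BytesRef: a mutable view over a byte[] the caller may reuse. */
  static final class Bytes {
    final byte[] data;
    Bytes(byte[] data) { this.data = data; }
    Bytes deepCopy() { return new Bytes(Arrays.copyOf(data, data.length)); }
    @Override public boolean equals(Object o) {
      return o instanceof Bytes && Arrays.equals(data, ((Bytes) o).data);
    }
    @Override public int hashCode() { return Arrays.hashCode(data); }
  }

  private final Map<Bytes, Integer> termToID = new HashMap<>();
  int copies = 0; // counts deep copies, only to make the behavior observable

  int getTermID(Bytes term) {
    // The lookup can safely use the caller's (possibly reused) buffer.
    Integer id = termToID.get(term);
    if (id == null) {
      id = termToID.size();
      // Copy only here: the map must own its key, since the caller may overwrite the buffer later.
      termToID.put(term.deepCopy(), id);
      copies++;
    }
    return id;
  }
}
```

Looking up the same term twice triggers a single copy. As far as I recall, `BytesRefHash.add` covers the same ground by copying the bytes internally and returning a negative value when the term is already present, which is why it could replace the map altogether; check its Javadoc before relying on that.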
> Support Graph Token Streams in QueryBuilder
> -------------------------------------------
>
> Key: LUCENE-7603
> URL: https://issues.apache.org/jira/browse/LUCENE-7603
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/queryparser, core/search
> Reporter: Matt Weber
>
> With [LUCENE-6664|https://issues.apache.org/jira/browse/LUCENE-6664] we can
> use multi-term synonyms at query time. A "graph token stream" will be created,
> which is nothing more than using the position length attribute on
> stacked tokens to indicate how many positions a token should span. Currently
> the position length attribute on tokens is ignored during query parsing.
> This issue will add support for handling these graph token streams inside the
> QueryBuilder utility class used by query parsers.