ShingleMatrixFilter, a three dimensional permutating shingle filter
-------------------------------------------------------------------

                 Key: LUCENE-1320
                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/analyzers
    Affects Versions: 2.3.2
            Reporter: Karl Wettin
            Assignee: Karl Wettin


Backed by a column-focused matrix that creates all permutations of shingle 
tokens in three dimensions, i.e. it handles multi-token synonyms.

Could, for instance, in some cases be used to replace 0-slop phrase queries 
with something speedier.

{code:java}
// layout: matrix[column][row][token]
Token[][][]{
  {{hello}, {greetings, and, salutations}},
  {{world}, {earth}, {tellus}}
}
{code}

passes the following test with 2- and 3-grams:

{code:java}
assertNext(ts, "hello_world");
assertNext(ts, "greetings_and");
assertNext(ts, "greetings_and_salutations");
assertNext(ts, "and_salutations");
assertNext(ts, "and_salutations_world");
assertNext(ts, "salutations_world");
assertNext(ts, "hello_earth");
assertNext(ts, "and_salutations_earth");
assertNext(ts, "salutations_earth");
assertNext(ts, "hello_tellus");
assertNext(ts, "and_salutations_tellus");
assertNext(ts, "salutations_tellus");
{code}
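
As an illustration of how the shingles above come about, here is a minimal
sketch. It is not the filter's actual implementation. It assumes plain strings
instead of Token objects, the layout matrix[column][row][token], and a
hypothetical class name ShingleMatrixSketch. It picks one row per column (first
column varying fastest), flattens that choice into a token sequence, and emits
each 2- to 3-token shingle the first time it is seen. The iteration order and
the de-duplication are inferred from the test expectations, not taken from the
patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ShingleMatrixSketch {

    /** matrix[column][row][token]; returns unique shingles in emission order. */
    static List<String> shingles(String[][][] matrix, int minSize, int maxSize) {
        Set<String> seen = new LinkedHashSet<>();   // keeps first-seen order, drops repeats
        int[] rowIndex = new int[matrix.length];    // currently selected row per column
        while (true) {
            // Flatten the selected row of every column into one token sequence.
            List<String> tokens = new ArrayList<>();
            for (int c = 0; c < matrix.length; c++) {
                for (String token : matrix[c][rowIndex[c]]) {
                    tokens.add(token);
                }
            }
            // Emit every shingle of minSize..maxSize tokens, by start position.
            for (int start = 0; start < tokens.size(); start++) {
                for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                    seen.add(String.join("_", tokens.subList(start, start + size)));
                }
            }
            // Advance to the next permutation like an odometer, first column fastest.
            int c = 0;
            while (c < matrix.length && ++rowIndex[c] == matrix[c].length) {
                rowIndex[c] = 0;
                c++;
            }
            if (c == matrix.length) {
                break; // every row combination has been visited
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String[][][] matrix = {
            {{"hello"}, {"greetings", "and", "salutations"}},
            {{"world"}, {"earth"}, {"tellus"}}
        };
        for (String shingle : shingles(matrix, 2, 3)) {
            System.out.println(shingle);
        }
    }
}
```

Run against the matrix above with sizes 2 to 3, the sketch yields the same
twelve shingles in the same order as the assertNext calls. Offsets, position
increments and payloads are left out entirely.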

Includes both simpler and more complex tests that demonstrate offsets, position 
increments (posincr), payload boost calculation, and construction of a matrix 
from a token stream.

The matrix tries to use as little memory as possible by seeking no more than 
maximumShingleSize columns forward in the stream and by releasing unused 
resources (columns and unique token sets). It can still be optimized quite a 
bit, though.
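
The bounded look-ahead can be pictured as a sliding window over columns. The
following is only a sketch under stated assumptions: ColumnWindowSketch and its
methods are hypothetical names, each column is reduced to a list of strings,
and the unique token sets are not modelled. The idea it illustrates is that,
assuming every column contributes at least one token, a shingle of at most
maximumShingleSize tokens cannot span more than that many columns, so older
columns can be released.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class ColumnWindowSketch {
    private final Iterator<List<String>> source;  // upstream columns, read lazily
    private final int maximumShingleSize;          // upper bound on buffered columns
    private final Deque<List<String>> window = new ArrayDeque<>();
    private int columnsRead = 0;

    ColumnWindowSketch(Iterator<List<String>> source, int maximumShingleSize) {
        this.source = source;
        this.maximumShingleSize = maximumShingleSize;
    }

    /** Reads ahead only until the window holds maximumShingleSize columns. */
    void fill() {
        while (window.size() < maximumShingleSize && source.hasNext()) {
            window.addLast(source.next());
            columnsRead++;
        }
    }

    /** Drops the oldest buffered column so it can be garbage collected. */
    void releaseFirst() {
        window.pollFirst();
    }

    int buffered() {
        return window.size();
    }

    int columnsRead() {
        return columnsRead;
    }

    public static void main(String[] args) {
        Iterator<List<String>> columns = Arrays.asList(
            Arrays.asList("a"), Arrays.asList("b"), Arrays.asList("c"),
            Arrays.asList("d"), Arrays.asList("e")).iterator();
        ColumnWindowSketch window = new ColumnWindowSketch(columns, 3);
        window.fill();
        while (window.buffered() > 0) {
            System.out.println("buffered=" + window.buffered() + " read=" + window.columnsRead());
            window.releaseFirst(); // shingles starting in the first column are done
            window.fill();         // pull at most one replacement column
        }
    }
}
```

With this shape, memory use is bounded by the window size rather than by the
length of the stream.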
