[ 
https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609694#action_12609694
 ] 

Steven Rowe commented on LUCENE-1320:
-------------------------------------

Hi Karl,

The classes you introduce here look interesting, but the documentation is very 
sparse.
  
Things I think should be addressed in the documentation:

* Where would you see this stuff being used - on the query side or the indexing 
side?  Or both? 
* Where would the matrix come from in a real-world scenario?  It looks like there 
are (at least) three mechanisms for constructing the matrix - which one(s) make 
sense where?
* What do payloads have to do with the whole thing?  (Looks like weight?  
ShingleMatrixFilter.calculateShingleWeight() should be explained at the class 
level - since it's public, I assume you mean for it to be overridable?)
* The various ShingleMatrixFilter constructors should have javadoc explaining 
their use.
* This class's use of the new flags feature looks interesting - a discussion in 
the documentation would be useful for future implementations.

A couple of random notes:

* Missing Apache license declarations: PrefixAndSuffixAwareTokenFilter.java and 
TestPrefixAndSuffixAwareTokenFilter.java
* Since you only use SingleTokenTokenStream in your tests, and since it likely 
will only ever be used in tests, I think it should be moved from src/java/ to 
src/test/.
* TestShingleMatrixFilter.TokenListStream looks generally useful for testing 
filters - maybe this could be pulled out as a separate class, maybe into the 
o.a.l.analysis.miscellaneous package?
* On line #83 of TestShingleMatrixFilter, it looks like the first assignment to 
ts could be removed (line #84 could take tls directly):

{code:java}
83:   ts = tls;
84:   ts = new ShingleMatrixFilter(ts, 2, 2, null);
{code}


> ShingleMatrixFilter, a three dimensional permutating shingle filter
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1320
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column-focused matrix that creates all permutations of shingle 
> tokens in three dimensions, i.e. it handles multi-token synonyms.
> Could, for instance, in some cases be used to replace 0-slop phrase queries 
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains more and less complex tests that demonstrate offsets, posincr, 
> payload boosts calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more 
> than maximumShingleSize columns forward in the stream and clearing up unused 
> resources (columns and unique token sets). Can still be optimized quite a bit 
> though.
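
To make the matrix description above concrete, here is a minimal plain-Java 
sketch of the permutation idea. It is not the code from the patch: the class 
name ShingleMatrixSketch, the shingles() method and the use of plain strings 
instead of Tokens are all assumptions made for illustration, and it ignores 
offsets, position increments, payload weights and the memory-saving column 
window. It picks one row per column, flattens the picked rows into a token 
sequence, and collects every unique n-gram between the minimum and maximum 
shingle size.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the LUCENE-1320 patch.
class ShingleMatrixSketch {

    /**
     * matrix[column][row][token]: each column holds alternative rows
     * (synonyms), and each row may hold several tokens.
     * Returns the unique shingles of minSize..maxSize tokens, joined with '_'.
     * Assumes every column has at least one row.
     */
    static List<String> shingles(String[][][] matrix, int minSize, int maxSize) {
        Set<String> seen = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        int[] rowIndex = new int[matrix.length];  // currently selected row per column
        while (true) {
            // Flatten the selected row of every column into one token sequence.
            List<String> tokens = new ArrayList<>();
            for (int c = 0; c < matrix.length; c++) {
                for (String token : matrix[c][rowIndex[c]]) {
                    tokens.add(token);
                }
            }
            // Collect every n-gram of the allowed sizes from that sequence.
            for (int start = 0; start < tokens.size(); start++) {
                for (int size = minSize;
                     size <= maxSize && start + size <= tokens.size();
                     size++) {
                    seen.add(String.join("_", tokens.subList(start, start + size)));
                }
            }
            // Move to the next row combination, first column varying fastest.
            int c = 0;
            while (c < matrix.length && ++rowIndex[c] == matrix[c].length) {
                rowIndex[c] = 0;
                c++;
            }
            if (c == matrix.length) {
                break; // every combination has been visited
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String[][][] matrix = {
            {{"hello"}, {"greetings", "and", "salutations"}},
            {{"world"}, {"earth"}, {"tellus"}}
        };
        for (String shingle : shingles(matrix, 2, 3)) {
            System.out.println(shingle);
        }
    }
}
```

For the example matrix with 2-3 grams, this sketch yields the same twelve 
shingles as the assertNext calls above, in the same order. That agreement is 
only checked for this one example and says nothing about how the filter 
itself iterates.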


