[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609694#action_12609694 ]
Steven Rowe commented on LUCENE-1320:
-------------------------------------

Hi Karl,

The classes you introduce here look interesting, but the documentation is very sparse. Things I think should be addressed in the documentation:

* Where would you see this stuff being used - on the query side or the indexing side? Or both?
* Where would the matrix come from in a real-world scenario? It looks like there are (at least) three mechanisms for constructing the matrix - which one(s) make sense where?
* What do payloads have to do with the whole thing? (Looks like weight? ShingleMatrixFilter.calculateShingleWeight() should be explained at the class level - since it's public, I assume you mean for it to be overridable?)
* The various ShingleMatrixFilter constructors should have javadoc explaining their use.
* This class's use of the new flags feature looks interesting - a discussion in the documentation would be useful for future implementations.

A couple of random notes:

* Missing Apache license declarations: PrefixAndSuffixAwareTokenFilter.java and TestPrefixAndSuffixAwareTokenFilter.java
* Since you only use SingleTokenTokenStream in your tests, and since it likely will only ever be used in tests, I think it should be moved from src/java/ to src/test/.
* TestShingleMatrixFilter.TokenListStream looks generally useful for testing filters - maybe this could be pulled out as a separate class, maybe into the o.a.l.analysis.miscellaneous package?
* On line #83 of TestShingleMatrixFilter, it looks like the first assignment to ts could be removed:
{code:java}
83: ts = tls;
84: ts = new ShingleMatrixFilter(ts, 2, 2, null);
{code}

> ShingleMatrixFilter, a three dimensional permutating shingle filter
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1320
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column focused matrix that creates all permutations of shingle
> tokens in three dimensions. I.e. it handles multi token synonyms.
> Could for instance in some cases be used to replace 0-slop phrase queries
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains more and less complex tests that demonstrate offsets, posincr,
> payload boosts calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more
> than maximumShingleSize columns forward in the stream and clearing up unused
> resources (columns and unique token sets). Can still be optimized quite a bit
> though.
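
To illustrate the permutation order described in the issue summary, here is a minimal sketch on plain strings. It is not the code from the patch and does not use the Lucene API: the class and method names (ShingleMatrixSketch, shingles) are hypothetical, and the iteration order (the first column's row index varies fastest, duplicate shingles are dropped) is an assumption inferred from the test quoted in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the ShingleMatrixFilter from the patch.
final class ShingleMatrixSketch {

    /**
     * matrix[column][row][token]. Picks one row per column, flattens the picked
     * rows into one token sequence, and collects every window of
     * minSize..maxSize tokens. Returns unique shingles in first-seen order.
     */
    static List<String> shingles(String[][][] matrix, int minSize, int maxSize) {
        Set<String> seen = new LinkedHashSet<>();
        int[] rowIndex = new int[matrix.length]; // row currently picked in each column
        while (true) {
            // Flatten the picked row of every column, left to right.
            List<String> tokens = new ArrayList<>();
            for (int c = 0; c < matrix.length; c++) {
                for (String token : matrix[c][rowIndex[c]]) {
                    tokens.add(token);
                }
            }
            // For each start position, emit the shorter shingles first.
            for (int start = 0; start < tokens.size(); start++) {
                for (int size = minSize;
                     size <= maxSize && start + size <= tokens.size();
                     size++) {
                    seen.add(String.join("_", tokens.subList(start, start + size)));
                }
            }
            // Advance to the next permutation (assumed order: first column fastest).
            int c = 0;
            while (c < matrix.length && ++rowIndex[c] == matrix[c].length) {
                rowIndex[c] = 0;
                c++;
            }
            if (c == matrix.length) {
                break; // every column wrapped around: all permutations visited
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String[][][] matrix = {
            {{"hello"}, {"greetings", "and", "salutations"}},
            {{"world"}, {"earth"}, {"tellus"}}
        };
        for (String shingle : shingles(matrix, 2, 3)) {
            System.out.println(shingle);
        }
    }
}
```

Under these assumptions, running the sketch on the matrix from the description with 2-3 grams yields the same twelve shingles, in the same order, as the assertNext calls quoted above. Offsets, position increments, payload weights and flags are deliberately left out.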