[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609986#action_12609986 ]
Karl Wettin commented on LUCENE-1320:
-------------------------------------

Thanks for the comments, Steve! I'll pop a better documented patch in soon. Here are my replies:

{quote}
Where would you see this stuff being used - on the query side or the indexing side? Or both?
{quote}

Historically I've used shingles at both ends to replace phrase queries and to fix word de/composition problems. This implementation was, however, written to tokenize the 20 newsgroups data for the cluster example in Mahout.

{quote}
Where would matrix come from in a real-world scenario? It looks like there are (at least) three mechanisms for constructing the matrix - which one(s) make sense where?
{quote}

From a token stream. You need to implement a TokenSettingsCodec to tell the shingle filter how to position an input token in the matrix: in a new column, in a new row or in the same row. The codec also defines how to get and set the float weight of a token.

{code:java}
/**
 * Using this codec makes a {@link ShingleMatrixFilter} act like {@link ShingleFilter}.
 * It produces the most simple sort of shingles, ignoring token position increments, etc.
 *
 * It adds each token as a new column.
 */
public static class OneDimensionalNonWeightedTokenSettingsCodec extends TokenSettingsCodec {

/**
 * A codec that creates a two dimensional matrix
 * by treating tokens from the input stream with 0 position increment
 * as new rows to the current column.
 */
public static class TwoDimensionalNonWeightedSynonymTokenSettingsCodec extends TokenSettingsCodec {

/**
 * A full featured codec not to be used for something serious.
 *
 * It takes complete control of
 * payload for weight
 * and the bit flags for positioning in the matrix.
 *
 * Mainly exists for demonstrational purposes.
 */
public static class SimpleThreeDimensionalTokenSettingsCodec extends TokenSettingsCodec {
{code}

{quote}
What do payloads have to do with the whole thing? (Looks like weight?
ShingleMatrixFilter.calculateShingleWeight() should be explained at the class level - since it's public, I assume you mean for it to be overridable?)
{quote}

Yeah, it's weights. They can be used at query time, at index time, or both. You could for instance want to produce a matrix with all sorts of weighted data in synonym space: stems, stems without diacritics, source tokens without diacritics, etc. Then you'd expect to see the weight difference in the shingles too. Weights are turned off by always returning 1f from getWeight and ignoring calls to setWeight in your TokenSettingsCodec.

{quote}
Since you only use SingleTokenTokenStream in your tests, and since it likely will only ever be used in tests, I think it should be moved from src/java/ to src/test/.
{quote}

That's actually a real use case in the test. When replacing phrase queries with shingles you might want to boost the edges by adding (boosted) prefix and suffix tokens at index and query time:

{code:java}
// Wrap the input stream tls with a boosted prefix token "^" (weight 100f)
// and a boosted suffix token "$" (weight 50f).
ts = new PrefixAndSuffixAwareTokenFilter(
    new SingleTokenTokenStream(tokenFactory("^", 1, 100f, 0, 0)),
    tls,
    new SingleTokenTokenStream(tokenFactory("$", 1, 50f, 0, 0)));

// Shingles that include the prefix token
assertNext(ts, "^_hello", 1, 10.049875f, 0, 4);
assertNext(ts, "^_greetings", 1, 10.049875f, 0, 4);

// Shingles built from unboosted tokens only
assertNext(ts, "hello_world", 1, 1.4142135f, 0, 10);
assertNext(ts, "greetings_world", 1, 1.4142135f, 0, 10);
assertNext(ts, "hello_earth", 1, 1.4142135f, 0, 10);
assertNext(ts, "greetings_earth", 1, 1.4142135f, 0, 10);
assertNext(ts, "hello_tellus", 1, 1.4142135f, 0, 10);
assertNext(ts, "greetings_tellus", 1, 1.4142135f, 0, 10);

// Shingles that include the suffix token
assertNext(ts, "world_$", 1, 7.1414285f, 5, 10);
assertNext(ts, "earth_$", 1, 7.1414285f, 5, 10);
assertNext(ts, "tellus_$", 1, 7.1414285f, 5, 10);

assertNull(ts.next());
{code}

As you can see, the default weight calculation is sort of messed up. I'd prefer to see more impact from the weight of the prefix and the suffix token. It's not too bad though.
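The expected weights in the assertions above are consistent with taking the square root of the summed token weights: sqrt(100 + 1) for "^_hello", sqrt(1 + 1) for "hello_world" and sqrt(1 + 50) for "world_$". The following is only a sketch of that arithmetic, inferred from the numbers in the test and not copied from the patch; the class and method names are hypothetical.

```java
/**
 * Hypothetical helper, not part of Lucene or the patch.
 * Assumption: a shingle's weight is sqrt(sum of its token weights),
 * which is what the expected values in the test above suggest.
 */
final class ShingleWeightSketch {

    private ShingleWeightSketch() {}

    /** Returns the square root of the summed token weights. */
    static float shingleWeight(float... tokenWeights) {
        double total = 0d;
        for (float w : tokenWeights) {
            total += w;
        }
        return (float) Math.sqrt(total);
    }

    public static void main(String[] args) {
        // Prefix "^" has weight 100f, suffix "$" has weight 50f, other tokens 1f.
        System.out.println("^_hello     -> " + shingleWeight(100f, 1f));
        System.out.println("hello_world -> " + shingleWeight(1f, 1f));
        System.out.println("world_$     -> " + shingleWeight(1f, 50f));
    }
}
```

Under this assumption the square root dampens large boosts heavily: a prefix weight of 100 ends up contributing a shingle weight of only about 10, which is why the prefix and suffix tokens have less impact than one might want.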
{quote}
The various ShingleMatrixFilter constructors should have javadoc explaining their use.
{quote}

I'll do that, but the names of the constructor parameters are rather self-explanatory. It would just be a

{quote}
This class's use of the new flags feature looks interesting - a discussion in the documentation would be useful for future implementations.
{quote}

It's rather terrible: I use the int as a state instead of at the intended bitset level. It's just for demonstration purposes though.

{quote}
TestShingleMatrixFilter.TokenListStream looks generally useful for testing filters - maybe this could be pulled out as a separate class, maybe into the o.a.l.analysis.miscellaneous package?
{quote}

Or perhaps the CachingTokenFilter could be rewritten to accept a token collection in the constructor.

> ShingleMatrixFilter, a three dimensional permutating shingle filter
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1320
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>   Affects Versions: 2.3.2
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column focused matrix that creates all permutations of shingle
> tokens in three dimensions. I.e. it handles multi token synonyms.
> Could for instance in some cases be used to replace 0-slop phrase queries
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains more and less complex tests that demonstrate offsets, posincr,
> payload boosts calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more
> than maximumShingleSize columns forward in the stream and clearing up unused
> resources (columns and unique token sets). Can still be optimized quite a bit
> though.