[ 
https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609986#action_12609986
 ] 

Karl Wettin commented on LUCENE-1320:
-------------------------------------

Thanks for the comments, Steve! I'll pop a more documented patch in soon. Here 
are my replies:

{quote}
Where would you see this stuff being used - on the query side or the indexing 
side?  Or both?
{quote}

Historically I've used shingles at both ends to replace phrase queries and to 
fix word de/composition problems. This implementation was, however, written to 
tokenize the 20 Newsgroups data for the cluster example in Mahout.

{quote}
Where would matrix come from in a real-world scenario?  It looks like there are 
(at least) three mechanisms for constructing the matrix - which one(s) make 
sense where?
{quote}

From a token stream. You need to implement a TokenSettingsCodec to tell the 
shingle filter how to position an input token in the matrix: in a new column, 
in a new row, or in the same row. It is also used to define how to get and set 
the float weight of a token.

{code:java}
  /**
   * Using this codec makes a {@link ShingleMatrixFilter} act like {@link ShingleFilter}.
   * It produces the simplest sort of shingles, ignoring token position increments, etc.
   *
   * It adds each token as a new column.
   */
  public static class OneDimensionalNonWeightedTokenSettingsCodec extends TokenSettingsCodec {

  /**
   * A codec that creates a two dimensional matrix
   * by treating tokens from the input stream with 0 position increment
   * as new rows to the current column.
   */
  public static class TwoDimensionalNonWeightedSynonymTokenSettingsCodec extends TokenSettingsCodec {

  /**
   * A full featured codec not to be used for something serious.
   *
   * It takes complete control of
   * payload for weight
   * and the bit flags for positioning in the matrix.
   *
   * Mainly exists for demonstration purposes.
   */
  public static class SimpleThreeDimensionalTokenSettingsCodec extends TokenSettingsCodec {
{code}
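
To make the positioning idea concrete, here is a minimal, self-contained sketch of how a codec decision could turn a flat token stream into a column-focused matrix. It does not use the Lucene classes from the patch: Placement, SketchToken, MatrixPlacementSketch and their methods are hypothetical names made up for illustration, and the zero position increment rule is only modelled on the two dimensional synonym codec described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for illustration; none of these names come from the patch.
enum Placement { NEW_COLUMN, NEW_ROW, SAME_ROW }

final class SketchToken {
    final String term;
    final int positionIncrement;

    SketchToken(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }
}

final class MatrixPlacementSketch {
    // Synonym rule: a zero position increment becomes a new row in the current column.
    static Placement synonymPlacement(SketchToken token) {
        return token.positionIncrement == 0 ? Placement.NEW_ROW : Placement.NEW_COLUMN;
    }

    // Builds columns -> rows -> terms, asking the codec where each token goes.
    static List<List<List<String>>> buildMatrix(
            List<SketchToken> stream, Function<SketchToken, Placement> codec) {
        List<List<List<String>>> columns = new ArrayList<>();
        for (SketchToken token : stream) {
            // The very first token always opens a column, whatever the codec says.
            Placement placement = columns.isEmpty() ? Placement.NEW_COLUMN : codec.apply(token);
            if (placement == Placement.NEW_COLUMN) {
                columns.add(new ArrayList<>());
            }
            List<List<String>> column = columns.get(columns.size() - 1);
            if (placement != Placement.SAME_ROW) {
                column.add(new ArrayList<>());
            }
            column.get(column.size() - 1).add(token.term);
        }
        return columns;
    }
}
```

Feeding it hello(+1), greetings(+0), world(+1), earth(+0), tellus(+0) with synonymPlacement gives two columns, the first with two rows and the second with three.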



{quote} 
What do payloads have to do with the whole thing?  (Looks like weight?
ShingleMatrixFilter.calculateShingleWeight() should be explained at the class 
level - since it's public, I assume you mean for it to be overridable?)
{quote}

Yeah, it's weights. They can be used at query time, at index time, or both. 
You could for instance want to produce a matrix with all sorts of weighted 
data in synonym space: stems, stems without diacritics, source tokens without 
diacritics, etc. Then you'd expect to see the weight difference in the 
shingles too.

Weights are turned off by always returning 1f from getWeight and ignoring 
calls to setWeight in your TokenSettingsCodec.
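
As a rough sketch of what "turned off" means here, assuming a pared-down codec that only deals with weights: WeightCodecSketch and NonWeightedCodecSketch are hypothetical names, and the real TokenSettingsCodec in the patch has more responsibilities than this.

```java
// Hypothetical, pared-down stand-in for the weight half of a codec.
interface WeightCodecSketch<T> {
    float getWeight(T token);

    void setWeight(T token, float weight);
}

// Every token weighs 1f and attempts to change that are silently dropped.
final class NonWeightedCodecSketch<T> implements WeightCodecSketch<T> {
    @Override
    public float getWeight(T token) {
        return 1f;
    }

    @Override
    public void setWeight(T token, float weight) {
        // Intentionally ignored: weights are switched off.
    }
}
```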

{quote}
Since you only use SingleTokenTokenStream in your tests, and since it likely 
will only ever be used in tests, I think it should be moved from src/java/ to 
src/test/.
{quote}

That's actually a real use case in the test. When replacing phrase queries with 
shingles you might want to boost the edges by adding (boosted) prefix and 
suffix tokens at index and query time:

{code:java}
ts = new PrefixAndSuffixAwareTokenFilter(
    new SingleTokenTokenStream(tokenFactory("^", 1, 100f, 0, 0)),
    tls,
    new SingleTokenTokenStream(tokenFactory("$", 1, 50f, 0, 0)));

assertNext(ts, "^_hello", 1, 10.049875f, 0, 4);
assertNext(ts, "^_greetings", 1, 10.049875f, 0, 4);
assertNext(ts, "hello_world", 1, 1.4142135f, 0, 10);
assertNext(ts, "greetings_world", 1, 1.4142135f, 0, 10);
assertNext(ts, "hello_earth", 1, 1.4142135f, 0, 10);
assertNext(ts, "greetings_earth", 1, 1.4142135f, 0, 10);
assertNext(ts, "hello_tellus", 1, 1.4142135f, 0, 10);
assertNext(ts, "greetings_tellus", 1, 1.4142135f, 0, 10);
assertNext(ts, "world_$", 1, 7.1414285f, 5, 10);
assertNext(ts, "earth_$", 1, 7.1414285f, 5, 10);
assertNext(ts, "tellus_$", 1, 7.1414285f, 5, 10);
assertNull(ts.next());
{code}

As you can see, the default weight calculation is sort of messed up. I'd 
prefer to see more impact from the weights of the prefix and suffix tokens. 
It's not too bad though.
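
The expected values in the test above are consistent with taking the square root of the summed token weights: sqrt(100 + 1) is about 10.049875, sqrt(1 + 1) about 1.4142135 and sqrt(50 + 1) about 7.1414285. That reading is inferred from the numbers alone, not taken from the patch, so treat the following as an assumption. ShingleWeightSketch and shingleWeight are hypothetical names.

```java
final class ShingleWeightSketch {
    // Assumption inferred from the expected test values, not read from the patch:
    // the shingle weight is the square root of the sum of its token weights.
    static float shingleWeight(float... tokenWeights) {
        double total = 0d;
        for (float weight : tokenWeights) {
            total += weight;
        }
        return (float) Math.sqrt(total);
    }
}
```

Under that assumption a prefix weight of 100f only lifts the shingle to roughly 10, which is the dampened impact mentioned above.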

{quote}
The various ShingleMatrixFilter constructors should have javadoc explaining 
their use.
{quote}

I'll do that, but the names of the constructor parameters are rather 
self-explanatory. It would just be a 

{quote}
This class's use of the new flags feature looks interesting - a discussion in 
the documentation would be useful for future implementations.
{quote}

It's rather terrible: I use the int as a single state value instead of as the 
bit set it is intended to be. It's just for demonstration purposes though.
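
For anyone wondering what the difference looks like, here is a small sketch. All constants and method names are hypothetical and the values are made up; they are not the ones used in the patch.

```java
final class FlagsUsageSketch {
    // Used as a state: the whole int holds exactly one of these values.
    static final int STATE_NEW_COLUMN = 0;
    static final int STATE_NEW_ROW = 1;
    static final int STATE_SAME_ROW = 2;

    // Used as a bit set: each constant owns one bit.
    static final int BIT_NEW_ROW = 1 << 0;
    static final int BIT_SAME_ROW = 1 << 1;
    static final int POSITION_MASK = BIT_NEW_ROW | BIT_SAME_ROW;

    // State style: overwrites whatever other components stored in the flags.
    static int setAsState(int flags, int state) {
        return state;
    }

    // Bit set style: only the positioning bits change, the rest survive.
    static int setAsBits(int flags, int positionBits) {
        return (flags & ~POSITION_MASK) | (positionBits & POSITION_MASK);
    }
}
```

With the state style, a flag set by some other filter is lost as soon as the positioning is written; with the bit set style it is preserved.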

{quote}
TestShingleMatrixFilter.TokenListStream looks generally useful for testing 
filters - maybe this could be pulled out as a separate class, maybe into the 
o.a.l.analysis.miscellaneous package?
{quote}

Or perhaps the CachingTokenFilter could be rewritten to accept a token 
collection in the constructor.




> ShingleMatrixFilter, a three dimensional permutating shingle filter
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1320
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column focused matrix that creates all permutations of shingle 
> tokens in three dimensions. I.e. it handles multi token synonyms.
> Could for instance in some cases be used to replace 0-slop phrase queries 
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains more and less complex tests that demonstrate offsets, posincr, 
> payload boosts calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more 
> than maximumShingleSize columns forward in the stream and clearing up unused 
> resources (columns and unique token sets). Can still be optimized quite a bit 
> though.
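
The permutation described in the issue summary above can be sketched without Lucene as follows. This is a naive illustration, not the filter's implementation: it uses plain strings instead of Token, materialises every row permutation instead of seeking a limited number of columns forward, and emits shingles in a different order than the quoted test. ShingleMatrixSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ShingleMatrixSketch {
    // matrix is columns -> rows -> tokens, as in the Token[][][] example.
    static Set<String> shingles(String[][][] matrix, int minSize, int maxSize) {
        Set<String> result = new LinkedHashSet<>();
        permute(matrix, 0, new ArrayList<>(), minSize, maxSize, result);
        return result;
    }

    // Picks one row per column, concatenating the rows into one token sequence.
    private static void permute(String[][][] matrix, int column, List<String> path,
                                int minSize, int maxSize, Set<String> result) {
        if (column == matrix.length) {
            addNGrams(path, minSize, maxSize, result);
            return;
        }
        for (String[] row : matrix[column]) {
            int mark = path.size();
            for (String token : row) {
                path.add(token);
            }
            permute(matrix, column + 1, path, minSize, maxSize, result);
            path.subList(mark, path.size()).clear(); // backtrack to try the next row
        }
    }

    // Adds every n-gram of the sequence with minSize <= n <= maxSize.
    private static void addNGrams(List<String> tokens, int minSize, int maxSize,
                                  Set<String> result) {
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                result.add(String.join("_", tokens.subList(start, start + size)));
            }
        }
    }
}
```

Run on the example matrix with sizes 2 to 3, this yields the same twelve distinct shingles that the quoted test asserts, ignoring order.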
