[jira] Updated: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter

Karl Wettin (JIRA) Mon, 30 Jun 2008 17:49:07 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Karl Wettin updated LUCENE-1320:
--------------------------------

    Attachment: LUCENE-1320.txt

This works pretty well, I'll commit it soon.

 * javadocs
 * improved default shingle token weights (still not that great)

Also did some optimizing and refactoring, which resulted in nicer looking code 
in the tests and these classes:

 * PrefixAwareTokenFilter
 * PrefixAndSuffixAwareTokenFilter
 * SingleTokenTokenStream

{code:java}
/**
 * Joins two token streams and leaves the last token of the prefix stream available
 * to be used when updating the token values in the second stream based on that token.
 */
public class PrefixAwareTokenFilter extends TokenStream {
  /** The default implementation adds last prefix token end offset to the suffix token start and end offsets. */
  public Token updateSuffixToken(Token suffixToken, Token lastPrefixToken) {

{code}
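
To make the default offset handling concrete, here is a minimal sketch that does not depend on Lucene. SketchToken and PrefixAwareJoinSketch are hypothetical stand-ins, not classes from the patch, and passing suffix tokens through untouched when the prefix is empty is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene Token: only the term text and offsets.
final class SketchToken {
  final String term;
  final int startOffset;
  final int endOffset;

  SketchToken(String term, int startOffset, int endOffset) {
    this.term = term;
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }
}

final class PrefixAwareJoinSketch {
  // Mirrors the documented default: shift the suffix token's start and end
  // offsets by the end offset of the last prefix token.
  static SketchToken updateSuffixToken(SketchToken suffixToken, SketchToken lastPrefixToken) {
    int shift = lastPrefixToken.endOffset;
    return new SketchToken(suffixToken.term,
        suffixToken.startOffset + shift,
        suffixToken.endOffset + shift);
  }

  // Emits every prefix token unchanged, then every suffix token after updating it.
  static List<SketchToken> join(List<SketchToken> prefix, List<SketchToken> suffix) {
    List<SketchToken> out = new ArrayList<>(prefix);
    SketchToken last = prefix.isEmpty() ? null : prefix.get(prefix.size() - 1);
    for (SketchToken t : suffix) {
      // Assumption: with an empty prefix there is nothing to shift by.
      out.add(last == null ? t : updateSuffixToken(t, last));
    }
    return out;
  }
}
```

With a prefix token "hello" at offsets 0-5 and a suffix token "world" at offsets 0-5, the joined stream carries "world" at offsets 5-10.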

{code:java}
/** Links two PrefixAwareTokenFilter */
public class PrefixAndSuffixAwareTokenFilter extends TokenStream {
  public Token updateInputToken(Token inputToken, Token lastPrefixToken) {
  public Token updateSuffixToken(Token suffixToken, Token lastInputToken) {
{code}
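
A rough sketch of how the two update hooks could fit together when a prefix, an input and a suffix stream are joined. It assumes both hooks default to the same offset shift described for PrefixAwareTokenFilter, and that suffix tokens are shifted relative to the already updated last input token; neither is confirmed by the excerpt above. Tok and PrefixAndSuffixJoinSketch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixAndSuffixJoinSketch {
  // Hypothetical token: term text plus start and end character offsets.
  static final class Tok {
    final String term;
    final int start;
    final int end;

    Tok(String term, int start, int end) {
      this.term = term;
      this.start = start;
      this.end = end;
    }
  }

  // Shifts a token's offsets by the end offset of the previous stream's last token.
  static Tok shift(Tok token, Tok lastOfPrevious) {
    return new Tok(token.term, token.start + lastOfPrevious.end, token.end + lastOfPrevious.end);
  }

  // Appends next to out, shifting each token relative to the last token
  // that was in out before this call.
  static void append(List<Tok> out, List<Tok> next) {
    Tok last = out.isEmpty() ? null : out.get(out.size() - 1);
    for (Tok t : next) {
      out.add(last == null ? t : shift(t, last));
    }
  }

  static List<Tok> join(List<Tok> prefix, List<Tok> input, List<Tok> suffix) {
    List<Tok> out = new ArrayList<>();
    append(out, prefix);  // prefix tokens pass through unchanged
    append(out, input);   // plays the role of updateInputToken(inputToken, lastPrefixToken)
    append(out, suffix);  // plays the role of updateSuffixToken(suffixToken, lastInputToken)
    return out;
  }
}
```

Joining a one-character prefix "^" (0-1), the input "hello" (0-5) and "world" (6-11), and a suffix "$" (0-1) yields offsets 0-1, 1-6, 7-12 and 12-13 under these assumptions.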




> ShingleMatrixFilter, a three dimensional permutating shingle filter
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1320
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column focused matrix that creates all permutations of shingle 
> tokens in three dimensions. I.e. it handles multi token synonyms.
> Could, for instance, in some cases be used to replace 0-slop phrase queries 
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains both simpler and more complex tests that demonstrate offsets, 
> position increments (posincr), payload boost calculation and construction of 
> a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more 
> than maximumShingleSize columns forward in the stream and clearing up unused 
> resources (columns and unique token sets). Can still be optimized quite a bit 
> though.
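
For readers who want to see the permutation idea from the description above in isolation, here is a minimal sketch using plain strings. It is not the filter's implementation: ShingleMatrixSketch and shingles are hypothetical names, the "_" separator is taken from the quoted test, and weights, offsets, position increments and payloads are not modeled. It assumes every column has at least one row.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ShingleMatrixSketch {
  // matrix[column][row][token]: each column holds alternative rows,
  // and each row holds one or more tokens.
  static List<String> shingles(String[][][] matrix, int minSize, int maxSize) {
    Set<String> seen = new LinkedHashSet<>();  // keeps first-seen order, drops repeats
    int[] rowIndex = new int[matrix.length];   // current row choice per column
    boolean more = matrix.length > 0;
    while (more) {
      // Flatten the chosen row of every column into one token sequence.
      List<String> tokens = new ArrayList<>();
      for (int c = 0; c < matrix.length; c++) {
        for (String t : matrix[c][rowIndex[c]]) {
          tokens.add(t);
        }
      }
      // Emit every n-gram of minSize..maxSize tokens from that sequence.
      for (int start = 0; start < tokens.size(); start++) {
        for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
          seen.add(String.join("_", tokens.subList(start, start + size)));
        }
      }
      // Advance the row choices like an odometer, first column fastest.
      more = false;
      for (int c = 0; c < matrix.length; c++) {
        if (++rowIndex[c] < matrix[c].length) {
          more = true;
          break;
        }
        rowIndex[c] = 0;
      }
    }
    return new ArrayList<>(seen);
  }
}
```

Called with the hello/greetings matrix from the description and sizes 2 to 3, this returns the twelve shingles of the quoted test in the same order. Dropping shingles that were already emitted by an earlier permutation is what keeps "greetings_and" and friends from repeating.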
