Re: One problem of using the lucene

jason Mon, 16 Jan 2006 21:14:35 -0800

Hi,

The following code is the SynonymFilter I wrote.



import org.apache.lucene.analysis.*;


import java.io.*;
import java.util.*;
/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymFilter extends TokenFilter {

    public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";

    private Stack synonymStack;
    private WordNetSynonymEngine engine;

    public SynonymFilter(TokenStream in, WordNetSynonymEngine engine){
        super(in);
        synonymStack = new Stack();
        this.engine = engine;
    }

    public Token next () throws IOException {
        if(synonymStack.size() > 0){
            return (Token) synonymStack.pop();
        }

        Token token = input.next();


        if(token == null){
            return null;
        }

        addAliasesToStack(token);

        return token;
    }

    private void addAliasesToStack(Token token) throws IOException {


        String [] synonyms = engine.getSynonyms(token.termText());

        if(synonyms == null) return;

        for(int i = 0; i < synonyms.length; i++) {
            Token synToken = new Token(synonyms[i], token.startOffset(),
                    token.endOffset(), TOKEN_TYPE_SYNONYM);

            // stack the synonym on the original token's position
            synToken.setPositionIncrement(0);

            synonymStack.push(synToken);
        }
    }
}
It adds each synonym at the same position as the original token (position
increment of 0). I then use the QueryParser for searching and the Snowball
analyzer for parsing.
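
To make the position behaviour concrete, here is a minimal sketch that mimics the filter above without depending on Lucene. The class SynonymPositionSketch, the Tok stand-in for Token, and the synonym map are hypothetical names used only for illustration; the real filter gets its synonyms from WordNetSynonymEngine. It shows that synonyms pushed on a stack come back in reverse order and share the position of the original term.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SynonymPositionSketch {

    /** Hypothetical stand-in for a Lucene Token: term text plus position increment. */
    public static final class Tok {
        public final String text;
        public final int posInc;
        public Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    /** Emits each term with increment 1, then its synonyms with increment 0. */
    public static List<Tok> expand(List<String> terms, Map<String, String[]> synonyms) {
        List<Tok> out = new ArrayList<>();
        Deque<Tok> stack = new ArrayDeque<>();
        for (String term : terms) {
            out.add(new Tok(term, 1));
            String[] syns = synonyms.get(term);
            if (syns != null) {
                for (String s : syns) {
                    stack.push(new Tok(s, 0)); // same position as the original term
                }
            }
            // As in SynonymFilter.next(), pending synonyms are drained first.
            while (!stack.isEmpty()) {
                out.add(stack.pop());
            }
        }
        return out;
    }

    /** Renders tokens as "position:text" so stacked tokens are easy to spot. */
    public static String render(List<Tok> toks) {
        StringBuilder sb = new StringBuilder();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            if (sb.length() > 0) sb.append(' ');
            sb.append(pos).append(':').append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String[]> syn = new HashMap<>();
        syn.put("support", new String[] {"back", "help"}); // made-up synonym list
        // prints: 0:we 1:support 1:help 1:back 2:lucene
        System.out.println(render(expand(Arrays.asList("we", "support", "lucene"), syn)));
    }
}
```

Note that every synonym is an extra token in the stream, even though it does not advance the position.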

The following is the SynonymAnalyzer I wrote.

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.snowball.*;

import java.io.*;
import java.util.*;

/**
 * @author JIANG XING
 *
 * Jan 15, 2006
 */
public class SynonymAnalyzer extends Analyzer {
    private WordNetSynonymEngine engine;
    private Set stopword;

    public SynonymAnalyzer(String [] word) {
        try{
            engine = new WordNetSynonymEngine(
                    new File("C:\\PDF2Text\\SearchEngine\\WordNetIndex"));
            stopword = StopFilter.makeStopSet(word);
        }catch(IOException e){
            e.printStackTrace();
        }
    }

    public TokenStream tokenStream(String fieldName, Reader reader){

        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopword != null){
          result = new StopFilter(result, stopword);
        }

        result = new SnowballFilter(result, "Lovins");

        result = new SynonymFilter(result, engine);

        return result;
    }

}
I added a check to the SnowballFilter (lines 75-79 of my copy, marked
"//check the tokens." in the listing below). If I use only the
SnowballFilter, the term "support" is found in all 17 documents.
However, once the line "result = new SynonymFilter(result, engine);" is
added, the term "support" can no longer be found in some of the documents.


public class SnowballFilter extends TokenFilter {
  private static final Object [] EMPTY_ARGS = new Object[0];

  private SnowballProgram stemmer;
  private Method stemMethod;

  /** Construct the named stemming filter.
   *
   * @param in the input tokens to stem
   * @param name the name of a stemmer
   */
  public SnowballFilter(TokenStream in, String name) {
    super(in);
    try {
      Class stemClass =
        Class.forName("net.sf.snowball.ext." + name + "Stemmer");
      stemmer = (SnowballProgram) stemClass.newInstance();
      // why doesn't the SnowballProgram class have an (abstract?) stem
      // method?
      stemMethod = stemClass.getMethod("stem", new Class[0]);
    } catch (Exception e) {
      throw new RuntimeException(e.toString());
    }
  }

  /** Returns the next input Token, after being stemmed */
  public final Token next() throws IOException {
    Token token = input.next();
    if (token == null)
      return null;
    stemmer.setCurrent(token.termText());
    try {
      stemMethod.invoke(stemmer, EMPTY_ARGS);
    } catch (Exception e) {
      throw new RuntimeException(e.toString());
    }

    Token newToken = new Token(stemmer.getCurrent(),
                      token.startOffset(), token.endOffset(), token.type());
    //check the tokens.
    if(newToken.termText().equals("support")){
        System.out.println("the term support is found");
    }

    newToken.setPositionIncrement(token.getPositionIncrement());
    return newToken;
  }
}
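
The check above only reports what the stemmer emits; it does not show what leaves the end of the analyzer chain. As a rough sketch of the same check done at the end of the chain, the following counts how often a term appears and how many tokens were produced in total. TermCountSketch and the TermSource interface are hypothetical stand-ins (TermSource plays the role of a TokenStream whose next() returns null at the end); with Lucene the loop would read tokens from the analyzer's tokenStream instead.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TermCountSketch {

    /** Hypothetical stand-in for a TokenStream: next term text, or null at end. */
    public interface TermSource {
        String next();
    }

    /** Wraps a fixed list of terms as a TermSource, for demonstration only. */
    public static TermSource fromList(List<String> terms) {
        final Iterator<String> it = terms.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Returns {occurrences of target, total number of tokens seen}. */
    public static int[] count(TermSource source, String target) {
        int hits = 0;
        int total = 0;
        for (String t = source.next(); t != null; t = source.next()) {
            total++;
            if (t.equals(target)) {
                hits++;
            }
        }
        return new int[] {hits, total};
    }

    public static void main(String[] args) {
        // Made-up token sequence, as it might look after synonym expansion.
        TermSource src = fromList(
                Arrays.asList("we", "support", "help", "back", "lucene", "support"));
        int[] r = count(src, "support");
        System.out.println("hits=" + r[0] + " total=" + r[1]);
    }
}
```

Running this once per document, with and without the SynonymFilter in the chain, would show whether "support" still reaches the end of the stream and how much the synonym expansion enlarges it.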



On 1/16/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> Could you share the details of your SynonymFilter?  Is it adding
> tokens into the same position as the original tokens (position
> increment of 0)?   Are you using QueryParser for searching?  If so,
> try TermQuery to eliminate the parser's analysis from the picture for
> the time being while trouble shooting.
>
> If you are using QueryParser, are you using the same analyzer?  If
> this is the case, what is the .toString of the generated Query?
>
>        Erik
>
>
> On Jan 16, 2006, at 3:54 AM, jason wrote:
>
> > Hi,
> >
> > I have a problem using Lucene.
> >
> > I wrote a SynonymFilter which adds synonyms from WordNet. I also use
> > the SnowballFilter for term stemming. However, I ran into a problem
> > when combining the two filters.
> >
> > For instance, I have 17 documents containing the term "support", and
> > the following is the SynonymAnalyzer I wrote.
> >
> > /**
> > *
> > */
> >  public TokenStream tokenStream(String fieldName, Reader reader){
> >
> >
> >         TokenStream result = new StandardTokenizer(reader);
> >         result = new StandardFilter(result);
> >         result = new LowerCaseFilter(result);
> >         if (stopword != null){
> >           result = new StopFilter(result, stopword);
> >         }
> >
> >         result = new SnowballFilter(result, "Lovins");
> >
> >        result = new SynonymFilter(result, engine);
> >
> >         return result;
> >     }
> >
> > If I use only the SnowballFilter, I can find "support" in all 17
> > documents. However, after adding the SynonymFilter, "support" can
> > be found in only 10 documents; it cannot be found in the remaining
> > 7 documents. I don't know what is wrong.
> >
> > regards
> >
> > jiang xing
>
>
