RE: A few questions regarding multi-word synonyms and parameters encoding

Ard Schrijvers Thu, 12 Jul 2007 02:41:06 -0700

Hello,

> 
> but honestly i haven't really tried anything like this ... the code for
> parsing the synonyms.txt file probably splits the individual synonyms on
> whitespace to produce multiple tokens which might screw you up ... you may
> need to get creative (perhaps use a PatternReplaceFilter to encode your
> spaces as "_" before the SynonymFilter and then another one to convert the
> "_" back to " " after the Synonym filter ... kludgy but it might work)


I had to build exactly this recently, but with plain Lucene rather than Solr. I
chose to create a CompressFilter as the last filter, which reduces all tokens
into one single token. Since these were facet fields, I knew there were only a
couple of tokens per field, not thousands; with thousands of tokens,
compressing them into a single token might be a problem (I am not sure).

So for building synonyms on facet fields which can contain multiple tokens, I
would write your own SynonymAnalyzer that compresses the tokens and, when the
compressed token is found in a synonym map, replaces it with the synonym.

So, in your SynonymAnalyzer, something like:

private Map synonyms; // compressed token text -> synonym; initialize before use

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, reader);
    if (fieldName.equals("synonym_field")) {
        // compress the tokens, then replace the result with its synonym
        result = new CompressFilter(result, synonyms);
    } else if (fieldName.equals("compressed_field")) {
        // compress only, no synonym lookup
        result = new CompressFilter(result);
    }
    return result;
}

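and your CompressFilter:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CompressFilter extends TokenFilter {

    private Map synonyms; // null means: compress only, no synonym lookup

    public CompressFilter(TokenStream in, Map synonyms) {
        super(in);
        this.synonyms = synonyms;
    }

    public CompressFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null; // stream exhausted
        }
        // concatenate the text of all remaining tokens into one string
        StringBuffer sb = new StringBuffer();
        while (t != null) {
            sb.append(t.termText());
            t = input.next();
        }
        String text = sb.toString();

        if (synonyms != null) {
            if (!synonyms.containsKey(text)) {
                return null; // synonym not found
            }
            text = (String) synonyms.get(text);
        }
        return new Token(text, 0, text.length());
    }
}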
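
To make the compress-and-lookup step easier to try without Lucene on the
classpath, here is a minimal sketch of the same logic on a plain list of
strings. The class name CompressSketch, the method compress, and the sample
synonym entry are made up for illustration; they are not part of Lucene or
Solr.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompressSketch {

    // Concatenates all tokens into one string, like CompressFilter.next().
    // With a synonym map, returns the mapped value, or null when the
    // compressed string has no entry. A null map means "compress only".
    static String compress(List<String> tokens, Map<String, String> synonyms) {
        if (tokens.isEmpty()) {
            return null; // empty stream: nothing to emit
        }
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token);
        }
        String compressed = sb.toString();
        if (synonyms == null) {
            return compressed;
        }
        return synonyms.get(compressed); // null when no synonym is found
    }

    public static void main(String[] args) {
        Map<String, String> synonyms = new HashMap<String, String>();
        synonyms.put("softwareengineer", "developer"); // hypothetical entry

        List<String> tokens = Arrays.asList("software", "engineer");
        System.out.println(compress(tokens, null));     // compress only
        System.out.println(compress(tokens, synonyms)); // compress + lookup
    }
}
```

Note that the tokens are joined without a separator, so the keys of the
synonym map have to be stored in the same compressed form.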

I am not sure how easy it is to put this in Solr, but I suppose it isn't hard.
I also do not know what happens with the CompressFilter when there are *many*
tokens in the "synonym_field" field.


Regards Ard


> 
> : Now I want to create a link for each of these values so that the user can
> : filter the results by that title by clicking on the link. For example, if
> : I click on "Software Engineer", the results are now narrowed down to just
> : include records with "Software Engineer" in their title. Since "title"
> : field can contain special chars like '+', '&' ..., I really can't find a
> : clean way to do this. At the moment, I replace all the spaces by '+' and
> : it seems to work for words like "Software engineer" (converted to
> : "Software+Engineer"). However, "C++ Programmer" is converted to
> : "C+++Programmer", and it doesn't seem to work (return no results). Any
> : ideas?
> 
> for starters you need to URL encode *all* of the characters, not just the
> spaces ... space escapes to "+" but only because "+" escapes to %2B.
> 
> second, if you are dealing with multi-word values like this in your
> facets, you need to make sure to quote them when doing fq queries too
> (before url encoding) ... so if you have a facet.field "skills" that lists
> "C++ Programmer" as the value, the fq query you want to use would be...
>      skills:"C++ Programmer"
> 
> when you URL encode that it should become...
> 
>      fq=skills%3A%22C%2B%2B+Programmer%22
> 
> ...use the echoParams=explicit&debugQuery=true params to see exactly what
> your params look like when they've been URL decoded and what your
> query objects look like once they've been parsed.
> 
> 
> 
> -Hoss
> 
>
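
The quote-then-encode order described above can be sketched in plain Java with
java.net.URLEncoder. The class name FqEncodeSketch and the helper buildFq are
hypothetical, and the sketch assumes the facet value itself contains no double
quotes (those would need escaping before the value is quoted).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FqEncodeSketch {

    // Hypothetical helper: quote the facet value first, then URL encode
    // the whole query string, so ':' '"' and '+' are all escaped.
    static String buildFq(String field, String value) {
        String query = field + ":\"" + value + "\""; // skills:"C++ Programmer"
        try {
            return "fq=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is unexpected
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints fq=skills%3A%22C%2B%2B+Programmer%22
        System.out.println(buildFq("skills", "C++ Programmer"));
    }
}
```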
