On 27 Aug 2007, at 20:30, anorman wrote:


I've tried to implement an analyzer that differs only in adding:

result = new ISOLatin1AccentFilter(result);

in the tokenStream method.

Everything appears to work, however with that change my search no longer matches any word containing diacritics. Without the filter it finds words such as "cèdulas" but not "cedulas"; with the filter it finds neither, so it just appears to be stripping those terms out altogether.

Do you use the same analyzer when searching as when creating the index?

--
karl
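
If not, that is the most likely cause: an accent-stripping analyzer applied only at query time produces terms like "cedulas" that can never match the accented terms already stored in the index. A minimal sketch of what has to line up (Lucene 2.x-era API; DiacriticsAnalyzer is a placeholder name for the custom accent-folding analyzer sketched further down in this thread, and indexDir stands for your index location):

// The same analysis chain must be used when writing the index
// and when parsing queries.
Analyzer analyzer = new DiacriticsAnalyzer();

// Index time: document text is tokenized and accent-folded by this analyzer.
IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
// ... writer.addDocument(doc); writer.close();

// Search time: the query text goes through the same accent folding.
QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse("cèdulas");   // analyzed to "cedulas", matching the folded index terms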



Any suggestions?




anorman wrote:

Can I do this at search time rather than index time? Below is my code
that is handling the searching, where would I utilize such a filter?

Thanks for the help!




package search.lucene.search;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import search.lucene.index.IndexManager;

/**
 * This class is used to search the
 * Lucene index and return search results
 */

public class SearchManager {

    private String searchWord;
    private IndexManager indexManager;
    private Analyzer analyzer;

    public SearchManager(String searchWord){
        this.searchWord   = searchWord;
        this.indexManager = new IndexManager();
        this.analyzer     = new StandardAnalyzer();
    }

    /**
     * Run the search and return a list of SearchResultBean objects.
     */
    public List search(){
        List searchResult = new ArrayList();

        IndexSearcher indexSearcher = null;
        try{
            indexSearcher = new IndexSearcher(indexManager.getIndexDir());
        }catch(IOException ioe){
            ioe.printStackTrace();
        }

        QueryParser queryParser = new QueryParser("content", analyzer);
        Query query = null;
        try {
            query = queryParser.parse(searchWord);
        } catch (ParseException e) {
            e.printStackTrace();
        }

        if(null != query && null != indexSearcher){
            try {
                Hits hits = indexSearcher.search(query);
                for(int i = 0; i < hits.length(); i++){

                    Document doc = hits.doc(i);
                    System.out.println(doc.get("filename"));

                    SearchResultBean resultBean = new SearchResultBean();
                    resultBean.setXMLId(doc.get("id"));
                    resultBean.setXMLTitle(doc.get("title"));
                    resultBean.setXMLAuthor(doc.get("author"));
                    resultBean.setXMLAbstract(doc.get("abstract"));
                    resultBean.setScore(hits.score(i));

                    searchResult.add(resultBean);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return searchResult;
    }
}
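
To answer the "where would I utilize such a filter" question: the filter lives inside the analyzer, so the only change needed in this class is where the analyzer is created in the constructor. A sketch, assuming the custom accent-folding analyzer (here called DiacriticsAnalyzer, a hypothetical name) exists and that the index was also built with it:

    public SearchManager(String searchWord){
        this.searchWord   = searchWord;
        this.indexManager = new IndexManager();
        // Swap in the accent-folding analyzer; the index must have been
        // written with the same analyzer for accented terms to match.
        this.analyzer = new DiacriticsAnalyzer();
    }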






thomas arni-2 wrote:

You can write your own Analyzer (modelled on the StandardAnalyzer).
The only thing you have to do is define the tokenStream method like
this:

  /** Constructs a {@link StandardTokenizer} filtered by a {@link StandardFilter},
      a {@link LowerCaseFilter}, a {@link StopFilter} and an
      {@link ISOLatin1AccentFilter}. */
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    result = new ISOLatin1AccentFilter(result);
    return result;
  }
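
A self-contained version of this, written as its own Analyzer rather than as a change to an existing one, might look like the sketch below (the class name DiacriticsAnalyzer and the use of the default English stop words are assumptions):

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class DiacriticsAnalyzer extends Analyzer {

  // Same default stop words that StandardAnalyzer uses.
  private final Set stopSet = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    // Fold accented Latin-1 characters (è, é, À, ...) to their unaccented equivalents.
    result = new ISOLatin1AccentFilter(result);
    return result;
  }
}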


anorman wrote:
This looks like exactly what I want. Would I implement this alongside another analyzer such as the StandardAnalyzer, or stand-alone? Does anyone have any code examples of implementing such a thing?

Thanks,
Albert




karl wettin-3 wrote:

On 27 Aug 2007, at 16:03, anorman wrote:


I have a searchable index of documents which contain French and Spanish
diacritics (è, é, À, etc.). I would like to make the content searchable so
that when a user searches for a word such as "Amèrique" or "Amerique"
(without the diacritic) it returns the same results.

Has anyone set up something similar?

ISOLatin1AccentFilter

--
karl
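
For a quick sense of what the filter does at the token level, a small sketch (Lucene 2.x token API; the input string is just an example):

import java.io.StringReader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class AccentDemo {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new ISOLatin1AccentFilter(
        new LowerCaseFilter(new StandardTokenizer(new StringReader("Amèrique"))));
    Token token = ts.next();
    System.out.println(token.termText());   // prints "amerique"; the accent is folded away
  }
}

So both "Amèrique" and "Amerique" end up as the same term once the filter is part of the analysis chain at index and search time.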



















