Hit highlighting for non-english unicode index/queries not working?

KK Mon, 25 May 2009 07:03:02 -0700

Hi,
I'm trying to index some non-English texts. Indexing and searching are
working fine. From the command line I'm able to provide the UTF-8 Unicode text
as input like this,
\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
and I'm able to get the search results.
Then I tried to add hit highlighting for the same. So I started with simple
English texts and used phrase queries for providing input queries. My code
looks like this,



import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;
import java.io.*;
import java.nio.charset.Charset;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.SimpleAnalyzer;


/** Simple command-line based search demo. */
public class LuceneSearcher {
    private static final String indexPath = "/opt/lucene/index" + "/core36";
//core36 refers to the exact index directory for tamil pages

    private void searchIndex(String terms) throws Exception{
        String queryString = "";
        PhraseQuery phrase = new PhraseQuery();
        String[] termArray = terms.split(" ");
        for (int i=0; i<termArray.length; i++) {
            System.out.println("adding " + termArray[i]);
            //phrase.add(new Term("content", termArray[i]));
            //queryString += termArray[i];
        }
        // The Term text below is not run through any analyzer, so it has to
        // match an indexed token exactly.
        //phrase.add(new Term("content", "ubuntu"));
        String tamilQuery = "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0";
        //tamilQuery = "ubuntu";
        phrase.add(new Term("content", tamilQuery));
        phrase.setSlop(1);
        System.out.println("phrase query " + phrase.toString());

         IndexSearcher searcher = new IndexSearcher(indexPath);
        QueryParser queryParser = null;
        try {
            queryParser = new QueryParser("content", new SimpleAnalyzer());
        } catch (Exception ex) {
             ex.printStackTrace();
        }

        //Query query = queryParser.parse(queryString);

        // searchIndex already declares "throws Exception", so let a failed
        // search propagate instead of continuing with hits == null.
        Hits hits = searcher.search(phrase);
        //for highlighter section
        QueryScorer scorer = new QueryScorer(phrase);
        Highlighter highlighter = new Highlighter(scorer);

        for (int i = 0; i < hits.length(); i++) {
            String content = hits.doc(i).get("content");
            TokenStream stream = new SimpleAnalyzer().tokenStream(
                    "content", new StringReader(content));
            String fragment = highlighter.getBestFragments(
                    stream, content, 5, "...");
            System.out.println(fragment);
        }


        int hitCount = hits.length();
        System.out.println("Results found :" + hitCount);

        /*
        for (int ix=0; ix<hitCount; ix++) {
             Document doc = hits.doc(ix);
            System.out.println(doc.get("content"));
        }
        */
    }

    public static void main(String args[]) throws Exception{
         LuceneSearcher searcher = new LuceneSearcher();
        String termString = args[0];
        System.out.println("searching for " + args[0]);
        searcher.searchIndex(termString);
    }

}
----------------------code ends here---------------------------------
NB: Please ignore basic coding conventions [indentation, comments, etc.]. You
might find some unnecessary code intermixed with the highlighting code;
please ignore it.

Now when I searched some English docs I got the results with <B></B>
tags surrounding the hits, like this,

<B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
<B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices
that affect the current supported releases of <B>Ubuntu</B>. These notices
are also posted

Now I thought of testing the same for Tamil texts. Before this I would like
to add one more piece of information: prior to adding the code for highlighting
I was able to search a Lucene index from the command line using the raw
Unicode text like this,
[...@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"

and it gives me the page that matches the above query. Now I tried to do the
same along with highlighting. So in the code I posted above you can see that I
commented out the English terms and added one Tamil Unicode query, to see
if it gives me the same result that I was getting prior to
highlighting, and found that I'm not getting any results. This might be
because the query I'm forming using this Unicode text is wrong, or maybe
something else. I'm not able to figure out what exactly is going wrong. Some
silly mistake I guess, but I'm still not able to find it. Can someone take the
pain to go through the above code and find out what's wrong? Thank you very
much.
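
One thing worth checking when comparing the two runs: a \uXXXX escape written
inside a Java string literal is translated by the compiler into a real
character, while the same text typed as a command-line argument usually
reaches the program as literal backslash-u characters (assuming the shell
passes it through unchanged). The sketch below only illustrates that
difference; UnicodeEscapeDemo, isHex4 and decodeEscapes are hypothetical names
made up for this example and are not part of Lucene or of the code above.

```java
class UnicodeEscapeDemo {

    /** True if the four characters starting at 'from' are all hex digits. */
    static boolean isHex4(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Replaces each literal backslash + 'u' + four hex digits with the character it names. */
    public static String decodeEscapes(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("\\u", i) && isHex4(input, i + 2)) {
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Escapes in a source literal are translated by the compiler: 6 chars.
        String compiled = "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0";
        // The same text as a raw argument keeps its backslashes: 36 chars.
        String raw = "\\u0BAA\\u0BBF\\u0BB0\\u0B9A\\u0BC1\\u0BB0";
        System.out.println(compiled.length() + " vs " + raw.length());
        System.out.println("equal as-is:    " + compiled.equals(raw));
        System.out.println("equal decoded:  " + compiled.equals(decodeEscapes(raw)));
    }
}
```

If the command-line search and the hard-coded query really do send different
strings, printing the length of args[0] next to the length of tamilQuery
would show it immediately.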

Thanks,
KK.
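
A second possibility, offered as an assumption rather than a diagnosis: the
highlighter in the code above re-tokenizes the stored text with
SimpleAnalyzer, which in Lucene 2.x keeps only characters for which
Character.isLetter returns true and lowercases them. Tamil vowel signs such as
U+0BBF and U+0BC1 are combining marks, not letters, so a letter-only rule
would cut a Tamil word at each of them. The sketch below imitates that rule in
plain Java with no Lucene classes; LetterSplitDemo and letterTokens are
hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LetterSplitDemo {

    /** Splits text into lowercased runs of characters accepted by Character.isLetter. */
    public static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // English text splits on the spaces only.
        System.out.println(letterTokens("Ubuntu Security Notices"));
        // The Tamil query term from the post, written with source escapes.
        String tamil = "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0";
        List<String> pieces = letterTokens(tamil);
        System.out.println(pieces.size() + " pieces: " + pieces);
    }
}
```

If the index was built with an analyzer that behaves this way, the single
Term holding the whole Tamil word would not equal any indexed token, and the
highlighter's token stream would not contain it either. Comparing the tokens
actually stored in the index with the Term text is the quickest way to
confirm or rule this out.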
