Hi,

I'm trying to index some non-English texts. Indexing and searching are working fine. From the command line I can provide the text as Unicode escapes, like this: \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE, and I get the search results.

Then I tried to add hit highlighting. I started with simple English texts and used phrase queries for the input queries. My code looks like this:
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;
import java.io.*;
import java.nio.charset.Charset;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.SimpleAnalyzer;

/** Simple command-line based search demo. */
public class LuceneSearcher {

    // core36 refers to the exact index directory for tamil pages
    private static final String indexPath = "/opt/lucene/index" + "/core36";

    private void searchIndex(String terms) throws Exception {
        String queryString = "";
        PhraseQuery phrase = new PhraseQuery();
        String[] termArray = terms.split(" ");
        for (int i = 0; i < termArray.length; i++) {
            System.out.println("adding " + termArray[i]);
            //phrase.add(new Term("content", termArray[i]));
            //queryString += termArray[i];
        }
        //phrase.add(new Term("content", "ubuntu"));
        String tamilQuery = new String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
        //tamilQuery = new String("ubuntu");
        phrase.add(new Term("content", tamilQuery));
        phrase.setSlop(1);
        System.out.println("phrase query " + phrase.toString());

        IndexSearcher searcher = new IndexSearcher(indexPath);
        QueryParser queryParser = null;
        try {
            queryParser = new QueryParser("content", new SimpleAnalyzer());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        //Query query = queryParser.parse(queryString);

        Hits hits = null;
        try {
            hits = searcher.search(phrase);
        } catch (Exception ex) {
            ex.printStackTrace();
        }

        //for highlighter section
        QueryScorer scorer = new QueryScorer(phrase);
        Highlighter highlighter = new Highlighter(scorer);
        for (int i = 0; i < hits.length(); i++) {
            String content = hits.doc(i).get("content");
            TokenStream stream = new SimpleAnalyzer().tokenStream("content", new StringReader(content));
            String fragment = highlighter.getBestFragments(stream, content, 5, "...");
            System.out.println(fragment);
        }

        int hitCount = hits.length();
        System.out.println("Results found :" + hitCount);
        /*
        for (int ix = 0; ix < hitCount; ix++) {
            Document doc = hits.doc(ix);
            System.out.println(doc.get("content"));
        }
        */
    }

    public static void main(String args[]) throws Exception {
        LuceneSearcher searcher = new LuceneSearcher();
        String termString = args[0];
        System.out.println("searching for " + args[0]);
        searcher.searchIndex(termString);
    }
}
----------------------code ends
here---------------------------------

NB: Please ignore basic coding conventions [indentation, comments etc.]. You might find some unnecessary code intermixed with the highlighting code; please ignore it.

When I searched some English docs, I got results with <B></B> tags surrounding the hits, like this:

<B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices that affect the current supported releases of <B>Ubuntu</B>. These notices are also posted

Next I wanted to test the same thing for Tamil texts. One more piece of information first: before adding the highlighting code, I was able to search a Lucene index from the command line using the raw Unicode escapes, like this:

[...@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"

and it gave me the page that matches the above query.

Now I tried to do the same along with highlighting. In the code posted above you can see that I commented out the English terms and added one Tamil query written as Unicode escapes, to see whether it gives me the same result I was getting before adding highlighting. I'm not getting any results at all. This might be because the query I'm forming from these Unicode escapes is wrong, or it may be something else. I'm not able to figure out what exactly is going wrong. It is probably some silly mistake, but I still can't find it. Could someone take the trouble to go through the above code and find out what's wrong?

Thank you very much.

Thanks,
KK.
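
P.S. In case the form of the query string matters here: as far as I understand, javac converts \uXXXX escapes inside a string literal into the real characters at compile time, whereas the same text typed in double quotes on the shell reaches args[0] as a literal backslash, a "u" and four hex digits. I have not confirmed that this has anything to do with my problem. Below is a minimal sketch of decoding such escapes from a command-line argument; the class name UnicodeEscapeDecoder and the method decode are my own illustration and are not part of Lucene.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Lucene class): decodes backslash-u escapes in plain text. */
public class UnicodeEscapeDecoder {

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Returns the input with every escape sequence replaced by the character it names. */
    public static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits into a single UTF-16 code unit.
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and backslash in the replacement text.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Falls back to the escaped query from my example when no argument is given.
        String raw = args.length > 0 ? args[0] : "\\u0BAA\\u0BBF\\u0BB0\\u0B9A\\u0BC1\\u0BB0";
        String decoded = decode(raw);
        System.out.println("raw length: " + raw.length());
        System.out.println("decoded length: " + decoded.length());
    }
}
```

Comparing the two printed lengths shows whether the argument arrived as escape text or as real characters. Text without escapes, such as "ubuntu", passes through decode unchanged.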