Thank you, Erick. As of now I'm using WhitespaceAnalyzer with no stemming and no stop word removal. After going through your mail, I feel that writing a simple analyzer won't be that difficult, so I'll give it a try. I don't know much about filters yet, but I'm pretty sure they must be simple, and I will definitely go through the examples in LIA 2nd Edn. Thank you.
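For anyone following the thread, here is a rough plain-Java sketch of the difference being discussed. It is not Lucene code: the class and method names are made up for illustration, and the two methods only approximate what a letter-based tokenizer (as in SimpleAnalyzer) and a whitespace tokenizer followed by a lowercase filter would do. It assumes that the letter-based split follows Character.isLetter(), which is false for the Tamil dependent vowel signs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch: plain Java only, no Lucene classes.
class TokenizerSketch {

    // Approximates a letter-based, lowercasing tokenizer: a token is a
    // maximal run of characters for which Character.isLetter() is true.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter (space, vowel sign, digit) ends the token
                // and is itself dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Approximates whitespace tokenizing followed by lowercasing:
    // tokens are split on whitespace only, so combining marks survive.
    static List<String> whitespaceLowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Ubuntu \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE";
        System.out.println(letterTokens(text).size());              // 4
        System.out.println(whitespaceLowercaseTokens(text).size()); // 2
    }
}
```

With the letter-based split the Tamil word falls apart into three fragments and loses its vowel signs, while the whitespace split keeps it whole and still folds "Ubuntu" to "ubuntu". The same analyzer has to be used at index time and at query time for the terms to match.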
--KK

On Tue, May 26, 2009 at 6:55 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> It's fairly easy to construct your own analyzer by stringing together some
> filters and tokenizers. LIA (1st ed) had a SynonymAnalyzer. You probably
> want something like
> (WARNING, example only, I'm not even sure it compiles!! Ripped
> off from the WIKI)
>
> public class MyAnalyzer extends Analyzer
> {
>     public TokenStream tokenStream(String field, final Reader reader) {
>         return new LowerCaseFilter(new WhitespaceTokenizer(reader));
>     }
> }
>
> There are a number of Filters you can string together if you want to, say,
> remove stop words etc.
>
> HTH
> Erick
>
> On Tue, May 26, 2009 at 6:38 AM, KK <dioxide.softw...@gmail.com> wrote:
>
> > Thank you @Muir.
> > I was earlier using SimpleAnalyzer for all purposes, but as you recommended
> > the whitespace one, I tried that analyzer, and the good thing is that
> > I'm able to index/search non-English text as well as support hit
> > highlighting for these non-English texts. Thank you very much.
> > But now there is one silly problem. As WhitespaceAnalyzer does not do
> > anything other than separate the tokens on whitespace, case-folding is
> > missing for English pages. Unless I provide the exact words with the
> > right case it does not give me results, which is quite obvious.
> > As I went through the LIA 2nd Edn book, I found it mentions that we can
> > use analyzers at the document level and also at the field level. I was
> > quite amazed at the granularity of analysis supported by Lucene. But it's
> > there; we just have to make use of it. So I'm thinking of giving it a try;
> > that will help me support both English and non-English
> > indexing/searching/highlighting. Thank you all. Any ideas on the same
> > are always welcome.
> >
> > Thanks,
> > KK.
> >
> > On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > As mentioned previously, I don't think your text is being analyzed the
> > > way you want.
> > >
> > > SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > (பரிணாம) into 3 tokens:
> > >
> > > \u0BAA\u0BB0
> > > \u0BA3
> > > \u0BAE
> > >
> > > Not only does it incorrectly split your word into three words, but it
> > > completely drops the dependent vowels (\u0BBF and \u0BBE).
> > >
> > > This is why I would recommend trying WhitespaceAnalyzer instead.
> > > Also take a look at the Luke index tool; it's a very quick way to see how
> > > your words are being analyzed by various analyzers.
> > >
> > > On Mon, May 25, 2009 at 10:02 AM, KK <dioxide.softw...@gmail.com> wrote:
> > >
> > > > Hi,
> > > > I'm trying to index some non-English texts. Indexing and searching are
> > > > working fine. From the command line I'm able to provide the UTF-8
> > > > Unicode text as input like this,
> > > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > > and get the search results.
> > > > Then I tried to add hit highlighting for the same. So I started with
> > > > simple English texts and used phrase queries for providing input queries.
> > > > My code looks like this,
> > > >
> > > > import java.io.FileReader;
> > > > import java.io.IOException;
> > > > import java.io.InputStreamReader;
> > > > import java.util.Date;
> > > > import java.io.*;
> > > > import java.nio.charset.Charset;
> > > >
> > > > import org.apache.lucene.analysis.Analyzer;
> > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > import org.apache.lucene.document.Document;
> > > > import org.apache.lucene.index.FilterIndexReader;
> > > > import org.apache.lucene.index.IndexReader;
> > > > import org.apache.lucene.index.Term;
> > > > import org.apache.lucene.queryParser.QueryParser;
> > > > import org.apache.lucene.search.HitCollector;
> > > > import org.apache.lucene.search.Hits;
> > > > import org.apache.lucene.search.IndexSearcher;
> > > > import org.apache.lucene.search.Query;
> > > > import org.apache.lucene.search.PhraseQuery;
> > > > import org.apache.lucene.search.ScoreDoc;
> > > > import org.apache.lucene.search.Searcher;
> > > > import org.apache.lucene.search.TopDocCollector;
> > > > import org.apache.lucene.search.highlight.Highlighter;
> > > > import org.apache.lucene.search.highlight.QueryScorer;
> > > > import org.apache.lucene.search.Scorer;
> > > > import org.apache.lucene.analysis.TokenStream;
> > > > import org.apache.lucene.analysis.SimpleAnalyzer;
> > > >
> > > > /** Simple command-line based search demo.
> > > >  */
> > > > public class LuceneSearcher {
> > > >     private static final String indexPath = "/opt/lucene/index" + "/core36";
> > > >     //core36 refers to the exact index directory for tamil pages
> > > >
> > > >     private void searchIndex(String terms) throws Exception {
> > > >         String queryString = "";
> > > >         PhraseQuery phrase = new PhraseQuery();
> > > >         String[] termArray = terms.split(" ");
> > > >         for (int i = 0; i < termArray.length; i++) {
> > > >             System.out.println("adding " + termArray[i]);
> > > >             //phrase.add(new Term("content", termArray[i]));
> > > >             //queryString += termArray[i];
> > > >         }
> > > >
> > > >         //phrase.add(new Term("content", "ubuntu"));
> > > >         String tamilQuery = new String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
> > > >         //tamilQuery = new String("ubuntu");
> > > >         phrase.add(new Term("content", tamilQuery));
> > > >         phrase.setSlop(1);
> > > >         System.out.println("phrase query " + phrase.toString());
> > > >
> > > >         IndexSearcher searcher = new IndexSearcher(indexPath);
> > > >         QueryParser queryParser = null;
> > > >         try {
> > > >             queryParser = new QueryParser("content", new SimpleAnalyzer());
> > > >         } catch (Exception ex) {
> > > >             ex.printStackTrace();
> > > >         }
> > > >
> > > >         //Query query = queryParser.parse(queryString);
> > > >
> > > >         Hits hits = null;
> > > >         try {
> > > >             hits = searcher.search(phrase);
> > > >         } catch (Exception ex) {
> > > >             ex.printStackTrace();
> > > >         }
> > > >         //for highlighter section
> > > >         QueryScorer scorer = new QueryScorer(phrase);
> > > >         Highlighter highlighter = new Highlighter(scorer);
> > > >
> > > >         for (int i = 0; i < hits.length(); i++) {
> > > >             String content = hits.doc(i).get("content");
> > > >             TokenStream stream = new SimpleAnalyzer().tokenStream("content",
> > > >                     new StringReader(content));
> > > >             String fragment = highlighter.getBestFragments(stream, content,
> > > >                     5, "...");
> > > >             System.out.println(fragment);
> > > >         }
> > > >
> > > >         int
> > > >             hitCount = hits.length();
> > > >         System.out.println("Results found :" + hitCount);
> > > >
> > > >         /*
> > > >         for (int ix = 0; ix < hitCount; ix++) {
> > > >             Document doc = hits.doc(ix);
> > > >             System.out.println(doc.get("content"));
> > > >         }
> > > >         */
> > > >     }
> > > >
> > > >     public static void main(String args[]) throws Exception {
> > > >         LuceneSearcher searcher = new LuceneSearcher();
> > > >         String termString = args[0];
> > > >         System.out.println("searching for " + args[0]);
> > > >         searcher.searchIndex(termString);
> > > >     }
> > > >
> > > > }
> > > > ----------------------code ends here---------------------------------
> > > > NB: Please ignore basic coding conventions [indentation, comments etc.].
> > > > You might find some unnecessary code intermixed with the highlighting
> > > > code; ignore it.
> > > >
> > > > Now when I searched for some English docs I got the results with <b></b>
> > > > tags surrounding the hits like this,
> > > >
> > > > <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> > > > <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices
> > > > that affect the current supported releases of <B>Ubuntu</B>. These notices
> > > > are also posted
> > > >
> > > > Now I thought of testing the same for Tamil texts. Before this I would
> > > > like to add one more piece of information: prior to adding the code for
> > > > highlighting, I was able to search a Lucene index from the command line
> > > > using the raw Unicode text like this,
> > > > [...@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
> > > >
> > > > and it gives me the page that matches the above query. Now I tried to do
> > > > the same along with highlighting.
> > > > So in the code I posted above you can see that I commented out the
> > > > English terms and added one Tamil Unicode query, to see if it gives me
> > > > the same result that I was getting prior to highlighting, and found that
> > > > I'm not getting any results. This might be because the query I'm forming
> > > > from this Unicode text is wrong, or maybe it is something else. I'm not
> > > > able to figure out what exactly is going wrong. Some silly mistake I
> > > > guess, but still I'm not able to find it. Can someone take the pain to go
> > > > through the above code and find out what's wrong? Thank you very much.
> > > >
> > > > Thanks,
> > > > KK.
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com