Re: Hit highlighting for non-english unicode index/queries not working?

KK Tue, 26 May 2009 06:51:52 -0700

Thank you Erick.
As of now I'm using WhitespaceAnalyzer with no stemming and no stop word
removal. After going through your mail, I feel that writing a simple analyzer
won't be that difficult. I'll give it a try. I don't have any idea about filters,
but I'm pretty sure they must be simple, and I'll definitely go through the
examples in LIA 2nd Edn. Thank you.


--KK
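
For illustration, the effect of the whitespace-tokenizer-plus-lowercase-filter chain discussed in the quoted reply can be sketched in plain Java, without Lucene on the classpath. The class WhitespaceLowercaseSketch and its analyze method are hypothetical names made up for this sketch and are not Lucene APIs; a real analyzer would wrap a WhitespaceTokenizer in a LowerCaseFilter as in the quoted example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WhitespaceLowercaseSketch {

    // Hypothetical helper (not a Lucene API): split on runs of whitespace,
    // then lowercase each token. Scripts without case, such as Tamil,
    // pass through toLowerCase unchanged.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // English text is case-folded.
        System.out.println(analyze("Ubuntu Security Notices"));
        // The Tamil word stays a single token; only the English word changes.
        System.out.println(analyze("\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE Ubuntu").size());
    }
}
```

Because lowercasing leaves caseless scripts untouched, one chain of this shape could plausibly serve both the English and the Tamil pages, provided the same analysis is applied at index time and at query time.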

On Tue, May 26, 2009 at 6:55 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> It's fairly easy to construct your own analyzer by stringing together some
> filters and tokenizers. LIA (1st ed)
> had a SynonymAnalyzer. You probably want something like
> (WARNING, example only, I'm not even sure it compiles!! Ripped
> off from the wiki)
>
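> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.LowerCaseFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.WhitespaceTokenizer;
>
> public class MyAnalyzer extends Analyzer
> {
>    public TokenStream tokenStream (String field, final Reader reader) {
>            return new LowerCaseFilter (new WhitespaceTokenizer(reader));
>   }
> }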
>
> There are a number of Filters you can string together if you want to, say,
> remove stop words etc..
>
> HTH
> Erick
>
> On Tue, May 26, 2009 at 6:38 AM, KK <dioxide.softw...@gmail.com> wrote:
>
> > Thank you @Muir.
> > I was earlier using SimpleAnalyzer for all purposes, but as you recommended
> > the whitespace one, I tried that analyzer, and the good thing is that
> > I'm able to index/search non-English text as well as support hit
> > highlighting for these non-English texts. Thank you very much.
> > But now there is one silly problem. As WhitespaceAnalyzer does not do
> > anything other than separate the tokens on whitespace, case-folding is
> > missing for English pages. Unless I provide the exact words, including
> > the right case, it does not give me results, which is quite obvious.
> > As I went through the LIA 2nd Edn book, I found that it mentions we can
> > use analyzers at the document level and also at the field level. I was
> > quite amazed at the granularity of analysis supported by Lucene. It's
> > there; we just have to make use of it. So I'm thinking of giving it a try,
> > which should help me support both English and non-English
> > indexing/searching/highlighting.
> > Thank you all. Any ideas on the same are always welcome.
> >
> > Thanks,
> > KK.
> >
> >
> > On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > As mentioned previously, I don't think your text is being analyzed the way
> > > you want.
> > >
> > > SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > (பரிணாம) into 3 tokens:
> > >
> > > \u0BAA\u0BB0
> > > \u0BA3
> > > \u0BAE
> > >
> > > Not only does it incorrectly split your word into three words, but it
> > > completely drops the dependent vowels (\u0BBF and \u0BBE).
> > >
> > > This is why I would recommend trying WhitespaceAnalyzer instead.
> > > Also take a look at the Luke index tool; it's a very quick way to see how
> > > your words are being analyzed by various analyzers.
> > >
> > >
> > > On Mon, May 25, 2009 at 10:02 AM, KK <dioxide.softw...@gmail.com> wrote:
> > >
> > > > Hi,
> > > > I'm trying to index some non-English texts. Indexing and searching are
> > > > working fine. From the command line I'm able to provide the UTF-8 Unicode
> > > > text as input like this,
> > > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > > and am able to get the search results.
> > > > Then I tried to add hit highlighting for the same. So I started with simple
> > > > English texts and used phrase queries for the input queries. My code
> > > > looks like this,
> > > >
> > > >
> > > > import java.io.FileReader;
> > > > import java.io.IOException;
> > > > import java.io.InputStreamReader;
> > > > import java.util.Date;
> > > > import java.io.*;
> > > > import java.nio.charset.Charset;
> > > >
> > > > import org.apache.lucene.analysis.Analyzer;
> > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > import org.apache.lucene.document.Document;
> > > > import org.apache.lucene.index.FilterIndexReader;
> > > > import org.apache.lucene.index.IndexReader;
> > > > import org.apache.lucene.index.Term;
> > > > import org.apache.lucene.queryParser.QueryParser;
> > > > import org.apache.lucene.search.HitCollector;
> > > > import org.apache.lucene.search.Hits;
> > > > import org.apache.lucene.search.IndexSearcher;
> > > > import org.apache.lucene.search.Query;
> > > > import org.apache.lucene.search.PhraseQuery;
> > > > import org.apache.lucene.search.ScoreDoc;
> > > > import org.apache.lucene.search.Searcher;
> > > > import org.apache.lucene.search.TopDocCollector;
> > > > import org.apache.lucene.search.highlight.Highlighter;
> > > > import org.apache.lucene.search.highlight.QueryScorer;
> > > > import org.apache.lucene.search.Scorer;
> > > > import org.apache.lucene.analysis.TokenStream;
> > > > import org.apache.lucene.analysis.SimpleAnalyzer;
> > > >
> > > >
> > > > /** Simple command-line based search demo. */
> > > > public class LuceneSearcher {
> > > >    private static final String indexPath = "/opt/lucene/index" + "/core36";
> > > > //core36 refers to the exact index directory for tamil pages
> > > >
> > > >    private void searchIndex(String terms) throws Exception{
> > > >        String queryString = "";
> > > >        PhraseQuery phrase = new PhraseQuery();
> > > >        String[] termArray = terms.split(" ");
> > > >        for (int i=0; i<termArray.length; i++) {
> > > >            System.out.println("adding " + termArray[i]);
> > > >            //phrase.add(new Term("content", termArray[i]));
> > > >            //queryString += termArray[i];
> > > >        }
> > > >        //phrase.add(new Term("content", "ubuntu"));
> > > >        String tamilQuery = new String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
> > > >        //tamilQuery = new String("ubuntu");
> > > >        phrase.add(new Term("content", tamilQuery));
> > > >        phrase.setSlop(1);
> > > >        System.out.println("phrase query " + phrase.toString());
> > > >
> > > >         IndexSearcher searcher = new IndexSearcher(indexPath);
> > > >        QueryParser queryParser = null;
> > > >        try {
> > > >            queryParser = new QueryParser("content", new SimpleAnalyzer());
> > > >        } catch (Exception ex) {
> > > >             ex.printStackTrace();
> > > >        }
> > > >
> > > >        //Query query = queryParser.parse(queryString);
> > > >
> > > >        Hits hits = null;
> > > >        try {
> > > >             hits = searcher.search(phrase);
> > > >        } catch (Exception ex) {
> > > >             ex.printStackTrace();
> > > >        }
> > > >        //for highlighter section
> > > >        QueryScorer scorer = new QueryScorer(phrase);
> > > >        Highlighter highlighter = new Highlighter(scorer);
> > > >
> > > >        for (int i = 0; i < hits.length(); i++) {
> > > >            String content = hits.doc(i).get("content");
> > > >            TokenStream stream = new SimpleAnalyzer().tokenStream("content", new StringReader(content));
> > > >            String fragment = highlighter.getBestFragments(stream, content, 5, "...");
> > > >            System.out.println(fragment);
> > > >        }
> > > >
> > > >
> > > >        int hitCount = hits.length();
> > > >        System.out.println("Results found :" + hitCount);
> > > >
> > > >        /*
> > > >        for (int ix=0; ix<hitCount; ix++) {
> > > >             Document doc = hits.doc(ix);
> > > >            System.out.println(doc.get("content"));
> > > >        }
> > > >        */
> > > >    }
> > > >
> > > >    public static void main(String args[]) throws Exception{
> > > >         LuceneSearcher searcher = new LuceneSearcher();
> > > >        String termString = args[0];
> > > >        System.out.println("searching for " + args[0]);
> > > >        searcher.searchIndex(termString);
> > > >    }
> > > >
> > > > }
> > > > ----------------------code ends here---------------------------------
> > > > NB: Please ignore basic coding conventions [indentation, comments, etc.].
> > > > You might find some unnecessary code intermixed with the highlighting code;
> > > > ignore it.
> > > >
> > > > Now when I searched for some English docs I got the results with <b></b>
> > > > tags surrounding the hits like this,
> > > >
> > > > <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> > > > <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices
> > > > that affect the current supported releases of <B>Ubuntu</B>. These notices
> > > > are also posted
> > > >
> > > > Now I thought of testing the same for Tamil texts. Before this, I would
> > > > like to add one more piece of information: prior to adding the code for
> > > > highlighting, I was able to search a Lucene index from the command line
> > > > using the raw Unicode text like this,
> > > > [...@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
> > > >
> > > > and it gives me the page that matches the above query. Now I tried to do
> > > > the same along with highlighting. So in the code I posted above you can see
> > > > that I commented out the English terms and added one Tamil Unicode query,
> > > > and tried to see if it gives me the same result that I was getting prior to
> > > > highlighting, and found that I'm not getting any results. This might be
> > > > because the query I'm forming using this Unicode text is wrong, or maybe
> > > > something else. I'm not able to figure out what exactly is going wrong.
> > > > Some silly mistake, I guess, but I'm still not able to find it. Can someone
> > > > take the pain to go through the above code and find out what's wrong?
> > > > Thank you very much.
> > > >
> > > > Thanks,
> > > > KK.
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com
> > >
> >
>
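
Robert Muir's point in the quoted thread about SimpleAnalyzer splitting the Tamil word can be checked with plain Java. The sketch below assumes that SimpleAnalyzer's tokenizer keeps only characters for which Character.isLetter returns true, which is how the letter-based tokenizer in Lucene of that era is understood to behave. LetterSplitSketch, letterRuns, and escape are hypothetical names for this illustration, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

final class LetterSplitSketch {

    // Hypothetical helper: emit maximal runs of chars accepted by
    // Character.isLetter; every other char ends the current token and is dropped.
    static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Render each char as a backslash-u escape so the result is readable
    // on consoles that cannot display Tamil.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            out.append(String.format("\\u%04X", (int) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String word = "\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE";
        List<String> escaped = new ArrayList<>();
        for (String token : letterRuns(word)) {
            escaped.add(escape(token));
        }
        System.out.println(escaped);
    }
}
```

The dependent vowel signs \u0BBF and \u0BBE are combining marks rather than letters, so Character.isLetter rejects them and the word falls apart into the three tokens listed in the thread. Splitting on whitespace alone would keep the word whole.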
