Thank you all.
@Muir
Thanks for sharing your views. I'ld like to have some more details on the
process you mentioned as I've absolutely no idea on this highlighting
stuffs, could not make much out of our mail. Can you point me to some
tutorials/good write ups on the same, if you have some write ups on the
same, do give me the pointers. it'll help me a lot.
Pointers to the unicode default algorithms mentioned in your mail will be
equally helpful.

Thanks,
KK.

On Thu, May 21, 2009 at 8:03 PM, Robert Muir <rcm...@gmail.com> wrote:

> its definitely an area in lucene that could use some improvement.
>
> my recommendation for multilingual text is to apply the unicode "default"
> algorithms:
>
> Tokenize text according to UAX #29: unicode text segmentation
> Apply full case-folding (unicode ch. 3.13) with FC_NFKC closure
> Apply UAX #15: unicode normalization
>
> for now you will have to write code to do this, but i'm looking forward to
> contributing my implementation soon.
>
> i definitely feel your pain.
>
> On Thu, May 21, 2009 at 9:12 AM, Joel Halbert <j...@su3analytics.com>
> wrote:
>
> >
> > > If I index english pages
> > > with the same indexer, it will not take care of stemming and stop word
> > > removal?
> >
> > correct
> >
> >
> > > Cant we have a single indexer that handles non-eng and eng in
> > > equally good ways?
> >
> > You can have a single indexer, but, if you wanted to use one Analyzer for
> > English documents (with stemming/stops) and another analyzer for other
> > language documents
> > then you would need to know, at the point of both *indexing* and
> *querying*
> > what language your indexed document and your query were in.
> >
> > This makes the assumption that when a query is in English you only want
> to
> > query English lang docs, and vica versa.
> > You would also have to mark up your documents with a language identifier
> > (i.e. 0=English, 1=Other Languages) so that when you query you have a
> > conditional on the language.
> >
> >
> >
> > I've not had to deal with multi-language documents though - so I'm sure
> > others will be better placed to offer their experience.
> >
> >
> >
> > -----Original Message-----
> > From: KK <dioxide.softw...@gmail.com>
> > Reply-To: java-user@lucene.apache.org
> > To: java-user@lucene.apache.org
> > Subject: Re: hit highlighting in lucene ?
> > Date: Thu, 21 May 2009 18:31:44 +0530
> >
> > Initially I was using standardAnalyzer but I switched to simpleAnalyzer
> > which I guess doesnot do more that tokenizing[and may be tokenizing] and
> I
> > think this does not do stemming which I dont/cant do because I've no
> > stemmer
> > for the languages I'm indexing.
> > For indexing and querring I'm using the same SimpelAnalyzer. So as you
> say
> > I
> > can go for the standard highlighter api which I mentioned in my last
> mail,
> > and this will handle any language for highlighting support. I should
> start
> > using this one, right?
> >
> > One more thing. I've a single indexer and searcher that I'm usign for
> > indexing pages of many different non-english languages and as I mentioned
> > earier I'm using simpleAnalyzer, does that mean If I index english pages
> > with the same indexer, it will not take care of stemming and stop word
> > removal? But I dont want to have multiple indexer that is specific to
> > languages. Cant we have a single indexer that handles non-eng and eng in
> > equally good ways? Or any other ideas on the same ?
> >
> > Thanks,
> > KK.
> >
> > On Thu, May 21, 2009 at 6:18 PM, Joel Halbert <j...@su3analytics.com>
> > wrote:
> >
> > > The highlighter should be language independent. So long as you are
> > > consistent with your use of Analyzer between
> > > indexing/query/highlighting.
> > >
> > > As for the most appropriate Analyzer to use for your local language,
> > > this is a seperate question - especially if you are using stop word and
> > > stemming filters.
> > >
> > > The StandardAnalyzer is designed for English since it used the
> > > StopFilter (English words only).
> > >
> > >
> > > -----Original Message-----
> > > From: KK <dioxide.softw...@gmail.com>
> > > Reply-To: java-user@lucene.apache.org
> > > To: java-user@lucene.apache.org
> > > Subject: hit highlighting in lucene ?
> > > Date: Thu, 21 May 2009 17:51:13 +0530
> > >
> > > Hi All,
> > > I was looking for various ways of implementing hit highlighting in
> Lucene
> > > and found some standard classes that does support highlighting like
> this
> > > *lucene*.
> > apache.org/java/2_2_0/api/org/apache/*lucene*/search/*highlight*
> > > /package-summary.html<
> >
> http://apache.org/java/2_2_0/api/org/apache/*lucene*/search/*highlight*%0A/package-summary.html
> > >
> > >
> > > ik but what i believe is that this is only for english or does it
> support
> > > other languages. I actually wanted to support highlighting for some
> > > non-english languages which I'm able to index and fetch using utf-8
> > > encoding. So  this means that if I want to have highlighting then I've
> to
> > > get the utf-8 query and look for the same in the result and add apt
> tags
> > > whereever required, it essentially boils down to implementing the
> > standard
> > > highlighter. I think the standard highlighter also supports other
> > > languages.
> > > Correct me if i'm wrong.
> > >
> > > Due to my requirement constraints I'm using just simpleAnalyzer and we
> > dont
> > > have tokenizers for these regional languages. Any other ideas of doing
> > the
> > > same would be helpful as well.
> > >
> > > Thanks,
> > > KK.
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
>
> --
> Robert Muir
> rcm...@gmail.com
>

Reply via email to