Re: How to support stemming and case folding for english content mixed with non-english content?

Robert Muir Fri, 05 Jun 2009 05:01:22 -0700

i think you are on the right track... once you build your analyzer, put it
in your classpath and play around with it in luke and see if it does what
you want.


On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.softw...@gmail.com> wrote:

> Hi Robert,
> This is what I copied from ThaiAnalyzer @ lucene contrib
>
> public class ThaiAnalyzer extends Analyzer {
>  public TokenStream tokenStream(String fieldName, Reader reader) {
>      TokenStream ts = new StandardTokenizer(reader);
>    ts = new StandardFilter(ts);
>    ts = new ThaiWordFilter(ts);
>    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>    return ts;
>  }
> }
>
> Now as you said, I've to use whitespacetokenizer
> withworddelimitefilter[solr
> nightly.jar] stop wordremoval, porter stemmer etc , so it is something like
> this,
> public class IndicAnalyzer extends Analyzer {
>  public TokenStream tokenStream(String fieldName, Reader reader) {
>   TokenStream ts = new WhiteSpaceTokenizer(reader);
>   ts = new WordDelimiterFilter(ts);
>   ts = new LowerCaseFilter(ts);
>   ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS)   // english
> stop filter, is this the default one?
>   ts = new PorterFilter(ts);
>   return ts;
>  }
> }
>
> Does this sound OK? I think it will do the job...let me try it out..
> I dont need custom filter as per my requirement, at least not for these
> basic things I'm doing? I think so...
>
> Thanks,
> KK.
>
>
> On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcm...@gmail.com> wrote:
>
> > KK well you can always get some good examples from the lucene contrib
> > codebase.
> > For example, look at the DutchAnalyzer, especially:
> >
> > TokenStream tokenStream(String fieldName, Reader reader)
> >
> > See how it combines a specified tokenizer with various filters? this is
> > what
> > you want to do, except of course you want to use different tokenizer and
> > filters.
> >
> > On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.softw...@gmail.com> wrote:
> >
> > > Thanks Muir.
> > > Thanks for letting me know that I dont need language identifiers.
> > >  I'll have a look and will try to write the analyzer. For my case I
> think
> > > it
> > > wont be that difficult.
> > > BTW, can you point me to some sample codes/tutorials writing custom
> > > analyzers. I could not find something in LIA2ndEdn. Is something htere?
> > do
> > > let me know.
> > >
> > > Thanks,
> > > KK.
> > >
> > >
> > >
> > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcm...@gmail.com> wrote:
> > >
> > > > KK, for your case, you don't really need to go to the effort of
> > detecting
> > > > whether fragments are english or not.
> > > > Because the English stemmers in lucene will not modify your Indic
> text,
> > > and
> > > > neither will the LowerCaseFilter.
> > > >
> > > > what you want to do is create a custom analyzer that works like this
> > > >
> > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly
> jar],
> > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > >
> > > > Thanks,
> > > > Robert
> > > >
> > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.softw...@gmail.com>
> wrote:
> > > >
> > > > > Thank you all.
> > > > > To be frank I was using Solr in the begining half a month ago. The
> > > > > problem[rather bug] with solr was creation of new index on the fly.
> > > > Though
> > > > > they have a restful method for teh same, but it was not working. If
> I
> > > > > remember properly one of Solr commiter "Noble Paul"[I dont know his
> > > real
> > > > > name] was trying to help me. I tried many nightly builds and
> spending
> > a
> > > > > couple of days stuck at that made me think of lucene and I switched
> > to
> > > > it.
> > > > > Now after working with lucene which gives you full control of
> > > everything
> > > > I
> > > > > don't want to switch to Solr.[LOL, to me Solr:Lucene is similar to
> > > > > Window$:Linux, its my view only, though]. Coming back to the point
> as
> > > Uwe
> > > > > mentioned that we can do the same thing in lucene as well, what is
> > > > > available
> > > > > in Solr, Solr is based on Lucene only, right?
> > > > > I request Uwe to give me some more ideas on using the analyzers
> from
> > > solr
> > > > > that will do the job for me, handling a mix of both english and
> > > > non-english
> > > > > content.
> > > > > Muir, can you give me a bit detail description of how to use the
> > > > > WordDelimiteFilter to do my job.
> > > > > On a side note, I was thingking of writing a simple analyzer that
> > will
> > > do
> > > > > the following,
> > > > > #. If the webpage fragment is non-english[for me its some indian
> > > > language]
> > > > > then index them as such, no stemming/ stop word removal to begin
> > with.
> > > As
> > > > I
> > > > > know its in UCN unicode something like
> \u0021\u0012\u34ae\u0031[just
> > a
> > > > > sample]
> > > > > # If the fragment is english then apply standard anlyzing process
> for
> > > > > english content. I've not thought of quering in the same way as of
> > now
> > > > i.e
> > > > > mix of non-english and engish words.
> > > > > Now to get all this,
> > > > >  #1. I need some sort of way which will let me know if the content
> is
> > > > > english or not. If not english just add the tokens to the document.
> > Do
> > > we
> > > > > really need language identifiers, as i dont have any other content
> > that
> > > > > uses
> > > > > the same script as english other than those \u1234 things for my
> > indian
> > > > > language content. Any smart hack/trick for the same?
> > > > >  #2. If the its english apply all normal process and add the
> stemmed
> > > > token
> > > > > to document.
> > > > > For all this I was thinking of iterating earch word of the web page
> > and
> > > > > apply the above procedure. And finallyadd  the newly created
> document
> > > to
> > > > > the
> > > > > index.
> > > > >
> > > > > I would like some one to guide me in this direction. I'm pretty
> > people
> > > > must
> > > > > have done similar/same thing earlier, I request them to guide me/
> > point
> > > > me
> > > > > to some tutorials for the same.
> > > > > Else help me out writing a custom analyzer only if thats not going
> to
> > > be
> > > > > too
> > > > > complex. LOL, I'm a new user to lucene and know basics of Java
> > coding.
> > > > > Thank you very much.
> > > > >
> > > > > --KK.
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com>
> > wrote:
> > > > >
> > > > > > yes this is true. for starters KK, might be good to startup solr
> > and
> > > > look
> > > > > > at
> > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > >
> > > > > > if you want to stick with lucene, the WordDelimiterFilter is the
> > > piece
> > > > > you
> > > > > > will want for your text, mainly for punctuation but also for
> format
> > > > > > characters such as ZWJ/ZWNJ.
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <u...@thetaphi.de>
> > > wrote:
> > > > > >
> > > > > > > You can also re-use the solr analyzers, as far as I found out.
> > > There
> > > > is
> > > > > > an
> > > > > > > issue in jIRA/discussion on java-dev to merge them.
> > > > > > >
> > > > > > > -----
> > > > > > > Uwe Schindler
> > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > http://www.thetaphi.de
> > > > > > > eMail: u...@thetaphi.de
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Robert Muir [mailto:rcm...@gmail.com]
> > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > Subject: Re: How to support stemming and case folding for
> > english
> > > > > > content
> > > > > > > > mixed with non-english content?
> > > > > > > >
> > > > > > > > KK, ok, so you only really want to stem the english. This is
> > > good.
> > > > > > > >
> > > > > > > > Is it possible for you to consider using solr? solr's default
> > > > > analyzer
> > > > > > > for
> > > > > > > > type 'text' will be good for your case. it will do the
> > following
> > > > > > > > 1. tokenize on whitespace
> > > > > > > > 2. handle both indian language and english punctuation
> > > > > > > > 3. lowercase the english.
> > > > > > > > 4. stem the english.
> > > > > > > >
> > > > > > > > try a nightly build,
> > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <
> dioxide.softw...@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Muir, thanks for your response.
> > > > > > > > > I'm indexing indian language web pages which has got
> descent
> > > > amount
> > > > > > of
> > > > > > > > > english content mixed with therein. For the time being I'm
> > not
> > > > > going
> > > > > > to
> > > > > > > > use
> > > > > > > > > any stemmers as we don't have standard stemmers for indian
> > > > > languages
> > > > > > .
> > > > > > > > So
> > > > > > > > > what I want to do is like this,
> > > > > > > > > Say I've a web page having hindi content with 5% english
> > > content.
> > > > > > Then
> > > > > > > > for
> > > > > > > > > hindi I want to use the basic white space analyzer as we
> dont
> > > > have
> > > > > > > > stemmers
> > > > > > > > > for this as I mentioned earlier and whereever english
> appears
> > I
> > > > > want
> > > > > > > > them
> > > > > > > > > to
> > > > > > > > > be stemmed tokenized etc[the standard process used for
> > english
> > > > > > > content].
> > > > > > > > As
> > > > > > > > > of now I'm using whitespace analyzer for the full content
> > which
> > > > > > doesnot
> > > > > > > > > support case folding, stemming etc for teh content. So if
> > there
> > > > is
> > > > > an
> > > > > > > > > english word say "Detection" indexed as such then searching
> > for
> > > > > > > > detection
> > > > > > > > > or
> > > > > > > > > detect is not giving any results, which is the expected
> > > behavior,
> > > > > but
> > > > > > I
> > > > > > > > > want
> > > > > > > > > this kind of queries to give results.
> > > > > > > > > I hope I made it clear. Let me know any ideas on doing the
> > > same.
> > > > > And
> > > > > > > one
> > > > > > > > > more thing, I'm storing the full webpage content under a
> > single
> > > > > > field,
> > > > > > > I
> > > > > > > > > hope this will not make any difference, right?
> > > > > > > > > It seems I've to use language identifiers, but do we really
> > > need
> > > > > > that?
> > > > > > > > > Because we've only non-english content mixed with
> english[and
> > > not
> > > > > > > french
> > > > > > > > or
> > > > > > > > > russian etc].
> > > > > > > > >
> > > > > > > > > What is the best way of approaching the problem? Any
> > thoughts!
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > KK.
> > > > > > > > >
> > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <
> > rcm...@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > KK, is all of your latin script text actually english? Is
> > > there
> > > > > > stuff
> > > > > > > > > like
> > > > > > > > > > german or french mixed in?
> > > > > > > > > >
> > > > > > > > > > And for your non-english content (your examples have been
> > > > indian
> > > > > > > > writing
> > > > > > > > > > systems), is it generally true that if you had
> devanagari,
> > > you
> > > > > can
> > > > > > > > assume
> > > > > > > > > > its hindi? or is there stuff like marathi mixed in?
> > > > > > > > > >
> > > > > > > > > > Reason I say this is to invoke the right stemmers, you
> > really
> > > > > need
> > > > > > > > some
> > > > > > > > > > language detection, but perhaps in your case you can
> cheat
> > > and
> > > > > > detect
> > > > > > > > > this
> > > > > > > > > > based on scripts...
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Robert
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <
> > > > dioxide.softw...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > > I'm indexing some non-english content. But the page
> also
> > > > > contains
> > > > > > > > > english
> > > > > > > > > > > content. As of now I'm using WhitespaceAnalyzer for all
> > > > content
> > > > > > and
> > > > > > > > I'm
> > > > > > > > > > > storing the full webpage content under a single filed.
> > Now
> > > we
> > > > > > > > require
> > > > > > > > > to
> > > > > > > > > > > support case folding and stemmming for the english
> > content
> > > > > > > > intermingled
> > > > > > > > > > > with
> > > > > > > > > > > non-english content. I must metion that we dont have
> > > stemming
> > > > > and
> > > > > > > > case
> > > > > > > > > > > folding for these non-english content. I'm stuck with
> > this.
> > > > > Some
> > > > > > > one
> > > > > > > > do
> > > > > > > > > > let
> > > > > > > > > > > me know how to proceed for fixing this issue.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > KK.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Robert Muir
> > > > > > > > > > rcm...@gmail.com
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Robert Muir
> > > > > > > > rcm...@gmail.com
> > > > > > >
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail:
> java-user-unsubscr...@lucene.apache.org
> > > > > > > For additional commands, e-mail:
> > java-user-h...@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcm...@gmail.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcm...@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>



-- 
Robert Muir
rcm...@gmail.com

Re: How to support stemming and case folding for english content mixed with non-english content?

Reply via email to