Re: Lucene Retrieve Previous and Next Tokens At Analyzed Index

Furkan KAMACI Fri, 28 Feb 2014 01:13:28 -0800

Hi Uwe;

I will try your approach. I had considered using the highlighter for this
purpose: once the corpus is indexed, I can search it again and again, so I
can generate models for any other terms whenever I want. However,
analysis-time processing would work for me too.
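
For reference, a minimal sketch of the delayed-emission buffering described
in the quoted reply below. It is plain Java and does not use any Lucene
classes: tokens are plain strings, and the names ContextWindowSketch, Window
and collect are hypothetical. In a real TokenFilter the buffered entries
would presumably be states saved with captureState() and replayed with
restoreState() rather than strings. The sketch only illustrates the
windowing logic: a token is examined once `size` tokens of lookahead have
been read (or the input has ended), and at most `size` previous tokens are
kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java stand-in for a buffering filter; not Lucene API. */
final class ContextWindowSketch {

    /** A matched token with up to `size` tokens of context on each side. */
    static final class Window {
        final List<String> previous;
        final String term;
        final List<String> next;

        Window(List<String> previous, String term, List<String> next) {
            this.previous = previous;
            this.term = term;
            this.next = next;
        }

        @Override
        public String toString() {
            return previous + " [" + term + "] " + next;
        }
    }

    /** Streams over `input` and returns a window for every token starting with `prefix`. */
    static List<Window> collect(Iterator<String> input, String prefix, int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be >= 0");
        }
        List<Window> result = new ArrayList<>();
        // Holds at most 2 * size + 1 tokens: previous tokens, the center, lookahead.
        List<String> buffer = new ArrayList<>();
        int center = 0; // index in `buffer` of the next token to examine

        while (input.hasNext() || center < buffer.size()) {
            // Delay: read ahead until `size` tokens follow the center (or input ends).
            while (input.hasNext() && buffer.size() - center - 1 < size) {
                buffer.add(input.next());
            }
            String term = buffer.get(center);
            if (term.startsWith(prefix)) {
                result.add(new Window(
                        new ArrayList<>(buffer.subList(0, center)),
                        term,
                        new ArrayList<>(buffer.subList(center + 1, buffer.size()))));
            }
            if (center < size) {
                center++;         // still filling the "previous" side
            } else {
                buffer.remove(0); // slide: drop the oldest token (buffer is small)
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "kalem", "c", "d", "e", "kaleci", "f");
        for (Window w : collect(tokens.iterator(), "kale", 2)) {
            System.out.println(w);
        }
    }
}
```

With size = 10 this would give the previous 10 and next 10 tokens around
each match; what is then computed from each window (for example the
entropy) is left open here.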


Thanks;
Furkan KAMACI


2014-02-28 10:30 GMT+02:00 Uwe Schindler <u...@thetaphi.de>:

> Hi,
>
> You can do that in your TokenFilter by buffering tokens using
> captureState() or cloneAttributes() and storing them in a circular buffer
> of size 20 (or similar). Emitting tokens to the consumer is then
> "delayed": once you have collected 20 tokens from the "input" tokenfilter,
> do your analysis on them and emit the modified tokens from the beginning of
> the buffer to the consumer. There is no sample code available, but this
> should be possible. But: we have lots of filters that put one *single*
> token away and emit it later (most stemmers do this to emit the original
> and the stemmed token as 2 tokens). These can be used as a base for an
> algorithm that puts away multiple tokens.
>
> Highlighter is not applicable here, as it works when querying a Lucene
> index, not while indexing.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Furkan KAMACI [mailto:furkankam...@gmail.com]
> > Sent: Friday, February 28, 2014 9:23 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene Retrieve Previous and Next Tokens At Analyzed Index
> >
> > Hi;
> >
> > I want to implement a stemming algorithm for NLP purposes. I am analyzing
> > Turkish. Turkish is a language in which stemming is not easy: in many
> > cases you can only *predict* the "root form" of a given word with the help
> > of context. I will implement a basic algorithm, then change the conditions
> > and compare the results (I will not use a library for this, as it is
> > academic research).
> >
> > I will take the previous 10 tokens and the next 10 tokens of any word that
> > starts with a given prefix, such as kale*, and calculate the entropy to
> > guess the root form of that word. In other words, I will resolve the
> > ambiguity.
> >
> > Maybe Highlighter can do what I want, if I can tell it: get the previous
> > 10 and next 10 tokens of the matched term?
> >
> > Thanks;
> > Furkan KAMACI
> >
> >
> > 2014-02-28 9:06 GMT+02:00 pravesh <suyalprav...@yahoo.com>:
> >
> > > Hi,
> > > A few more details would help. Any examples? Also, what is the use case
> > > for this?
> > >
> > >
> > > Regards
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Lucene-Retrieve-Previous-and-Next-Tokens-At-Analyzed-Index-tp4120076p4120340.html
> > > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
