Hi Uwe; I will try your approach. I had considered using the Highlighter for this purpose: once the corpus is indexed, I can search it again and again and generate models for other terms whenever I want. However, processing at analysis time works for me too.
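The delayed-emit buffering that Uwe describes in the quoted message can be modelled in plain Java. This is only a sketch of the logic, not a Lucene TokenFilter: tokens are plain Strings, and the class name `DelayedWindow` and its methods `accept` and `finish` are made up for illustration. In a real filter, the buffered items would presumably be the states returned by captureState(), replayed with restoreState() when a token is finally emitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Plain-Java model of "buffer, then emit delayed". Not a Lucene TokenFilter:
// tokens are Strings here, where a real filter would store captured states.
class DelayedWindow {
    private final int radius;                                   // neighbours kept on each side
    private final Deque<String> previous = new ArrayDeque<>();  // last tokens already emitted
    private final Deque<String> pending = new ArrayDeque<>();   // tokens read but not yet emitted

    public DelayedWindow(int radius) {
        this.radius = radius;
    }

    // Take one token from upstream. Returns the token that now has a full
    // right-hand context (formatted with its neighbours), or an empty list.
    public List<String> accept(String token) {
        pending.addLast(token);
        List<String> out = new ArrayList<>();
        if (pending.size() > radius) {
            out.add(emitOldest());
        }
        return out;
    }

    // End of input: emit the held-back tokens with whatever context is left.
    public List<String> finish() {
        List<String> out = new ArrayList<>();
        while (!pending.isEmpty()) {
            out.add(emitOldest());
        }
        return out;
    }

    // Emit the oldest pending token as "[previous] token [next]".
    private String emitOldest() {
        String current = pending.removeFirst();
        String line = previous + " " + current + " " + pending;
        previous.addLast(current);
        if (previous.size() > radius) {
            previous.removeFirst();   // forget tokens that fell out of the window
        }
        return line;
    }

    public static void main(String[] args) {
        DelayedWindow window = new DelayedWindow(10);   // 10 before + 10 after
        for (String token : "bir kalede on kalem vardi".split(" ")) {
            window.accept(token).forEach(System.out::println);
        }
        window.finish().forEach(System.out::println);
    }
}
```

With a radius of 10, each emitted token carries up to 10 previous and 10 next tokens, which matches the buffer of about 20 mentioned in the quoted message. The entropy calculation would run on each emitted window; it is left out here on purpose.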

Thanks;
Furkan KAMACI

2014-02-28 10:30 GMT+02:00 Uwe Schindler <u...@thetaphi.de>:

> Hi,
>
> You can do that in your TokenFilter by buffering tokens using
> captureState() or cloneAttributes() and storing them in a circular buffer
> of size=20 (or thereabouts). Emitting tokens to the consumer is then done
> "delayed": once you have collected 20 tokens from the "input" TokenFilter,
> do your analysis on them and emit the modified tokens from the beginning
> of the buffer to the consumer. There is no sample code available, but this
> should be possible to do. But: we have lots of filters that put one
> *single* token away and emit it later (most stemmers do this to emit the
> original and stemmed token as 2 tokens). This can be used as a base for an
> algorithm that puts away multiple tokens.
>
> Highlighter is not applicable here, as it works when querying a Lucene
> index, not while indexing.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Furkan KAMACI [mailto:furkankam...@gmail.com]
> > Sent: Friday, February 28, 2014 9:23 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene Retrieve Previous and Next Tokens At Analyzed Index
> >
> > Hi;
> >
> > I want to implement a stemming algorithm for an NLP purpose. I am
> > analyzing the Turkish language. Turkish is a language in which stemming
> > is not easy. In many cases you can only *predict* the "root form" of a
> > given word with the help of context. I will implement a basic algorithm
> > and then change conditions and compare results (I will not use a library
> > for my purpose; this is academic research).
> >
> > I will take the previous 10 tokens and the next 10 tokens of a word
> > that starts with a given word, like *kale*. I will calculate the entropy
> > to guess the root form of a given word. I mean I will resolve the
> > ambiguity.
> >
> > Maybe Highlighter can do what I want if I can say: get the previous 10
> > and next 10 tokens of the matched term?
> >
> > Thanks;
> > Furkan KAMACI
> >
> > 2014-02-28 9:06 GMT+02:00 pravesh <suyalprav...@yahoo.com>:
> >
> > > Hi,
> > > A few more details would help. Any examples? Also, what
> > > is the use-case for this?
> > >
> > > Regards
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Lucene-Retrieve-Previous-and-Next-Tokens-At-Analyzed-Index-tp4120076p4120340.html
> > > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org