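OffsetAttribute may be what you are looking for: it exposes each token's start and end character offsets in the original text. Alternatively, consider using the stock ShingleFilter. Wherever the position increment is greater than 1, it puts a filler token ("_", underscore, by default) into the shingle, so you can then filter out every token that contains an underscore. I suspect that will be easier.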
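If you do want an absolute token position, it can be derived as the running sum of the position increments. Here is a minimal sketch of that idea, and of breaking two-word shingles at increment gaps. It is plain Java with no Lucene dependency: the Token class and both method names are hypothetical stand-ins, and the positionIncrement field represents the value you would read from PositionIncrementAttribute.getPositionIncrement() after each incrementToken() call. It also assumes an upstream filter has already turned the comma into a gap (increment 2).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionSketch {

    // Hypothetical stand-in for one token: its text plus the value that
    // PositionIncrementAttribute.getPositionIncrement() would report for it.
    public static final class Token {
        final String term;
        final int positionIncrement;

        public Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Absolute position is the running sum of increments. Starting at -1 puts
    // a first token with increment 1 at position 0.
    public static List<Integer> positions(List<Token> tokens) {
        List<Integer> out = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            out.add(position);
        }
        return out;
    }

    // Two-word shingles that never span a gap: a pair is emitted only when the
    // second token directly follows the first (increment == 1). Tokens with
    // increment 0 (stacked synonyms) are simply not paired in this sketch.
    public static List<String> bigramsWithoutGaps(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).positionIncrement == 1) {
                out.add(tokens.get(i - 1).term + " " + tokens.get(i).term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "token number one, word two", assuming the comma became a gap.
        List<Token> tokens = Arrays.asList(
                new Token("token", 1), new Token("number", 1), new Token("one", 1),
                new Token("word", 2), new Token("two", 1));
        System.out.println(positions(tokens));          // [0, 1, 2, 4, 5]
        System.out.println(bigramsWithoutGaps(tokens)); // [token number, number one, word two]
    }
}
```

Inside a real TokenFilter the same counter would live in a field, be updated in incrementToken(), and be cleared in reset().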
On Jan 11, 2013, at 7:14 AM, Igal @ getRailo.org <[email protected]> wrote:

> hi all,
>
> how can I get the Token's Position from the TokenStream / Tokenizer /
> Analyzer ? I know that there's a TokenPositionIncrement Attribute and a
> TokenPositionLength Attribute, but is there an easy way to get the token
> position or do I need to implement my own attribute by adding one of the
> attributes mentioned above?
>
> the reason I need it is that I wrote an implementation of a ShingleFilter
> which breaks shingles at punctuations so the tokens [token number one, word
> two] will create the shingles [ "token number", "number one", "word two" ] --
> but Not [ "one word" ] because of the comma. I want it to break shingles at
> increment gaps as well.
>
> thanks,
>
> Igal

---
Denis Bazhenov <[email protected]>
