On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <[email protected]> wrote:
> So I guess I have two questions:
>    1. Is there some way to do filtering to the text before
> tokenisation without upsetting the offsets reported by the tokeniser?
>    2. Is there some more general solution to this problem, such as an
> existing tokeniser similar to StandardTokeniser but with better
> Unicode awareness?
>

Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
you pass Version.LUCENE_31 to get the new behavior).
It implements the UAX#29 algorithm, which respects canonical
equivalence... it sounds like that's what you want.
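For a quick check, a minimal sketch against the 3.1 API would look
something like the following (the class name and sample string are
just placeholders for your own input):

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class Uax29Demo {
    public static void main(String[] args) throws Exception {
        // Passing Version.LUCENE_31 selects the new UAX#29-based
        // grammar; older Version constants keep the legacy behavior.
        StandardTokenizer tokenizer = new StandardTokenizer(
                Version.LUCENE_31, new StringReader("sample text to tokenize"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Offsets point back into the original reader input.
            System.out.println(term.toString() + " ["
                    + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        tokenizer.end();
        tokenizer.close();
    }
}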

http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java

