On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <[email protected]> wrote:
> So I guess I have two questions:
>    1. Is there some way to do filtering to the text before
> tokenisation without upsetting the offsets reported by the tokeniser?
>    2. Is there some more general solution to this problem, such as an
> existing tokeniser similar to StandardTokeniser but with better
> Unicode awareness?
>

Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
you pass Version.LUCENE_31 to get the new behavior).
It implements the UAX#29 algorithm, which respects canonical
equivalence... it sounds like that's what you want.
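For a quick check, a minimal sketch against the 3.1 API would look
something like the following (the class name and sample string are
just placeholders for your own input):

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class Uax29Demo {
    public static void main(String[] args) throws Exception {
        // Passing Version.LUCENE_31 selects the new UAX#29-based
        // grammar; older Version constants keep the legacy behavior.
        StandardTokenizer tokenizer = new StandardTokenizer(
                Version.LUCENE_31, new StringReader("sample text to tokenize"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Offsets point back into the original reader input.
            System.out.println(term.toString() + " ["
                    + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        tokenizer.end();
        tokenizer.close();
    }
}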

http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java

