Obender, I don't think it's as difficult as you think. Your filter does
not need to be aware of text direction at all.

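In Unicode, right-to-left text is stored in logical order, i.e. the
order in which it is read and typed. The rendering system is what
reorders it for right-to-left display. Token offsets index into that
logical-order text, so you calculate them the same way for RTL and
mixed text as you do for LTR.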
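
Because offsets count positions in the logical-order text, a sub-token's
offsets can be derived from the original token's start offset plus the
sub-token's position inside the token. Below is a minimal sketch of that
arithmetic in plain Java, using the "you?" example from your mail. It
does not use the Lucene API: the Tok class and the
splitTrailingPunctuation method are hypothetical names made up for
illustration, and it assumes the original token's end offset equals its
start plus its length.

```java
import java.util.ArrayList;
import java.util.List;

public class SubTokenOffsets {
    // Hypothetical holder for a token's text and offsets (not a Lucene class).
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + "] (" + start + "," + end + ")";
        }
    }

    // Splits trailing non-alphanumeric characters off a token.
    // New offsets are the original start plus the position inside the token,
    // so the same arithmetic applies to LTR, RTL, and mixed text.
    static List<Tok> splitTrailingPunctuation(Tok original) {
        String s = original.text;
        int cut = s.length();
        while (cut > 0 && !Character.isLetterOrDigit(s.charAt(cut - 1))) {
            cut--;
        }
        List<Tok> out = new ArrayList<>();
        if (cut > 0) {
            out.add(new Tok(s.substring(0, cut), original.start, original.start + cut));
        }
        if (cut < s.length()) {
            out.add(new Tok(s.substring(cut), original.start + cut, original.end));
        }
        return out;
    }

    public static void main(String[] args) {
        // "How are you?" -> the token "you?" spans offsets (8,12).
        for (Tok t : splitTrailingPunctuation(new Tok("you?", 8, 12))) {
            System.out.println(t);
        }
    }
}
```

For the token [you?] (8,12) this prints [you] (8,11) and [?] (11,12),
rather than the (0,3) and (0,1) you proposed.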

For example, in Arabic, "Robert 1234" displays as روبرت 1234
On your monitor, reading from left to right, this looks like
1, 2, 3, 4, space, teh, reh, beh, waw, reh.

But the Unicode text, in logical order, is reh, waw, beh, reh, teh,
space, 1, 2, 3, 4.
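
If you want to check the stored order yourself, here is a small sketch
that lists the characters of that string by index. The class and method
names are made up for illustration; Character.getName is the standard
JDK call (Java 7 and later). The string is written with \u escapes so a
bidi-aware editor does not reorder the source line.

```java
public class LogicalOrderDemo {
    // "Robert 1234" in Arabic: reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
    static final String TEXT = "\u0631\u0648\u0628\u0631\u062A 1234";

    // Returns one line per char: its index in the string and its Unicode name.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(i).append(": ")
              .append(Character.getName(s.charAt(i)))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(TEXT));
    }
}
```

Index 0 is reh and the digits start at index 6, regardless of how the
string is drawn on screen.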

2009/7/20 OBender <osya_ben...@hotmail.com>:
> Hi All!
>
>
>
> Let's say I have a filter that produces new tokens based on the original ones.
>
> How bad will it be if my filter sets the start of each token to 0 and end to
> the length of a token?
>
> An example (based on the phrase "How are you?"):
>
>
>
> Original token:
>
> [you?] (8,12)
>
>
>
> New tokens:
>
> [you] (0,3)
>
> [?] (0,1)
>
>
>
> It wouldn't be so hard to calculate the right numbers for left to right
> languages and it is a bit more challenging to do it for right to left ones
> but for mixed text it is quite hard.
>
>
>
> Thanks.
>
>



-- 
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
