Hold on a second, the phrase that you included link to is not in the correct 
order of words!

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, July 20, 2009 2:07 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, This is not true.
the text you pasted is the following in unicode:

\N{HEBREW LETTER TET}
\N{HEBREW LETTER VAV}
\N{HEBREW POINT HOLAM}
\N{HEBREW LETTER BET}
\N{SPACE}
\N{HEBREW LETTER AYIN}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER RESH}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER BET}

you can use this utility to see how your text is encoded:
http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

For more information on directionality in unicode, see
http://unicode.org/reports/tr9/

On Mon, Jul 20, 2009 at 1:59 PM, OBender<osya_ben...@hotmail.com> wrote:
> Robert,
>
> I'm not sure you are correct on this one.
>
> If I have a Hebrew phrase:
> [טוֹב עֶרֶב]
> Then first token that filter receives is:
> [עֶרֶב] (0,5)
> and the second is:
> [טוֹב] (6,10)
> Which means that it counts from right to left (words and indexes).
>
> Am I missing something?
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Monday, July 20, 2009 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, I don't think its as difficult as you think. Your filter does
> not need to be aware of this issue at all.
>
> In unicode, right-to-left languages are encoded in the data in logical order.
> The rendering system is what converts it to display in right-to-left
> for RTL languages.
>
> For example in Arabic, "Robert 1234" displays as روبرت 1234
> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
> beh, waw, reh
>
> But the unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>
> 2009/7/20 OBender <osya_ben...@hotmail.com>:
>> Hi All!
>>
>>
>>
>> Let say I have a filter that produces new tokens based on the original ones.
>>
>> How bad will it be if my filter sets the start of each token to 0 and end to
>> the length of a token?
>>
>> An example (based on the phrase "How are you?":
>>
>>
>>
>> Original token:
>>
>> [you?] (8,12)
>>
>>
>>
>> New tokens:
>>
>> [you] (0,3)
>>
>> [?] (0,1)
>>
>>
>>
>> It wouldn't be so hard to calculate the right numbers for left to right
>> languages and it is a bit more challenging to do it for right to left ones
>> but for mixed text it is quite hard.
>>
>>
>>
>> Thanks.
>>
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to