RE: question on custom filter

OBender Mon, 20 Jul 2009 11:19:50 -0700

Hold on a second, the phrase that you included link to is not in the correct 
order of words!


-----Original Message-----
From: Robert Muir [mailto:[email protected]] 
Sent: Monday, July 20, 2009 2:07 PM
To: [email protected]
Subject: Re: question on custom filter

Obender, This is not true.
the text you pasted is the following in unicode:

\N{HEBREW LETTER TET}
\N{HEBREW LETTER VAV}
\N{HEBREW POINT HOLAM}
\N{HEBREW LETTER BET}
\N{SPACE}
\N{HEBREW LETTER AYIN}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER RESH}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER BET}

you can use this utility to see how your text is encoded:
http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

For more information on directionality in unicode, see
http://unicode.org/reports/tr9/

On Mon, Jul 20, 2009 at 1:59 PM, OBender<[email protected]> wrote:
> Robert,
>
> I'm not sure you are correct on this one.
>
> If I have a Hebrew phrase:
> [טוֹב עֶרֶב]
> Then first token that filter receives is:
> [עֶרֶב] (0,5)
> and the second is:
> [טוֹב] (6,10)
> Which means that it counts from right to left (words and indexes).
>
> Am I missing something?
>
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Monday, July 20, 2009 1:43 PM
> To: [email protected]
> Subject: Re: question on custom filter
>
> Obender, I don't think its as difficult as you think. Your filter does
> not need to be aware of this issue at all.
>
> In unicode, right-to-left languages are encoded in the data in logical order.
> The rendering system is what converts it to display in right-to-left
> for RTL languages.
>
> For example in Arabic, "Robert 1234" displays as روبرت 1234
> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
> beh, waw, reh
>
> But the unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>
> 2009/7/20 OBender <[email protected]>:
>> Hi All!
>>
>>
>>
>> Let say I have a filter that produces new tokens based on the original ones.
>>
>> How bad will it be if my filter sets the start of each token to 0 and end to
>> the length of a token?
>>
>> An example (based on the phrase "How are you?":
>>
>>
>>
>> Original token:
>>
>> [you?] (8,12)
>>
>>
>>
>> New tokens:
>>
>> [you] (0,3)
>>
>> [?] (0,1)
>>
>>
>>
>> It wouldn't be so hard to calculate the right numbers for left to right
>> languages and it is a bit more challenging to do it for right to left ones
>> but for mixed text it is quite hard.
>>
>>
>>
>> Thanks.
>>
>>
>
>
>
> --
> Robert Muir
> [email protected]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>



-- 
Robert Muir
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

RE: question on custom filter

Reply via email to