Hold on a second, the phrase in the link you included does not have its words in the correct order!
-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, July 20, 2009 2:07 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender,

This is not true. The text you pasted is the following in Unicode:

\N{HEBREW LETTER TET}
\N{HEBREW LETTER VAV}
\N{HEBREW POINT HOLAM}
\N{HEBREW LETTER BET}
\N{SPACE}
\N{HEBREW LETTER AYIN}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER RESH}
\N{HEBREW POINT SEGOL}
\N{HEBREW LETTER BET}

You can use this utility to see how your text is encoded:
http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91

For more information on directionality in Unicode, see
http://unicode.org/reports/tr9/

On Mon, Jul 20, 2009 at 1:59 PM, OBender <osya_ben...@hotmail.com> wrote:
> Robert,
>
> I'm not sure you are correct on this one.
>
> If I have a Hebrew phrase:
> [טוֹב עֶרֶב]
> Then the first token that the filter receives is:
> [עֶרֶב] (0,5)
> and the second is:
> [טוֹב] (6,10)
> Which means that it counts from right to left (words and indexes).
>
> Am I missing something?
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Monday, July 20, 2009 1:43 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, I don't think it's as difficult as you think. Your filter does
> not need to be aware of this issue at all.
>
> In Unicode, right-to-left languages are encoded in the data in logical order.
> The rendering system is what converts it to display in right-to-left
> for RTL languages.
>
> For example in Arabic, "Robert 1234" displays as روبرت 1234
> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
> beh, waw, reh
>
> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>
> 2009/7/20 OBender <osya_ben...@hotmail.com>:
>> Hi All!
>>
>> Let's say I have a filter that produces new tokens based on the original ones.
>>
>> How bad will it be if my filter sets the start of each token to 0 and end to
>> the length of a token?
>>
>> An example (based on the phrase "How are you?"):
>>
>> Original token:
>>
>> [you?] (8,12)
>>
>> New tokens:
>>
>> [you] (0,3)
>>
>> [?] (0,1)
>>
>> It wouldn't be so hard to calculate the right numbers for left to right
>> languages and it is a bit more challenging to do it for right to left ones
>> but for mixed text it is quite hard.
>>
>> Thanks.
>
> --
> Robert Muir
> rcm...@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Robert Muir
rcm...@gmail.com
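
For anyone who wants to check the logical-order point from the 2:07 PM reply, here is a minimal sketch in plain Java. It does not use Lucene at all: the class name `LogicalOrderDemo` and the helper `tokenOffsets` are hypothetical, and the space-only split is an assumption standing in for a real tokenizer. The string is built from the code points listed in that reply, written as escapes so the editor's bidi rendering cannot reorder them.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalOrderDemo {
    // The phrase exactly as listed in the reply, in memory (logical) order:
    // TET, VAV, HOLAM, BET, SPACE, AYIN, SEGOL, RESH, SEGOL, BET.
    static final String PHRASE =
            "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";

    /**
     * Hypothetical helper: splits on U+0020 only and returns {start, end}
     * char offsets for each token, counted from the start of the string.
     */
    static List<int[]> tokenOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || text.charAt(i) == ' ';
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                offsets.add(new int[] {start, i});
                start = -1;
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Dump each char with its index and Unicode name.
        for (int i = 0; i < PHRASE.length(); i++) {
            char c = PHRASE.charAt(i);
            System.out.printf("%d U+%04X %s%n", i, (int) c, Character.getName(c));
        }
        // Prints (0,4) then (5,10): the word starting with TET comes first.
        for (int[] o : tokenOffsets(PHRASE)) {
            System.out.println("(" + o[0] + "," + o[1] + ")");
        }
    }
}
```

With this string the first token covers (0,4) and the second (5,10), counted left to right in memory regardless of how the text is displayed. The offsets reported in the 1:59 PM message, (0,5) and (6,10), are what the same split would give if the five-character word came first in memory, which fits the objection that the pasted phrase has its words in a different order from the one actually indexed.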