Obender, based on your previous comments (that you see text displayed in the wrong order), I again recommend that you enable support for RTL languages in your operating system, as I mentioned earlier... are you using a Windows-based OS, this is not enabled by default!
I think you are seeing things in the incorrect order, and this is causing confusion for you! On Mon, Jul 20, 2009 at 3:02 PM, Robert Muir<rcm...@gmail.com> wrote: > Obender, i ran your code and it did what I expected (but not what you pasted): > > First token is: (טוֹב,0,4) > Second token is: (עֶרֶב,5,10) > > I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results. > > On Mon, Jul 20, 2009 at 2:53 PM, OBender<osya_ben...@hotmail.com> wrote: >> Here is the simple code. If you run it with English and with Hebrew you will >> see that in case of English tokens returned from the left of the phrase to >> the right and with Hebrew from the right to the left. >> >> Again I'm talking about tokens not the individual letters here. >> >> public class XFilter extends TokenFilter >> { >> protected XFilter( TokenStream tokenStream ) { >> super( tokenStream ); >> } >> >> �...@override >> public Token next( final Token reusableToken ) throws IOException >> { >> Token nextToken = input.next( reusableToken ); >> System.out.println( nextToken != null? nextToken: "" ); >> return nextToken; >> } >> } >> >> public class SimpleWhitespaceAnalyzer extends Analyzer >> { >> �...@override >> public TokenStream tokenStream( final String fieldName, final Reader >> reader ) >> { >> TokenStream ts = new WhitespaceTokenizer( reader ); >> ts = new XFilter( ts ); >> >> return ts; >> } >> } >> >> -----Original Message----- >> From: Robert Muir [mailto:rcm...@gmail.com] >> Sent: Monday, July 20, 2009 2:26 PM >> To: java-user@lucene.apache.org >> Subject: Re: question on custom filter >> >> Obender, I think something in your environment / display environment >> might be causing some confusion. >> >> Are you using microsoft windows? If so, please verify that support for >> right-to-left languages is enabled [control panel/regional and >> language options]. It is possible you are "seeing something different" >> because your rendering system is not actually rendering right-to-left >> text in right-to-left direction!!!! >> >> Second, Instead of using a debugger, I would recommend using Luke to >> look at resulting tokens from your analyzer. >> >> On Mon, Jul 20, 2009 at 2:21 PM, OBender<osya_ben...@hotmail.com> wrote: >>> This is how it should be written: >>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91 >>> >>> -----Original Message----- >>> From: Robert Muir [mailto:rcm...@gmail.com] >>> Sent: Monday, July 20, 2009 2:07 PM >>> To: java-user@lucene.apache.org >>> Subject: Re: question on custom filter >>> >>> Obender, This is not true. >>> the text you pasted is the following in unicode: >>> >>> \N{HEBREW LETTER TET} >>> \N{HEBREW LETTER VAV} >>> \N{HEBREW POINT HOLAM} >>> \N{HEBREW LETTER BET} >>> \N{SPACE} >>> \N{HEBREW LETTER AYIN} >>> \N{HEBREW POINT SEGOL} >>> \N{HEBREW LETTER RESH} >>> \N{HEBREW POINT SEGOL} >>> \N{HEBREW LETTER BET} >>> >>> you can use this utility to see how your text is encoded: >>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91 >>> >>> For more information on directionality in unicode, see >>> http://unicode.org/reports/tr9/ >>> >>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<osya_ben...@hotmail.com> wrote: >>>> Robert, >>>> >>>> I'm not sure you are correct on this one. >>>> >>>> If I have a Hebrew phrase: >>>> [טוֹב עֶרֶב] >>>> Then first token that filter receives is: >>>> [עֶרֶב] (0,5) >>>> and the second is: >>>> [טוֹב] (6,10) >>>> Which means that it counts from right to left (words and indexes). >>>> >>>> Am I missing something? >>>> >>>> -----Original Message----- >>>> From: Robert Muir [mailto:rcm...@gmail.com] >>>> Sent: Monday, July 20, 2009 1:43 PM >>>> To: java-user@lucene.apache.org >>>> Subject: Re: question on custom filter >>>> >>>> Obender, I don't think its as difficult as you think. Your filter does >>>> not need to be aware of this issue at all. >>>> >>>> In unicode, right-to-left languages are encoded in the data in logical >>>> order. >>>> The rendering system is what converts it to display in right-to-left >>>> for RTL languages. >>>> >>>> For example in Arabic, "Robert 1234" displays as روبرت 1234 >>>> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh, >>>> beh, waw, reh >>>> >>>> But the unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4. >>>> >>>> 2009/7/20 OBender <osya_ben...@hotmail.com>: >>>>> Hi All! >>>>> >>>>> >>>>> >>>>> Let say I have a filter that produces new tokens based on the original >>>>> ones. >>>>> >>>>> How bad will it be if my filter sets the start of each token to 0 and end >>>>> to >>>>> the length of a token? >>>>> >>>>> An example (based on the phrase "How are you?": >>>>> >>>>> >>>>> >>>>> Original token: >>>>> >>>>> [you?] (8,12) >>>>> >>>>> >>>>> >>>>> New tokens: >>>>> >>>>> [you] (0,3) >>>>> >>>>> [?] (0,1) >>>>> >>>>> >>>>> >>>>> It wouldn't be so hard to calculate the right numbers for left to right >>>>> languages and it is a bit more challenging to do it for right to left ones >>>>> but for mixed text it is quite hard. >>>>> >>>>> >>>>> >>>>> Thanks. >>>>> >>>>> >>>> >>>> >>>> >>>> -- >>>> Robert Muir >>>> rcm...@gmail.com >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>> >>> >>> >>> -- >>> Robert Muir >>> rcm...@gmail.com >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> >> >> -- >> Robert Muir >> rcm...@gmail.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > > -- > Robert Muir > rcm...@gmail.com > -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org