Interesting. The question now is why I'm seeing what I'm seeing (even in println) :) I'm reading the string from a file that is UTF-8 encoded. Could that somehow be related?
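
One way to take the console and editor rendering out of the picture is to print each code point with its index instead of printing the string itself. The following is a minimal sketch, not code from the thread: the class name `CodePointDump` and the method `describe` are made up for illustration, and the phrase is written with Unicode escapes so the source file encoding does not matter. When the text comes from a file, it is also worth passing an explicit charset (for example `StandardCharsets.UTF_8` to an `InputStreamReader`) rather than relying on the platform default; whether that is related to what is seen here is not established by the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene or of the code in this thread.
final class CodePointDump {

    /** Lists each code point of text in logical (in-memory) order with its char index. */
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            String name = Character.getName(cp); // null if the code point is unassigned
            lines.add(String.format(Locale.ROOT, "%d U+%04X %s", i, cp, name));
            i += Character.charCount(cp);
        }
        return lines;
    }

    public static void main(String[] args) {
        // The Hebrew phrase from the thread, in the order Robert lists the code points
        // (TET, VAV, HOLAM, BET, SPACE, AYIN, SEGOL, RESH, SEGOL, BET).
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        for (String line : describe(phrase)) {
            System.out.println(line);
        }
    }
}
```

With this phrase, TET sits at index 0 and AYIN at index 5, which is consistent with the offsets (0,4) and (5,10) Robert reports below. If a dump of the string read from the file shows a different order, the difference is in the data rather than in the display.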
-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, July 20, 2009 3:03 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, I ran your code and it did what I expected (but not what you pasted):

First token is: (טוֹב,0,4)
Second token is: (עֶרֶב,5,10)

I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.

On Mon, Jul 20, 2009 at 2:53 PM, OBender<osya_ben...@hotmail.com> wrote:
> Here is the simple code. If you run it with English and with Hebrew you will
> see that in the case of English the tokens are returned from the left of the
> phrase to the right, and with Hebrew from the right to the left.
>
> Again, I'm talking about tokens, not the individual letters, here.
>
> public class XFilter extends TokenFilter
> {
>     protected XFilter( TokenStream tokenStream ) {
>         super( tokenStream );
>     }
>
>     @Override
>     public Token next( final Token reusableToken ) throws IOException
>     {
>         Token nextToken = input.next( reusableToken );
>         System.out.println( nextToken != null ? nextToken : "" );
>         return nextToken;
>     }
> }
>
> public class SimpleWhitespaceAnalyzer extends Analyzer
> {
>     @Override
>     public TokenStream tokenStream( final String fieldName, final Reader reader )
>     {
>         TokenStream ts = new WhitespaceTokenizer( reader );
>         ts = new XFilter( ts );
>
>         return ts;
>     }
> }
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Monday, July 20, 2009 2:26 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, I think something in your environment / display environment
> might be causing some confusion.
>
> Are you using Microsoft Windows? If so, please verify that support for
> right-to-left languages is enabled [control panel/regional and
> language options]. It is possible you are "seeing something different"
> because your rendering system is not actually rendering right-to-left
> text in right-to-left direction!!!!
>
> Second, instead of using a debugger, I would recommend using Luke to
> look at the resulting tokens from your analyzer.
>
> On Mon, Jul 20, 2009 at 2:21 PM, OBender<osya_ben...@hotmail.com> wrote:
>> This is how it should be written:
>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcm...@gmail.com]
>> Sent: Monday, July 20, 2009 2:07 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: question on custom filter
>>
>> Obender, this is not true.
>> The text you pasted is the following in Unicode:
>>
>> \N{HEBREW LETTER TET}
>> \N{HEBREW LETTER VAV}
>> \N{HEBREW POINT HOLAM}
>> \N{HEBREW LETTER BET}
>> \N{SPACE}
>> \N{HEBREW LETTER AYIN}
>> \N{HEBREW POINT SEGOL}
>> \N{HEBREW LETTER RESH}
>> \N{HEBREW POINT SEGOL}
>> \N{HEBREW LETTER BET}
>>
>> You can use this utility to see how your text is encoded:
>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91
>>
>> For more information on directionality in Unicode, see
>> http://unicode.org/reports/tr9/
>>
>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<osya_ben...@hotmail.com> wrote:
>>> Robert,
>>>
>>> I'm not sure you are correct on this one.
>>>
>>> If I have a Hebrew phrase:
>>> [טוֹב עֶרֶב]
>>> Then the first token that the filter receives is:
>>> [עֶרֶב] (0,5)
>>> and the second is:
>>> [טוֹב] (6,10)
>>> Which means that it counts from right to left (words and indexes).
>>>
>>> Am I missing something?
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>> Sent: Monday, July 20, 2009 1:43 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: question on custom filter
>>>
>>> Obender, I don't think it's as difficult as you think. Your filter does
>>> not need to be aware of this issue at all.
>>>
>>> In Unicode, right-to-left languages are encoded in the data in logical
>>> order.
>>> The rendering system is what converts it to display in right-to-left
>>> for RTL languages.
>>>
>>> For example in Arabic, "Robert 1234" displays as روبرت 1234
>>> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
>>> beh, waw, reh
>>>
>>> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>>>
>>> 2009/7/20 OBender <osya_ben...@hotmail.com>:
>>>> Hi All!
>>>>
>>>> Let's say I have a filter that produces new tokens based on the original
>>>> ones.
>>>>
>>>> How bad will it be if my filter sets the start of each token to 0 and the
>>>> end to the length of the token?
>>>>
>>>> An example (based on the phrase "How are you?"):
>>>>
>>>> Original token:
>>>>
>>>> [you?] (8,12)
>>>>
>>>> New tokens:
>>>>
>>>> [you] (0,3)
>>>>
>>>> [?] (0,1)
>>>>
>>>> It wouldn't be so hard to calculate the right numbers for left-to-right
>>>> languages, and it is a bit more challenging to do it for right-to-left ones,
>>>> but for mixed text it is quite hard.
>>>>
>>>> Thanks.
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
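
Regarding the original question about offsets for tokens derived from a parent token: because the text is stored in logical order, a sub-token's offsets can be computed as the parent's start offset plus the position inside the parent term, with no special handling for right-to-left or mixed text. The following is a rough sketch under that assumption. It is plain Java and deliberately does not use the Lucene API; `SubTokenOffsets`, `Span`, `isWordChar`, and `split` are hypothetical names introduced only for this illustration, and the offsets are Java `char` indexes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not a Lucene filter and not code from the thread.
final class SubTokenOffsets {

    /** Hypothetical holder for a sub-token and its offsets in the original text. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + "] (" + start + "," + end + ")";
        }
    }

    /** Letters, digits, and combining marks (such as Hebrew points) count as word characters. */
    static boolean isWordChar(int cp) {
        int type = Character.getType(cp);
        return Character.isLetterOrDigit(cp)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Splits term into runs of word and non-word characters, keeping offsets relative to parentStart. */
    static List<Span> split(String term, int parentStart) {
        List<Span> out = new ArrayList<>();
        int runStart = 0;
        while (runStart < term.length()) {
            boolean word = isWordChar(term.codePointAt(runStart));
            int i = runStart;
            // Advance to the end of the current run, one code point at a time.
            while (i < term.length() && isWordChar(term.codePointAt(i)) == word) {
                i += Character.charCount(term.codePointAt(i));
            }
            out.add(new Span(term.substring(runStart, i), parentStart + runStart, parentStart + i));
            runStart = i;
        }
        return out;
    }

    public static void main(String[] args) {
        // The example from the original question: [you?] (8,12)
        System.out.println(split("you?", 8)); // prints [[you] (8,11), [?] (11,12)]
    }
}
```

The same call on a Hebrew term followed by a question mark, starting at offset 5, yields a five-char word span (5,10) and a punctuation span (10,11), since the punctuation follows the word in logical order regardless of where it is drawn on screen.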