Obender, based on your previous comments (that you see text displayed
in the wrong order), I again recommend that you enable support for RTL
languages in your operating system, as I mentioned earlier. If you are
using a Windows-based OS, this is not enabled by default!

I think you are seeing things in the incorrect order, and this is
causing confusion for you!
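
One way to check this without depending on how your console or editor
renders Hebrew is to look at the characters in storage order. Below is a
minimal sketch in plain Java (no Lucene involved). The class name
LogicalOrderDemo and its two helper methods are names I made up for
illustration, and the whitespace splitting is only a stand-in for what
WhitespaceTokenizer does, not its actual implementation. The phrase is
written with Unicode escapes so that nothing can reorder it on the way in:

```java
import java.util.ArrayList;
import java.util.List;

final class LogicalOrderDemo {

    /** Splits on whitespace and reports each token as (text,start,end) using char offsets. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // -1 means we are not currently inside a token
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i; // first character of a new token
            } else if (boundary && start >= 0) {
                tokens.add("(" + text.substring(start, i) + "," + start + "," + i + ")");
                start = -1;
            }
        }
        return tokens;
    }

    /** Returns the Unicode character names in storage (logical) order. */
    static List<String> codePointNames(String text) {
        List<String> names = new ArrayList<>();
        text.codePoints().forEach(cp -> names.add(Character.getName(cp)));
        return names;
    }

    public static void main(String[] args) {
        // The Hebrew phrase from this thread, spelled with escapes in logical order.
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";

        // Character names are plain ASCII, so they display the same on any console.
        for (String name : codePointNames(phrase)) {
            System.out.println(name);
        }

        // Offsets count from the start of the stored string, not from the left of the screen.
        for (String token : whitespaceTokens(phrase)) {
            System.out.println(token);
        }
    }
}
```

The offsets printed for the two tokens are 0 to 4 and 5 to 10, the same
as I reported from your analyzer. If the character names come out
starting with TET while your screen shows the other word first, the data
is in logical order and only the display is reversed.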

On Mon, Jul 20, 2009 at 3:02 PM, Robert Muir<rcm...@gmail.com> wrote:
> Obender, I ran your code and it did what I expected (but not what you pasted):
>
> First token is: (טוֹב,0,4)
> Second token is: (עֶרֶב,5,10)
>
> I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
>
> On Mon, Jul 20, 2009 at 2:53 PM, OBender<osya_ben...@hotmail.com> wrote:
>> Here is the simple code. If you run it with English and with Hebrew, you will
>> see that for English the tokens are returned from the left of the phrase to
>> the right, and for Hebrew from the right to the left.
>>
>> Again, I'm talking about tokens here, not the individual letters.
>>
>> public class XFilter extends TokenFilter
>> {
>>        protected XFilter( TokenStream tokenStream ) {
>>                super( tokenStream );
>>        }
>>
>>        @Override
>>        public Token next( final Token reusableToken ) throws IOException
>>        {
>>                Token nextToken = input.next( reusableToken );
>>                System.out.println( nextToken != null? nextToken: "" );
>>                return nextToken;
>>        }
>> }
>>
>> public class SimpleWhitespaceAnalyzer extends Analyzer
>> {
>>        @Override
>>        public TokenStream tokenStream( final String fieldName, final Reader reader )
>>        {
>>                TokenStream ts = new WhitespaceTokenizer( reader );
>>                ts = new XFilter( ts );
>>
>>                return ts;
>>        }
>> }
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcm...@gmail.com]
>> Sent: Monday, July 20, 2009 2:26 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: question on custom filter
>>
>> Obender, I think something in your environment / display environment
>> might be causing some confusion.
>>
>> Are you using Microsoft Windows? If so, please verify that support for
>> right-to-left languages is enabled [Control Panel / Regional and
>> Language Options]. It is possible you are "seeing something different"
>> because your rendering system is not actually rendering right-to-left
>> text in the right-to-left direction.
>>
>> Second, instead of using a debugger, I would recommend using Luke to
>> look at resulting tokens from your analyzer.
>>
>> On Mon, Jul 20, 2009 at 2:21 PM, OBender<osya_ben...@hotmail.com> wrote:
>>> This is how it should be written:
>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>> Sent: Monday, July 20, 2009 2:07 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: question on custom filter
>>>
>>> Obender, this is not true.
>>> The text you pasted is the following in Unicode:
>>>
>>> \N{HEBREW LETTER TET}
>>> \N{HEBREW LETTER VAV}
>>> \N{HEBREW POINT HOLAM}
>>> \N{HEBREW LETTER BET}
>>> \N{SPACE}
>>> \N{HEBREW LETTER AYIN}
>>> \N{HEBREW POINT SEGOL}
>>> \N{HEBREW LETTER RESH}
>>> \N{HEBREW POINT SEGOL}
>>> \N{HEBREW LETTER BET}
>>>
>>> You can use this utility to see how your text is encoded:
>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91
>>>
>>> For more information on directionality in Unicode, see
>>> http://unicode.org/reports/tr9/
>>>
>>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<osya_ben...@hotmail.com> wrote:
>>>> Robert,
>>>>
>>>> I'm not sure you are correct on this one.
>>>>
>>>> If I have a Hebrew phrase:
>>>> [טוֹב עֶרֶב]
>>>> Then first token that filter receives is:
>>>> [עֶרֶב] (0,5)
>>>> and the second is:
>>>> [טוֹב] (6,10)
>>>> Which means that it counts from right to left (words and indexes).
>>>>
>>>> Am I missing something?
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>>> Sent: Monday, July 20, 2009 1:43 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: question on custom filter
>>>>
>>>> Obender, I don't think it's as difficult as you think. Your filter does
>>>> not need to be aware of this issue at all.
>>>>
>>>> In Unicode, right-to-left languages are encoded in the data in logical
>>>> order. The rendering system is what converts the text to a right-to-left
>>>> display for RTL languages.
>>>>
>>>> For example in Arabic, "Robert 1234" displays as روبرت 1234
>>>> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
>>>> beh, waw, reh
>>>>
>>>> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>>>>
>>>> 2009/7/20 OBender <osya_ben...@hotmail.com>:
>>>>> Hi All!
>>>>>
>>>>>
>>>>>
>>>>> Let's say I have a filter that produces new tokens based on the original
>>>>> ones.
>>>>>
>>>>> How bad will it be if my filter sets the start of each token to 0 and the
>>>>> end to the length of the token?
>>>>>
>>>>> An example (based on the phrase "How are you?"):
>>>>>
>>>>>
>>>>>
>>>>> Original token:
>>>>>
>>>>> [you?] (8,12)
>>>>>
>>>>>
>>>>>
>>>>> New tokens:
>>>>>
>>>>> [you] (0,3)
>>>>>
>>>>> [?] (0,1)
>>>>>
>>>>>
>>>>>
>>>>> It wouldn't be so hard to calculate the right numbers for left-to-right
>>>>> languages, and it is a bit more challenging for right-to-left ones,
>>>>> but for mixed text it is quite hard.
>>>>>
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>>
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com
