RE: question on custom filter

OBender Mon, 20 Jul 2009 12:30:21 -0700

Interesting. The question now is why I'm seeing what I'm seeing, even in
println :)
I'm reading the string from a file that is UTF-8 encoded. Could this
somehow be related?
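
One way to rule the file encoding in or out is to decode the bytes with an explicit charset and print the code points by number and name rather than as glyphs, so the console's rendering cannot reorder anything. This is only a minimal sketch, assuming Java 7 or later. The class name `CodePointDump` and the helper `describe` are made up for illustration, and the byte array stands in for the contents of the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    // Lists each code point of the string in logical (storage) order as
    // "U+XXXX NAME", so nothing depends on how the console draws RTL text.
    static List<String> describe(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(String.format("U+%04X %s", cp, Character.getName(cp)));
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the file contents (an assumption): the UTF-8 bytes
        // of the Hebrew phrase discussed in this thread.
        byte[] utf8 = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1"
                .getBytes(StandardCharsets.UTF_8);

        // Name the charset explicitly; relying on the platform default
        // can silently corrupt non-ASCII text.
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        for (String line : describe(decoded)) {
            System.out.println(line);
        }
    }
}
```

If the first name printed is HEBREW LETTER TET, the string is stored with that word first, whatever the console appears to show.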


-----Original Message-----
From: Robert Muir [mailto:[email protected]] 
Sent: Monday, July 20, 2009 3:03 PM
To: [email protected]
Subject: Re: question on custom filter

Obender, I ran your code and it did what I expected (but not what you pasted):

First token is: (טוֹב,0,4)
Second token is: (עֶרֶב,5,10)

I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results.
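
The same offsets can be reproduced without Lucene. The sketch below is a plain-Java stand-in for what a whitespace tokenizer reports, not Lucene's `WhitespaceTokenizer` itself; the class `OffsetCheck` and the method `whitespaceTokens` are hypothetical names. It walks the string in logical order and records each run of non-whitespace characters with its start and end offset.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheck {

    // Splits on whitespace and returns each token as "(text,start,end)",
    // with offsets counted in chars from the start of the string.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length()
                    || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i; // a token begins here
            } else if (boundary && start >= 0) {
                tokens.add("(" + text.substring(start, i) + "," + start + "," + i + ")");
                start = -1; // the token has ended
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // tet, vav, holam, bet, space, ayin, segol, resh, segol, bet
        String phrase = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        for (String token : whitespaceTokens(phrase)) {
            System.out.println(token);
        }
    }
}
```

The first token covers offsets 0 to 4 and the second 5 to 10, whichever direction the terminal draws them in.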

On Mon, Jul 20, 2009 at 2:53 PM, OBender<[email protected]> wrote:
> Here is the simple code. If you run it with English and with Hebrew, you will
> see that with English the tokens are returned from the left of the phrase to
> the right, and with Hebrew from the right to the left.
>
> Again, I'm talking about tokens here, not the individual letters.
>
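> // XFilter.java
> import java.io.IOException;
>
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
>
> // Pass-through filter that prints every token it receives.
> public class XFilter extends TokenFilter
> {
>        protected XFilter( TokenStream tokenStream ) {
>                super( tokenStream );
>        }
>
>        @Override
>        public Token next( final Token reusableToken ) throws IOException
>        {
>                Token nextToken = input.next( reusableToken );
>                System.out.println( nextToken != null ? nextToken : "" );
>                return nextToken;
>        }
> }
>
> // SimpleWhitespaceAnalyzer.java (same package as XFilter)
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.WhitespaceTokenizer;
>
> // Splits on whitespace only, then routes the tokens through XFilter.
> public class SimpleWhitespaceAnalyzer extends Analyzer
> {
>        @Override
>        public TokenStream tokenStream( final String fieldName, final Reader reader )
>        {
>                TokenStream ts = new WhitespaceTokenizer( reader );
>                ts = new XFilter( ts );
>
>                return ts;
>        }
> }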
>
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Monday, July 20, 2009 2:26 PM
> To: [email protected]
> Subject: Re: question on custom filter
>
> Obender, I think something in your environment / display environment
> might be causing some confusion.
>
> Are you using Microsoft Windows? If so, please verify that support for
> right-to-left languages is enabled [Control Panel / Regional and
> Language Options]. It is possible you are "seeing something different"
> because your rendering system is not actually rendering right-to-left
> text in the right-to-left direction!
>
> Second, instead of using a debugger, I would recommend using Luke to
> look at the resulting tokens from your analyzer.
>
> On Mon, Jul 20, 2009 at 2:21 PM, OBender<[email protected]> wrote:
>> This is how it should be written:
>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:[email protected]]
>> Sent: Monday, July 20, 2009 2:07 PM
>> To: [email protected]
>> Subject: Re: question on custom filter
>>
>> Obender, this is not true.
>> The text you pasted is the following in Unicode:
>>
>> \N{HEBREW LETTER TET}
>> \N{HEBREW LETTER VAV}
>> \N{HEBREW POINT HOLAM}
>> \N{HEBREW LETTER BET}
>> \N{SPACE}
>> \N{HEBREW LETTER AYIN}
>> \N{HEBREW POINT SEGOL}
>> \N{HEBREW LETTER RESH}
>> \N{HEBREW POINT SEGOL}
>> \N{HEBREW LETTER BET}
>>
>> You can use this utility to see how your text is encoded:
>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91
>>
>> For more information on directionality in Unicode, see
>> http://unicode.org/reports/tr9/
>>
>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<[email protected]> wrote:
>>> Robert,
>>>
>>> I'm not sure you are correct on this one.
>>>
>>> If I have a Hebrew phrase:
>>> [טוֹב עֶרֶב]
>>> Then the first token that the filter receives is:
>>> [עֶרֶב] (0,5)
>>> and the second is:
>>> [טוֹב] (6,10)
>>> Which means that it counts from right to left (words and indexes).
>>>
>>> Am I missing something?
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:[email protected]]
>>> Sent: Monday, July 20, 2009 1:43 PM
>>> To: [email protected]
>>> Subject: Re: question on custom filter
>>>
>>> Obender, I don't think it's as difficult as you think. Your filter does
>>> not need to be aware of this issue at all.
>>>
>>> In Unicode, right-to-left languages are encoded in the data in logical
>>> order.
>>> The rendering system is what converts it to display in right-to-left
>>> for RTL languages.
>>>
>>> For example, in Arabic, "Robert 1234" displays as روبرت 1234
>>> On your computer monitor, read left to right, this looks like 1, 2, 3, 4,
>>> space, teh, reh, beh, waw, reh.
>>>
>>> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
>>>
>>> 2009/7/20 OBender <[email protected]>:
>>>> Hi All!
>>>>
>>>>
>>>>
>>>> Let's say I have a filter that produces new tokens based on the original
>>>> ones.
>>>>
>>>> How bad will it be if my filter sets the start of each token to 0 and the
>>>> end to the length of the token?
>>>>
>>>> An example (based on the phrase "How are you?"):
>>>>
>>>>
>>>>
>>>> Original token:
>>>>
>>>> [you?] (8,12)
>>>>
>>>>
>>>>
>>>> New tokens:
>>>>
>>>> [you] (0,3)
>>>>
>>>> [?] (0,1)
>>>>
>>>>
>>>>
>>>> It wouldn't be so hard to calculate the right numbers for left-to-right
>>>> languages, and it is a bit more challenging to do it for right-to-left ones,
>>>> but for mixed text it is quite hard.
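
Offsets count positions in the logical order of the string, so computing them should not depend on text direction: a sub-token's offsets are the original token's start plus the sub-token's position inside the token text. The following is a rough sketch in plain Java, not Lucene API; the class `SubTokenOffsets` and the method `splitTrailingPunctuation` are hypothetical names standing in for the filter logic.

```java
import java.util.ArrayList;
import java.util.List;

public class SubTokenOffsets {

    // Splits trailing punctuation off a token and returns "[text] (start,end)"
    // entries whose offsets are relative to the original input, not the token.
    // Caveat: combining marks (such as Hebrew points) are not letters or
    // digits, so a real filter would need a finer test than this one.
    static List<String> splitTrailingPunctuation(String term, int tokenStart) {
        List<String> out = new ArrayList<String>();
        int cut = term.length();
        // Move the cut left past any trailing non-letter, non-digit chars.
        while (cut > 0 && !Character.isLetterOrDigit(term.charAt(cut - 1))) {
            cut--;
        }
        if (cut > 0) {
            out.add("[" + term.substring(0, cut) + "] ("
                    + tokenStart + "," + (tokenStart + cut) + ")");
        }
        if (cut < term.length()) {
            out.add("[" + term.substring(cut) + "] ("
                    + (tokenStart + cut) + "," + (tokenStart + term.length()) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // "you?" starts at offset 8 in "How are you?"
        for (String t : splitTrailingPunctuation("you?", 8)) {
            System.out.println(t);
        }
    }
}
```

For the example above this yields [you] (8,11) and [?] (11,12) instead of (0,3) and (0,1).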
>>>>
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> [email protected]
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>
>>
>>
>>
>
>
>



-- 
Robert Muir
[email protected]
