Re: searching using the CJKAnalyzer

Daan Hoogland Tue, 12 Oct 2004 05:57:40 -0700

Che Dong wrote:

> CJKAnalyzer does not support a single-byte stream; the front-end interface and 
> back-end indexing process need to transform the source into a double-byte 
> character stream properly before search/index.
>
> Please let me know the output of
> http://www.chedong.com/tech/HelloUnicode.java
> with javac -encoding gb2312 and javac -encoding iso-8859-1


Here's the output. I can see what's wrong in it. Should I put an extra 
field in the index containing encoding?
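
A minimal sketch of the point about decoding before indexing, using only the JDK. The class and method names are made up for illustration, and UTF-8 is used as the file encoding only because it is guaranteed to be available in every JVM. The idea is that the bytes are decoded with an explicitly named charset when they are read, so what reaches the analyzer is already a proper Java String. If that is done consistently, an extra encoding field in the index should not be necessary, since a String no longer carries a byte encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeBeforeIndexing {

    // Decodes a byte stream with an explicitly named charset instead of
    // relying on the platform default (Cp1252 in the output posted here).
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "Hello world" plus the four test characters U+4E16 U+754C U+4F60 U+597D.
        String original = "Hello world \u4E16\u754C\u4F60\u597D";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(right.length());         // 16
        System.out.println(wrong.length());         // 24
    }
}
```

The wrong decoding yields 24 chars instead of 16 because every CJK character is three bytes in UTF-8 and each byte becomes one Latin-1 char.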

>
> Regards
>
> Che Dong
>
>
> Daan Hoogland wrote:
>
>> Jon Schuster wrote:
>>
>>
>>> I didn't need to make any changes to Entities to get Japanese 
>>> searches working. Are you using the CJKAnalyzer when you perform the 
>>> search, not only when building the index?
>>>
>>>
>>
>> Yes, I use CJKAnalyzer all around. When searching I translate 
>> character entities in order to find anything. When displaying search 
>> results, I don't see anything that looks like part of an East Asian 
>> character set. Instead I see accented Latin letters and mathematical symbols.
>>
>> By the way, when I don't pass entities, things get really nasty:
>> query passed: >Î??Âââ<
>>  char(Î, LATIN_1_SUPPLEMENT)  char(?, LATIN_1_SUPPLEMENT) token found 
>> :  >Î< length: 1
>>  char(?, LATIN_1_SUPPLEMENT)  char(Â, LATIN_1_SUPPLEMENT)  char(â, 
>> LATIN_1_SUPPLEMENT) token found : >Â< length: 1
>>  char(â, LATIN_1_SUPPLEMENT) searching contents:"Î Â"
>>
>> This was a query for two Japanese characters.
>>
>>
>>> -----Original Message-----
>>> From: Daan Hoogland [mailto:[EMAIL PROTECTED]
>>> Sent: Sunday, October 10, 2004 10:48 PM
>>> To: Lucene Users List
>>> Subject: Re: searching using the CJKAnalyzer
>>> Importance: Low
>>>
>>>
>>> Che Dong wrote:
>>>
>>>
>>>
>>>
>>>> This seems to be not an Analyzer problem but an HTML parser charset detection error.
>>>>
>>>> Could you show me the details of the problem?
>>>>  
>>>
>>>
>>> Thanks Che,
>>> I got it working by making the decode() from the Entities in demo 
>>> public. I wrote a scanner to translate any entities in the query.
>>> I want to translate back to entities in the results, but I'm not 
>>> sure what the criteria should be. It seems to be just binary data.
>>> How can I conclude that Â0Å4?Â0â3ÂÂ?Â0â4 means ÃÃÃÂ?
>>>
>>>
>>>
>>>
>>>> Thanks
>>>>
>>>> Che Dong
>>>>
>>>> Daan Hoogland wrote:
>>>>
>>>>  
>>>>
>>>>> LS,
>>>>> in
>>>>> http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>>> Jon Schuster explains how to get a Japanese search system working. 
>>>>> I followed his advice and got an index that "luke" shows as what I 
>>>>> expected it to be.
>>>>> I don't know how to enter a search so that it gets passed to the 
>>>>> engine properly. It works in luke but not in weblucene or in my 
>>>>> own app.
>>>>>
>>>>>
>>>>>    
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>>
>>>>
>>>>  
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>
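
On the question quoted above about translating back to entities in the results: one possible criterion, sketched here under my own assumptions, is to write every character outside US-ASCII as a decimal numeric character reference and to leave everything else alone. The class below is hypothetical and is not the Entities class from the Lucene demo; it only handles decimal references of the form &#NNNN;. It can only give sensible output if the String already holds the right characters, i.e. the bytes were decoded with the correct charset first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntities {

    private static final Pattern REFERENCE = Pattern.compile("&#(\\d{1,7});");

    // Replaces every code point above 127 with a decimal character reference.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Turns decimal references such as &#19990; back into characters.
    // References that do not name a valid code point are left untouched.
    static String decode(String s) {
        Matcher m = REFERENCE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "A\u4E16\u754C";
        String encoded = encode(text);
        System.out.println(encoded);                      // A&#19990;&#30028;
        System.out.println(decode(encoded).equals(text)); // true
    }
}
```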




>>>>testing1: write hello world to files<<<<
[test 1-1]: with system default encoding=Cp1252
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='?'    byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='?'    byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='?'    byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[13]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[14]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[15]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN

[test 1-3]: convert string to UTF8
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[13]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[14]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[15]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN

>>>>testing2: reading and decoding from files<<<<
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[13]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[14]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[15]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[13]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[14]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[15]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[13]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[14]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[15]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
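
A small sketch of what [test 1-2] in the run above appears to show. When a String with CJK characters is encoded to a single-byte charset that cannot represent them, String.getBytes substitutes the charset's replacement byte, which is '?' (63, 0x3F) here. After that the original characters are gone and no later decoding can bring them back. The class name is hypothetical, and ISO-8859-1 stands in for the Cp1252 platform default only because it is guaranteed to exist in every JVM.

```java
import java.nio.charset.StandardCharsets;

final class LossyEncode {

    // Encodes to a single-byte charset; characters the charset cannot
    // represent are replaced by its replacement byte, '?' (0x3F).
    static byte[] toSingleByte(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The four test characters U+4E16 U+754C U+4F60 U+597D.
        String cjk = "\u4E16\u754C\u4F60\u597D";
        byte[] bytes = toSingleByte(cjk);

        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            out.append(b).append(' ');
        }
        System.out.println(out.toString().trim());                     // 63 63 63 63
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // ????
    }
}
```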

>>>>testing1: write hello world to files<<<<
[test 1-1]: with system default encoding=Cp1252
string=Hello world ÊÀ½çÄãºÃ     length=20
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='Ê'    byte=-54 \uFFFFFFCA     short=202 \uCA  LATIN_1_SUPPLEMENT
char[13]='À'    byte=-64 \uFFFFFFC0     short=192 \uC0  LATIN_1_SUPPLEMENT
char[14]='½'    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT
char[15]='ç'    byte=-25 \uFFFFFFE7     short=231 \uE7  LATIN_1_SUPPLEMENT
char[16]='Ä'    byte=-60 \uFFFFFFC4     short=196 \uC4  LATIN_1_SUPPLEMENT
char[17]='ã'    byte=-29 \uFFFFFFE3     short=227 \uE3  LATIN_1_SUPPLEMENT
char[18]='º'    byte=-70 \uFFFFFFBA     short=186 \uBA  LATIN_1_SUPPLEMENT
char[19]='Ã'    byte=-61 \uFFFFFFC3     short=195 \uC3  LATIN_1_SUPPLEMENT

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='?'    byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='?'    byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='?'    byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS

[test 1-3]: convert string to UTF8
string=Hello world ä¸–ç•Œä½ å¥½ length=24
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='ä'    byte=-28 \uFFFFFFE4     short=228 \uE4  LATIN_1_SUPPLEMENT
char[13]='¸'    byte=-72 \uFFFFFFB8     short=184 \uB8  LATIN_1_SUPPLEMENT
char[14]='–'    byte=19 \u13    short=8211 \u2013       GENERAL_PUNCTUATION
char[15]='ç'    byte=-25 \uFFFFFFE7     short=231 \uE7  LATIN_1_SUPPLEMENT
char[16]='•'    byte=34 \u22    short=8226 \u2022       GENERAL_PUNCTUATION
char[17]='Œ'    byte=82 \u52    short=338 \u152 LATIN_EXTENDED_A
char[18]='ä'    byte=-28 \uFFFFFFE4     short=228 \uE4  LATIN_1_SUPPLEMENT
char[19]='½'    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT
char[20]=' '    byte=-96 \uFFFFFFA0     short=160 \uA0  LATIN_1_SUPPLEMENT
char[21]='å'    byte=-27 \uFFFFFFE5     short=229 \uE5  LATIN_1_SUPPLEMENT
char[22]='¥'    byte=-91 \uFFFFFFA5     short=165 \uA5  LATIN_1_SUPPLEMENT
char[23]='½'    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT

>>>>testing2: reading and decoding from files<<<<
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world ÊÀ½çÄãºÃ     length=20
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='Ê'    byte=-54 \uFFFFFFCA     short=202 \uCA  LATIN_1_SUPPLEMENT
char[13]='À'    byte=-64 \uFFFFFFC0     short=192 \uC0  LATIN_1_SUPPLEMENT
char[14]='½'    byte=-67 \uFFFFFFBD     short=189 \uBD  LATIN_1_SUPPLEMENT
char[15]='ç'    byte=-25 \uFFFFFFE7     short=231 \uE7  LATIN_1_SUPPLEMENT
char[16]='Ä'    byte=-60 \uFFFFFFC4     short=196 \uC4  LATIN_1_SUPPLEMENT
char[17]='ã'    byte=-29 \uFFFFFFE3     short=227 \uE3  LATIN_1_SUPPLEMENT
char[18]='º'    byte=-70 \uFFFFFFBA     short=186 \uBA  LATIN_1_SUPPLEMENT
char[19]='Ã'    byte=-61 \uFFFFFFC3     short=195 \uC3  LATIN_1_SUPPLEMENT

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[13]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[14]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN
char[15]='?'    byte=63 \u3F    short=63 \u3F   BASIC_LATIN

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ???? length=16
char[0]='H'     byte=72 \u48    short=72 \u48   BASIC_LATIN
char[1]='e'     byte=101 \u65   short=101 \u65  BASIC_LATIN
char[2]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[3]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[4]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[5]=' '     byte=32 \u20    short=32 \u20   BASIC_LATIN
char[6]='w'     byte=119 \u77   short=119 \u77  BASIC_LATIN
char[7]='o'     byte=111 \u6F   short=111 \u6F  BASIC_LATIN
char[8]='r'     byte=114 \u72   short=114 \u72  BASIC_LATIN
char[9]='l'     byte=108 \u6C   short=108 \u6C  BASIC_LATIN
char[10]='d'    byte=100 \u64   short=100 \u64  BASIC_LATIN
char[11]=' '    byte=32 \u20    short=32 \u20   BASIC_LATIN
char[12]='?'    byte=22 \u16    short=19990 \u4E16      CJK_UNIFIED_IDEOGRAPHS
char[13]='?'    byte=76 \u4C    short=30028 \u754C      CJK_UNIFIED_IDEOGRAPHS
char[14]='?'    byte=96 \u60    short=20320 \u4F60      CJK_UNIFIED_IDEOGRAPHS
char[15]='?'    byte=125 \u7D   short=22909 \u597D      CJK_UNIFIED_IDEOGRAPHS
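
A sketch of the opposite case, which is what [test 1-3] of the second run (length=24) and the Latin-1 characters in the quoted query dump look like to me. Here the bytes were never lost, they were only decoded with the wrong single-byte charset, so each multi-byte character turns into several accented Latin characters. As long as the wrong charset maps every byte to a char, the damage can be undone by encoding with that same charset and decoding with the right one. The names below are hypothetical. ISO-8859-1 and UTF-8 are used because they are guaranteed in every JVM; the run above used Cp1252, which maps bytes 0x80 to 0x9F differently (hence U+2013 for byte 0x96 there), and GB2312 in place of UTF-8 for [test 1-2].

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRoundTrip {

    // Simulates the fault: UTF-8 bytes read as if they were ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undoes it: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The four test characters U+4E16 U+754C U+4F60 U+597D.
        String cjk = "\u4E16\u754C\u4F60\u597D";
        String garbled = misdecode(cjk);

        System.out.println(garbled.length());            // 12
        System.out.println((int) garbled.charAt(0));     // 228
        System.out.println(repair(garbled).equals(cjk)); // true
    }
}
```

The repair is only a diagnostic aid. Decoding with the right charset at the point where the bytes enter the program avoids the problem in the first place.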

