Che Dong wrote:
> CJKAnalyzer does not support single-byte streams; the front-end interface
> and the back-end indexing process need to transform the source into a
> double-byte character stream properly before search/index.
>
> Please let me know the output of
> http://www.chedong.com/tech/HelloUnicode.java
> with javac -encoding gb2312 and javac -encoding iso-8859-1
Here's the output. I can see what's wrong in it. Should I put an extra field in the index containing the encoding?

> Regards
>
> Che Dong
>
> Daan Hoogland wrote:
>
>> Jon Schuster wrote:
>>
>>> I didn't need to make any changes to Entities to get Japanese
>>> searches working. Are you using the CJKAnalyzer when you perform the
>>> search, not only when building the index?
>>
>> Yes, I use CJKAnalyzer all around. When searching I translate
>> character entities in order to find anything. When displaying search
>> results, I don't see anything that looks like part of an eastern
>> character set. Instead I see accented Latin and mathematical symbols.
>>
>> When I don't pass entities, by the way, things get really nasty:
>> query passed: >Î??Âââ<
>> char(Î, LATIN_1_SUPPLEMENT) char(?, LATIN_1_SUPPLEMENT) token found
>> : >Î< length: 1
>> char(?, LATIN_1_SUPPLEMENT) char(Â, LATIN_1_SUPPLEMENT) char(â,
>> LATIN_1_SUPPLEMENT) token found : >Â< length: 1
>> char(â, LATIN_1_SUPPLEMENT) searching contents:"Î Â"
>>
>> This was a query for two Japanese characters.
>>
>>> -----Original Message-----
>>> From: Daan Hoogland [mailto:[EMAIL PROTECTED]
>>> Sent: Sunday, October 10, 2004 10:48 PM
>>> To: Lucene Users List
>>> Subject: Re: searching using the CJKAnalyzer
>>> Importance: Low
>>>
>>> Che Dong wrote:
>>>
>>>> This seems to be not an Analyzer problem but an HTML parser
>>>> charset detection error.
>>>>
>>>> Could you show me the detail of the problem?
>>>
>>> Thanks Che,
>>> I got it working by making the decode() from the Entities in demo
>>> public. I wrote a scanner to translate any entities in the query.
>>> I want to translate back to entities in the results, but I'm not
>>> sure what the criteria should be. It seems to be just binary data.
>>> How to conclude that Â0Å4?Â0â3ÂÂ?Â0â4 means ÃÃÃÂ?
>>>
>>>> Thanks
>>>>
>>>> Che Dong
>>>>
>>>> Daan Hoogland wrote:
>>>>
>>>>> LS,
>>>>> in
>>>>> http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
>>>>> Jon Schuster explains how to get a Japanese search system working.
>>>>> I followed his advice and got an index that "luke" shows as what I
>>>>> expected it to be.
>>>>> I don't know how to enter a search so that it gets passed to the
>>>>> engine properly. It works in luke but not in weblucene or in my
>>>>> own app.
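The query trace quoted above shows a Japanese query arriving as LATIN_1_SUPPLEMENT characters, which is the usual symptom of multi-byte text being decoded one byte per character. A minimal sketch of that effect and of one possible repair follows. It assumes the query was sent as UTF-8 and decoded as ISO-8859-1 somewhere in the front end; the thread confirms neither encoding. The class and method names are made up for illustration, and java.nio.charset.StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class QueryDecodingSketch {

    /**
     * Hypothetical helper: takes a string that was wrongly decoded as
     * ISO-8859-1 and reinterprets the underlying bytes as UTF-8.
     * ISO-8859-1 maps every byte to exactly one char, so the original
     * bytes can be recovered without loss.
     */
    public static String redecodeAsUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two CJK characters, written as escapes so the source file
        // encoding does not matter.
        String original = "\u4E16\u754C";

        // Assumed wire format: UTF-8 (three bytes per character here).
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // The mistake: decoding those bytes one byte per char.
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(misdecoded.length()); // 6 chars instead of 2

        // The repair: back to bytes, then decode with the right charset.
        String repaired = redecodeAsUtf8(misdecoded);
        System.out.println(repaired.equals(original)); // true
    }
}
```

If the front end is a servlet, calling request.setCharacterEncoding(...) before reading any parameter is usually cleaner than repairing strings afterwards; the sketch only illustrates why the analyzer sees Latin-1 characters in the first place.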
>>>>testing1: write hello world to files<<<<
[test 1-1]: with system default encoding=Cp1252
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

[test 1-3]: convert string to UTF8
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

>>>>testing2: reading and decoding from files<<<<
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
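One plausible reading of the byte=63 entries in the run above, at least for test 1-2 ("getBytes with platform default encoding"), is that the CJK characters were lost at encoding time: a single-byte charset cannot represent them, so Java substitutes '?' (byte 63). The sketch below shows that behaviour, and also how a "byte=" column like the one in test 1-1 can come from casting a char to byte. It is not taken from HelloUnicode.java; the class and method names are invented, and ISO-8859-1 stands in for Cp1252 because every JVM is required to support it.

```java
import java.nio.charset.StandardCharsets;

public class LossyEncodingSketch {

    /**
     * Encodes with a single-byte charset and decodes the same bytes again.
     * Characters the charset cannot represent come back as '?'.
     */
    public static String roundTripLatin1(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(encoded, StandardCharsets.ISO_8859_1);
    }

    /** Casting a char to byte keeps only its low 8 bits. */
    public static byte lowByte(char c) {
        return (byte) c;
    }

    public static void main(String[] args) {
        // The four ideographs from the dump: U+4E16 U+754C U+4F60 U+597D.
        String cjk = "\u4E16\u754C\u4F60\u597D";

        // Each unmappable character is replaced by one '?' byte (63).
        for (byte b : cjk.getBytes(StandardCharsets.ISO_8859_1)) {
            System.out.println(b); // 63, four times
        }

        // Once the bytes are '?', no later decoding can bring the text back.
        System.out.println(roundTripLatin1(cjk)); // ????

        // U+4E16 has low byte 0x16, which is 22 as in the dump's test 1-1.
        System.out.println(lowByte('\u4E16')); // 22
    }
}
```

The point of the sketch is that the loss happens in getBytes, so decoding the result as gb2312 afterwards cannot help.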
>>>>testing1: write hello world to files<<<<
[test 1-1]: with system default encoding=Cp1252
string=Hello world ÊÀ½çÄãºÃ
length=20
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='Ê' byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT
char[13]='À' byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT
char[14]='½' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[15]='ç' byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='Ä' byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
char[17]='ã' byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT
char[18]='º' byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT
char[19]='Ã' byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT

[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS

[test 1-3]: convert string to UTF8
string=Hello world ä¸–ç•Œä½ å¥½
length=24
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='ä' byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT
char[13]='¸' byte=-72 \uFFFFFFB8 short=184 \uB8 LATIN_1_SUPPLEMENT
char[14]='–' byte=19 \u13 short=8211 \u2013 GENERAL_PUNCTUATION
char[15]='ç' byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='•' byte=34 \u22 short=8226 \u2022 GENERAL_PUNCTUATION
char[17]='Œ' byte=82 \u52 short=338 \u152 LATIN_EXTENDED_A
char[18]='ä' byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT
char[19]='½' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[20]=' ' byte=-96 \uFFFFFFA0 short=160 \uA0 LATIN_1_SUPPLEMENT
char[21]='å' byte=-27 \uFFFFFFE5 short=229 \uE5 LATIN_1_SUPPLEMENT
char[22]='¥' byte=-91 \uFFFFFFA5 short=165 \uA5 LATIN_1_SUPPLEMENT
char[23]='½' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT

>>>>testing2: reading and decoding from files<<<<
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world ÊÀ½çÄãºÃ
length=20
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='Ê' byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT
char[13]='À' byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT
char[14]='½' byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[15]='ç' byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='Ä' byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
char[17]='ã' byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT
char[18]='º' byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT
char[19]='Ã' byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT

[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN

[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ????
length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
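The length=20 and length=24 strings in the run above have the shape of multi-byte text decoded with a single-byte charset: the four ideographs take 8 bytes in GB2312 and 12 bytes in UTF-8, and each byte becomes one char. The following sketch reproduces both garbled strings and the sign-extended hex values such as FFFFFFCA. It is an illustration, not the code of HelloUnicode.java: the class and method names are invented, the GB2312 bytes are written out literally so that no GB2312 charset support is needed, and it assumes the JVM provides windows-1252 (Java's historical name for it is Cp1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {

    // Assumption: windows-1252 is available in this JVM.
    public static final Charset CP1252 = Charset.forName("windows-1252");

    /** Encodes as UTF-8, then wrongly decodes the bytes as Cp1252. */
    public static String decodeUtf8BytesAsCp1252(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    /** Hex of a byte without masking: negative bytes are sign-extended. */
    public static String unmaskedHex(byte b) {
        return Integer.toHexString(b).toUpperCase();
    }

    /** Hex of a byte masked to 8 bits, the usual way to print it. */
    public static String maskedHex(byte b) {
        return Integer.toHexString(b & 0xFF).toUpperCase();
    }

    public static void main(String[] args) {
        String cjk = "\u4E16\u754C\u4F60\u597D";

        // 12 UTF-8 bytes become 12 chars; with "Hello world " that is 24.
        String fromUtf8 = decodeUtf8BytesAsCp1252(cjk);
        System.out.println(fromUtf8.length()); // 12

        // Byte 0x96 maps to U+2013 in Cp1252, the GENERAL_PUNCTUATION entry.
        System.out.println(Integer.toHexString(fromUtf8.charAt(2))); // 2013

        // The same four ideographs as GB2312 bytes, decoded as Cp1252.
        byte[] gb2312 = {
            (byte) 0xCA, (byte) 0xC0, (byte) 0xBD, (byte) 0xE7,
            (byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3
        };
        String fromGb = new String(gb2312, CP1252);
        System.out.println(fromGb.length()); // 8, so 20 with the prefix

        // Why the dump prints FFFFFFCA for the byte value -54.
        System.out.println(unmaskedHex((byte) 0xCA)); // FFFFFFCA
        System.out.println(maskedHex((byte) 0xCA));   // CA
    }
}
```

In this particular case nothing is lost by the wrong decoding, so encoding the garbled string with the same single-byte charset and decoding the bytes with the correct one recovers the text. That stops working as soon as a byte has no mapping in the charset used for the wrong decoding.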
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]