----- Original Message ----- From: "Jingkang Zhang" <[EMAIL PROTECTED]>
To: <lucene-user@jakarta.apache.org>
Sent: Friday, February 18, 2005 5:12 PM
Subject: The problem of using Cyber Neko HTML Parser parse HTML files
When I use the Cyber Neko HTML Parser to parse HTML files (created by Microsoft Word), if the file contains HTML built-in entity references (for example: &nbsp;), the node value may contain unknown characters.
Like this:

source html:
<DIV> <P class=MsoNormal style="MARGIN: 0cm 0cm 0pt 18pt"><SPAN lang=EN-US style="mso-bidi-font-size: 10.5pt"><FONT face="Times New Roman"><FONT size=3>-rw-r--r--<SPAN style="mso-spacerun: yes"> </SPAN>1 root<SPAN style="mso-spacerun: yes"> </SPAN>root<SPAN style="mso-spacerun: yes"> </SPAN>50 Jan 21 16:12 _1e.f6<o:p></o:p></FONT></FONT></SPAN></P> </DIV>

after parsing html:
-rw-r--r--ç?1 rootçç rootççççç 50 Jan 21 16:12 _1e.f6
How can I avoid it?
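
For illustration only, here is a minimal sketch of what may be happening, assuming the stray characters are non-breaking spaces (U+00A0, the character that &nbsp; stands for) that end up printed or stored in an encoding that does not match. The class and method names (NbspDemo, normalizeNbsp, misdecode) are hypothetical helpers and are not part of the parser's API. The first method replaces U+00A0 with an ordinary space in a node value; the second reproduces the kind of garbling that appears when UTF-8 bytes are read back as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any HTML parser library.
class NbspDemo {

    // Replace every non-breaking space (U+00A0) with an ordinary space.
    // Assumption: the "unknown characters" in the node value are U+00A0.
    static String normalizeNbsp(String nodeValue) {
        return nodeValue.replace('\u00A0', ' ');
    }

    // Reproduce the symptom: encode the text as UTF-8, then decode the
    // bytes as ISO-8859-1. Each U+00A0 becomes two characters.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Node value as it might come back from the parser (assumed).
        String nodeValue = "-rw-r--r--\u00A01 root\u00A0root";
        System.out.println(normalizeNbsp(nodeValue));
    }
}
```

If the assumption holds, calling normalizeNbsp on each text node's value before it is indexed or printed would leave only plain spaces. Writing the output with an explicit, matching encoding may be another way to avoid the garbling.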
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]