To Ken:
     I am running SLES 10 with JRE 1.5, so GB18030 is supported.
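
     For anyone who wants to confirm this on their own box, here is a minimal
     sketch using only java.nio.charset. The class and method names are made up
     for illustration, and the String/Charset overloads need Java 6 or later
     (on 1.5 you would pass the charset name as a String instead).

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Nutch: checks that this JVM has a GB18030 codec.
final class Gb18030Check {

    // True when the named charset is available in the running JVM.
    static boolean supported(String name) {
        return Charset.isSupported(name);
    }

    // Encodes the text in the given charset and decodes it again.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        boolean ok = supported("GB18030");
        System.out.println("GB18030 supported: " + ok);
        if (ok) {
            String sample = "\u4e2d\u6587"; // two Chinese characters
            System.out.println("round trip intact: "
                    + sample.equals(roundTrip(sample, "GB18030")));
        }
    }
}
```

     If the first line prints false, the parser settings below cannot help,
     because the JVM cannot decode the page at all.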
   
    I found the cause of the problem. HtmlParser uses nekohtml to parse the
page, and nekohtml reads the page's meta tag to get the charset. So when the
page declares a charset, the HtmlParser setEncoding call has no effect.
  
   Luckily, nekohtml provides a feature named
"http://cyberneko.org/html/features/scanner/ignore-specified-charset"
   to switch this behavior off.

   It is described here:
http://www.netlikon.de/docs/nekohtml-0.9.5/constant-values.html#org.cyberneko.html.HTMLScanner.IGNORE_SPECIFIED_CHARSET
    
   We can set that feature to true right after creating the DOMFragmentParser:

   DOMFragmentParser parser = new DOMFragmentParser();
   try {
     parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
     ....

   Then the page is parsed in the encoding we specify.
   
   BTW, in the parseNeko method of HtmlParser (in the parse-html plugin),
the following statements throw an exception:

      parser.setFeature("http://apache.org/xml/features/include-comments",
              true);
      parser.setFeature("http://apache.org/xml/features/augmentations",
              true);

    So the ignore-specified-charset statement must be placed before them:

      parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
      parser.setFeature("http://apache.org/xml/features/include-comments",
              true);
      parser.setFeature("http://apache.org/xml/features/augmentations",
              true);
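
    To show why the order matters, here is a small self-contained sketch. It
    does not use nekohtml: StubParser is a made-up stand-in that rejects
    feature names it does not know, and I am assuming the calls share one try
    block whose catch swallows the exception. The real parser would throw a
    SAX exception rather than IllegalArgumentException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: StubParser is hypothetical and is NOT the nekohtml API.
final class FeatureOrderDemo {
    static final String IGNORE_CHARSET =
            "http://cyberneko.org/html/features/scanner/ignore-specified-charset";
    static final String INCLUDE_COMMENTS =
            "http://apache.org/xml/features/include-comments";

    // Accepts a fixed set of feature names and throws on any other name.
    static final class StubParser {
        private final Set<String> recognized;
        final List<String> enabled = new ArrayList<String>();

        StubParser(String... recognized) {
            this.recognized = new HashSet<String>(Arrays.asList(recognized));
        }

        void setFeature(String name, boolean value) {
            if (!recognized.contains(name)) {
                throw new IllegalArgumentException("unrecognized feature: " + name);
            }
            if (value) {
                enabled.add(name);
            }
        }
    }

    // Sets the features in the given order inside a single try block and
    // returns the ones that actually took effect.
    static List<String> apply(String... order) {
        StubParser parser = new StubParser(IGNORE_CHARSET);
        try {
            for (String feature : order) {
                parser.setFeature(feature, true);
            }
        } catch (IllegalArgumentException e) {
            // Swallowed: every call after the failing one is skipped.
        }
        return parser.enabled;
    }

    public static void main(String[] args) {
        System.out.println("charset feature last:  "
                + apply(INCLUDE_COMMENTS, IGNORE_CHARSET).size() + " enabled");
        System.out.println("charset feature first: "
                + apply(IGNORE_CHARSET, INCLUDE_COMMENTS).size() + " enabled");
    }
}
```

    With the charset feature last, the earlier failing call aborts the block
    and nothing is enabled. With it first, it is set before the failure.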
    

To Kenneth Man:
     I hope this message helps you solve your problem :-)

     Hm... Are you Chinese? If you are, you can PM me in Chinese.
     
-- 
View this message in context: 
http://www.nabble.com/Charset-question-tf2231717.html#a6359689
Sent from the Nutch - User forum at Nabble.com.

