> 
> I am trying to use the javax.swing.text.html.HTMLEditorKit.Parser to
> parse some web pages (using code from Elliotte Harold's "Java Network
> Programming", 2nd ed., O'Reilly).
> 
> I get javax.swing.text.ChangedCharSetException on most pages because
> (it seems) Netscape Composer embeds this in each page:
> 
> <head>
>    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
>    <meta name="GENERATOR" content="Mozilla/4.76 [en] (Windows NT 5.0; U) [Netscape]">
>    <meta name="Author" content="John Caron">
>    <title>GDV WebStart</title>
> </head>
> 
> and it seems that "text/html; charset=iso-8859-1" always causes an
> Exception.
> 


Here's a brute-force workaround for this kind of problem: I just remove the 
offending lines before parsing the HTML. If anyone has more info on what 
HTMLEditorKit.ParserCallback can and can't do, or knows better ways to parse HTML, 
please post.

    ...
    baseURL = new URL(urlName);
    InputStream in = baseURL.openStream();
    InputStreamReader r = new InputStreamReader(filterTag(in));
    HTMLEditorKit.ParserCallback callback = new MyCallerBacker();
    parser.parse(r, callback, false);
    ...

  // Workaround for HTMLEditorKit.Parser, which throws ChangedCharSetException
  // on <meta http-equiv="Content-Type" ... charset=...>: drop every line that
  // contains a meta tag before the parser sees it. Needs java.io.*.
  private InputStream filterTag(InputStream in) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream(10000);

    // ISO-8859-1 maps each byte to one char, so the bytes round-trip unchanged
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, "ISO-8859-1"));
    String line;
    while ((line = reader.readLine()) != null) {  // null marks end of stream
      String lline = line.toLowerCase();
      if (0 <= lline.indexOf("<meta "))  // skip meta tags
        continue;
      //System.out.println("--"+line);
      bos.write(line.getBytes("ISO-8859-1"));
      bos.write('\n');  // readLine() strips the line terminator, so restore it
    }
    reader.close();

    return new ByteArrayInputStream(bos.toByteArray());
  }
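
A possibly simpler alternative, sketched here rather than tested against the pages in question: the third argument of HTMLEditorKit.Parser.parse() is documented as ignoreCharSet, and passing true should tell the parser to skip the charset check instead of throwing ChangedCharSetException. The class and method names below (IgnoreCharsetDemo, TextCollector, extractText) are made up for illustration, and the sample HTML is just the header from the original question with a small body added.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class IgnoreCharsetDemo {

    /** Hypothetical callback that collects the text chunks the parser reports. */
    static class TextCollector extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleText(char[] data, int pos) {
            text.append(data).append('\n');
        }
    }

    /** Parses the HTML from r and returns the collected text, one chunk per line. */
    static String extractText(Reader r) throws IOException {
        TextCollector callback = new TextCollector();
        // true = ignoreCharSet: do not throw on a <meta> charset declaration
        new ParserDelegator().parse(r, callback, true);
        return callback.text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html =
            "<html><head>"
            + "<meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=iso-8859-1\">"
            + "<title>GDV WebStart</title>"
            + "</head><body><p>Hello</p></body></html>";
        System.out.print(extractText(new StringReader(html)));
    }
}
```

With this approach the Reader still has to be created with the right encoding up front, since the parser no longer reports a mismatch; the upside is that no lines are stripped from the page.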

_______________________________________________
Advanced-swing mailing list
[EMAIL PROTECTED]
http://eos.dk/mailman/listinfo/advanced-swing
