Re: [ANNOUNCE] Xerces HTML Parser

2002-02-17 Thread Andy Clark
I've posted a bug fix release for the latest version of the NekoHTML parser. This release fixes the following bugs: * Attributes were being removed from all elements in the SAX parser because the HTML parser configuration didn't have a symbol table. (I still don't use a symbol table

Re: [ANNOUNCE] Xerces HTML Parser

2002-02-15 Thread Joseph Kesselman/CAM/Lotus
Thanks; that helps set the context. Pushing the more extreme forms of fixup into a later XNI module does sound entirely reasonable.

RE: [ANNOUNCE] Xerces HTML Parser

2002-02-15 Thread Mikko Honkala
> Joseph Kesselman/CAM/Lotus wrote: > > One question: A huge percentage of the files out there which claim to be > > HTML aren't, or at least aren't correct HTML. Browsers are generally very > > forgiving and att

Re: [ANNOUNCE] Xerces HTML Parser

2002-02-14 Thread Glenn Marcy
Andy writes: > So to solve this problem, I wrote a "playback" input stream This sounds like the same problem that we had to solve when the EntityManager$RewindableInputStream class was added. Could you take a look and see how they compare? If they are really doing the same thing then perhaps w

Re: [ANNOUNCE] Xerces HTML Parser

2002-02-14 Thread Andy Clark
Joseph Kesselman/CAM/Lotus wrote: > One question: A huge percentage of the files out there which claim to be > HTML aren't, or at least aren't correct HTML. Browsers are generally very > forgiving and attempt to read past those errors but exactly how they > recover varies from browser to brows

Re: [ANNOUNCE] Xerces HTML Parser

2002-02-14 Thread Joseph Kesselman/CAM/Lotus
I think having an HTML parser available is definitely a Good Idea. One question: A huge percentage of the files out there which claim to be HTML aren't, or at least aren't correct HTML. Browsers are generally very forgiving and attempt to read past those errors but exactly how they recover v

Re: [ANNOUNCE] Xerces HTML Parser

2002-02-14 Thread Andy Clark
It was bugging me that the first version of the NekoHTML parser could only handle the character encoding "Cp1252" (which is the basic Windows encoding), so I updated the code to be able to automatically handle UTF-8 (w/ BOM) and UTF-16. In addition, it can detect the presence of a tag and scan t
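For readers unfamiliar with byte-order-mark sniffing, the kind of check described in the message above can be sketched roughly as follows. This is an illustration only, not NekoHTML's actual code: the class name `BomSniffer`, the method `detect`, and the choice to fall back to "Cp1252" when no BOM is found are all assumptions made for the example.

```java
public class BomSniffer {

    /**
     * Hypothetical helper: inspects the first bytes of a document and
     * returns the Java encoding name implied by a byte order mark.
     * Falls back to "Cp1252" (an assumption for this sketch) when no
     * BOM is present.
     */
    public static String detect(byte[] head) {
        // UTF-8 BOM is the three bytes EF BB BF.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            // UTF-16 big-endian BOM is FE FF.
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            // UTF-16 little-endian BOM is FF FE.
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }
        // No recognized BOM: use the default encoding.
        return "Cp1252";
    }
}
```

A caller would read the first few bytes of the stream, pass them to `detect`, and then need some way to re-read those bytes with the chosen encoding, since the bytes consumed for sniffing are otherwise lost.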

[ANNOUNCE] Xerces HTML Parser

2002-02-08 Thread Andy Clark
For a long time users have asked if Xerces can parse HTML files. But since most HTML documents are not well-formed XML documents, it is generally not possible to use a conforming XML parser to read HTML documents. However, the Xerces Native Interface (XNI) that is the foundation of the Xerce