I've posted a bug fix release for the latest version of the
NekoHTML parser. This release fixes the following bugs:
* Attributes were being removed from all elements in the
SAX parser because the HTML parser configuration didn't
have a symbol table. (I still don't use a symbol table
Thanks; that helps set the context. Pushing the more extreme forms of fixup
into a later XNI module does sound entirely reasonable.
> Subject: Re: [ANNOUNCE] Xerces HTML Parser
>
>
> Joseph Kesselman/CAM/Lotus wrote:
> > One question: A huge percentage of the files out there which claim to be
> > HTML aren't, or at least aren't correct HTML. Browsers are generally very
> > forgiving and att
Andy writes:
> So to solve this problem, I wrote a "playback" input stream
This sounds like the same problem that we had to solve when
the EntityManager$RewindableInputStream class was added.
Could you take a look and see how they compare? If they are
really doing the same thing then perhaps w
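
For readers comparing the two approaches, here is a minimal sketch of what a "playback" (or rewindable) input stream generally looks like: it records every byte it hands out so the caller can start over from the beginning. The class name and the playback() method are hypothetical; this is not the NekoHTML code and not EntityManager$RewindableInputStream, only an illustration of the general idea.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not taken from NekoHTML or Xerces.
final class PlaybackInputStream extends InputStream {
    private final InputStream source;
    // Every byte read from the source so far.
    private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();
    // Bytes currently being replayed, and the position within them.
    private byte[] replay = new byte[0];
    private int replayPos = 0;

    PlaybackInputStream(InputStream source) {
        this.source = source;
    }

    @Override
    public int read() throws IOException {
        // Serve recorded bytes first when a playback is in progress.
        if (replayPos < replay.length) {
            return replay[replayPos++] & 0xFF;
        }
        // Otherwise read from the source and remember the byte.
        int b = source.read();
        if (b >= 0) {
            recorded.write(b);
        }
        return b;
    }

    // Rewind: the next read() returns the first byte ever read.
    void playback() {
        replay = recorded.toByteArray();
        replayPos = 0;
    }
}
```

A scanner could read a few bytes, decide it guessed the wrong encoding, call playback(), and then wrap the same stream in a new Reader. Note that this sketch buffers everything read so far and never releases it, which a real implementation would probably want to avoid.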
Joseph Kesselman/CAM/Lotus wrote:
> One question: A huge percentage of the files out there which claim to be
> HTML aren't, or at least aren't correct HTML. Browsers are generally very
> forgiving and attempt to read past those errors but exactly how they
> recover varies from browser to brows
I think having an HTML parser available is definitely a Good Idea.
One question: A huge percentage of the files out there which claim to be
HTML aren't, or at least aren't correct HTML. Browsers are generally very
forgiving and attempt to read past those errors but exactly how they
recover v
It was bugging me that the first version of the NekoHTML parser
could only handle the character encoding "Cp1252" (which is the
basic Windows encoding), so I updated the code to be able to
automatically handle UTF-8 (w/ BOM) and UTF-16. In addition,
it can detect the presence of a tag and scan t
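
As an illustration of the byte order mark (BOM) check mentioned above, here is a minimal sketch of how a stream's first bytes can be inspected to choose between UTF-8, UTF-16 and a default encoding. The class and method names are hypothetical and this is not the NekoHTML code; the only facts assumed are the standard BOM byte sequences (EF BB BF for UTF-8, FE FF and FF FE for big- and little-endian UTF-16) and the "Cp1252" default named in the message.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical illustration only; not taken from NekoHTML.
final class BomSniffer {
    // Fallback when no BOM is present, as in the first parser version.
    static final String DEFAULT_ENCODING = "Cp1252";

    // Returns a Java encoding name and leaves the stream positioned just
    // after the BOM (or at the start if there is none). The stream must
    // have been created with a pushback buffer of at least 3 bytes.
    static String detect(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream is shorter than 3 bytes
            }
            n += r;
        }
        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;

        String encoding = DEFAULT_ENCODING;
        int bomLength = 0;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            encoding = "UTF-8";
            bomLength = 3;
        } else if (b0 == 0xFE && b1 == 0xFF) {
            encoding = "UTF-16BE";
            bomLength = 2;
        } else if (b0 == 0xFF && b1 == 0xFE) {
            encoding = "UTF-16LE";
            bomLength = 2;
        }
        // Give back every byte that was not part of a BOM.
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return encoding;
    }
}
```

The returned name can be passed to an InputStreamReader wrapped around the same stream, for example new InputStreamReader(in, BomSniffer.detect(in)).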
For a long time users have asked if Xerces can parse HTML files.
But since most HTML documents are not well-formed XML documents,
it is generally not possible to use a conforming XML parser to
read HTML documents.
However, the Xerces Native Interface (XNI) that is the foundation
of the Xerce