Thank you. My git bisect just found this change too. :-)

We are processing documents that we have no control over, and some may use
these numeric encodings, so we can't update the documents.

Looking at the XML spec
(https://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say...

   EncName   ::=   [A-Za-z] ([A-Za-z0-9._] | '-')*

...and I assume that that's why the regex in XmlStreamReader was changed.
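
To make the effect concrete, here is a minimal sketch that transcribes the
EncName production above into a Java regex and tries a few names against it.
The class and method names (EncNameCheck, isEncName) are made up for
illustration; this is not the code in XmlStreamReader, only the grammar rule
on its own.

```java
import java.util.regex.Pattern;

public class EncNameCheck {
    // Transcription of the spec's production:
    //   EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
    // (illustrative only; not the pattern used by XmlStreamReader)
    static final Pattern ENC_NAME = Pattern.compile("[A-Za-z]([A-Za-z0-9._]|-)*");

    // True if the whole string is a valid EncName under that production.
    static boolean isEncName(String name) {
        return ENC_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "Cp437", "437"}) {
            System.out.println(name + " -> " + isEncName(name));
        }
    }
}
```

"UTF-8" and "Cp437" match, while "437" does not, because the production
requires the first character to be a letter.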

However, the same part of the XML spec also says...

> It is recommended that character encodings registered (as charsets) with the
> Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just
> listed, be referred to using their registered names; other encodings should
> use names starting with an "x-" prefix.

...and "IANA-CHARSETS" links to
http://www.iana.org/assignments/character-sets/character-sets.xhtml
which has a table of character sets. In that table there are three columns of
possible codes for referring to charsets: "Preferred MIME", "Name", and
"Aliases". The "Aliases" column includes "437" (that's also where "cp437" can
be found).

This seems like an inconsistency in the XML spec, since a number of "Name"
and "Aliases" values (such as "437") don't match the definition of
EncName.

Given this inconsistency, and the fact that there are XML documents "in the
wild" that use these encoding names, would it be reasonable to relax the regex
just enough so that it'll work with these other names and aliases?
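
As a rough sketch of what "just enough" might look like, the following relaxes
only the first character so that it may also be a digit. The pattern, the
class name RelaxedEncoding, and the helper findEncoding are all hypothetical;
this is not a proposed patch and not the pattern Commons IO uses, and it still
would not cover IANA names that contain characters such as ':' or
parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelaxedEncoding {
    // Hypothetical relaxed name: same characters as EncName, except the
    // first character may be a digit as well as a letter.
    static final String NAME = "[A-Za-z0-9][A-Za-z0-9._-]*";

    // Matches encoding="..." or encoding='...' and captures the name
    // (group 1 for double quotes, group 2 for single quotes).
    static final Pattern ENCODING = Pattern.compile(
            "encoding\\s*=\\s*(?:\"(" + NAME + ")\"|'(" + NAME + ")')");

    // Returns the declared encoding name, or null if none is found.
    static String findEncoding(String prolog) {
        Matcher m = ENCODING.matcher(prolog);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(findEncoding("<?xml version='1.0' encoding='437'?>"));
        System.out.println(findEncoding("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
    }
}
```

With this sketch, encoding='437' yields "437" while an empty or
space-containing value is still rejected. Whether the resulting name resolves
to a Charset would still depend on the JVM's supported charsets and aliases.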


On Tue, Oct 3, 2023 at 12:24 PM sebb <seb...@gmail.com> wrote:
>
> Just had another look at the class: in 2.13, the regex for matching
> the encoding string was
> Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
> Pattern.MULTILINE);
>
> In 2.14, the pattern includes the following matching for the encoding:
> "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9\\\\._]|-)*'))",
>
> This does not allow for an encoding that starts with a digit; i.e. it
> won't match encoding='437'
>
> AFAICT, no supported encodings start with a digit.
>
> The '437' encoding is actually known as 'Cp437':
> https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
> https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html
>
> Try using 'Cp437' as the encoding.
>
> On Tue, 3 Oct 2023 at 20:01, sebb <seb...@gmail.com> wrote:
> >
> > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
> > <laure...@xenomachina.com> wrote:
> > >
> > > On Tue, Oct 3, 2023 at 1:39 AM sebb <seb...@gmail.com> wrote:
> > > >
> > > > The byte input stream does not carry any encoding information, so the
> > > > XmlStreamReader has to guess what encoding was used.
> > >
> > > Determining what encoding to use when reading XML from a byte stream
> > > is the purpose of XmlStreamReader. From its documentation: "Character
> > > stream that handles all the necessary Voodoo to figure out the charset
> > > encoding of the XML document within the stream."
> > >
> > > What it's supposed to do in this case is use the "encoding='437'" from
> > > the input to determine that the Charset to use when decoding the byte
> > > stream is "437" (aka "code page 437").
> >
> > Sorry, I completely overlooked that.
> >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: user-h...@commons.apache.org
> > >
>
