Created https://issues.apache.org/jira/browse/IO-815

On Tue, 3 Oct 2023 at 21:49, sebb <seb...@gmail.com> wrote:
>
> On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves
> <laure...@xenomachina.com> wrote:
> >
> > Thank you. My git bisect just found this change too. :-)
> >
> > We are processing documents that we have no control over, and some may use
> > these numeric encodings, so we can't update the documents.
> >
> > Looking at the XML spec
> > (https: //www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say...
> >
> >    EncName   ::=   [A-Za-z] ([A-Za-z0-9._] | '-')*
> >
> > ...and I assume that that's why the regex in XmlStreamReader was changed.
>
> That would explain the somewhat convoluted RE.
>
> > However, the same part of the XML spec also says...
> >
> > > It is recommended that character encodings registered (as charsets) with 
> > > the
> > > Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just
> > > listed, be referred to using their registered names; other encodings 
> > > should
> > > use names starting with an "x-" prefix.
> >
> > ...and "IANA-CHARSETS" links to
> > http://www.iana.org/assignments/character-sets/character-sets.xhtml
> > which has a table of character sets. In that table there are three columns 
> > of
> > possible codes for referring to charsets: "Preferred MIME", "Name", and
> > "Aliases". The "Aliases" column includes "437" (that's also where "cp437" 
> > can
> > be found).
> >
> > This seems like a bit of an inconsistency in the XML spec, as there are a
> > number of "Name" and "Alias" values that don't quite match the definition of
> > EncName.
>
> Indeed.
> There are also examples such as ISO_8859-1:1987 and ebcdic-de-273+euro
>
> > Given this inconsistency, and the fact that there are XML documents "in the
> > wild" that use these encoding names, would it be reasonable to relax the 
> > regex
> > just enough so that it'll work with these other names and aliases?
>
> I would say yes.
>
> I'm currently working on a test to check that all charsets *and
> aliases* can be matched, and on a fix to the RE.
>
> >
> > On Tue, Oct 3, 2023 at 12:24 PM sebb <seb...@gmail.com> wrote:
> > >
> > > Just had another look at the class: in 2.13, the regex for matching
> > > the encoding string was
> > > Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
> > > Pattern.MULTILINE);
> > >
> > > In 2.14, the pattern includes the following matching for the encoding:
> > > "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9\\\\._]|-)*'))",
> > >
> > > This does not allow for an encoding that starts with a digit; i.e. it
> > > won't match encoding='437'
> > >
> > > AFAICT, no supported encodings start with a digit.
> > >
> > > The '437' encoding is actually known as 'Cp437':
> > > https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
> > > https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html
> > >
> > > Try using 'Cp437' as the encoding.
> > >
> > > On Tue, 3 Oct 2023 at 20:01, sebb <seb...@gmail.com> wrote:
> > > >
> > > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
> > > > <laure...@xenomachina.com> wrote:
> > > > >
> > > > > On Tue, Oct 3, 2023 at 1:39 AM sebb <seb...@gmail.com> wrote:
> > > > > >
> > > > > > The byte input stream does not carry any encoding information, so 
> > > > > > the
> > > > > > XmlStreamReader has to guess what encoding was used.
> > > > >
> > > > > Determining what encoding to use when reading XML from a byte stream
> > > > > is the purpose of XmlStreamReader. From its documentation: "Character
> > > > > stream that handles all the necessary Voodoo to figure out the charset
> > > > > encoding of the XML document within the stream."
> > > > >
> > > > > What it's supposed to do in this case is use the "encoding='437'" from
> > > > > the input to determine that the Charset to use when decoding the byte
> > > > > stream is "437" (aka "code page 437").
> > > >
> > > > Sorry, I completely overlooked that.
> > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > > > For additional commands, e-mail: user-h...@commons.apache.org
> > > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: user-h...@commons.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > For additional commands, e-mail: user-h...@commons.apache.org
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org

Reply via email to