Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
On Tue, 3 Oct 2023 at 22:17, Laurence Gonsalves
 wrote:
>
> On Tue, Oct 3, 2023 at 1:50 PM sebb  wrote:
> > > Given this inconsistency, and the fact that there are XML documents "in 
> > > the
> > > wild" that use these encoding names, would it be reasonable to relax the 
> > > regex
> > > just enough so that it'll work with these other names and aliases?
> >
> > I would say yes.
> >
> > I'm currently working on a test to check that all charsets *and
> > aliases* can be matched, and on a fix to the RE.

Test and fix have been committed.

> Excellent! Thank you!

Thank you for raising the issue.

> -
> To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> For additional commands, e-mail: user-h...@commons.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Laurence Gonsalves
On Tue, Oct 3, 2023 at 1:50 PM sebb  wrote:
> > Given this inconsistency, and the fact that there are XML documents "in the
> > wild" that use these encoding names, would it be reasonable to relax the 
> > regex
> > just enough so that it'll work with these other names and aliases?
>
> I would say yes.
>
> I'm currently working on a test to check that all charsets *and
> aliases* can be matched, and on a fix to the RE.

Excellent! Thank you!

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
Created https://issues.apache.org/jira/browse/IO-815

On Tue, 3 Oct 2023 at 21:49, sebb  wrote:
>
> On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves
>  wrote:
> >
> > Thank you. My git bisect just found this change too. :-)
> >
> > We are processing documents that we have no control over, and some may use
> > these numeric encodings, so we can't update the documents.
> >
> > Looking at the XML spec
> > (https: //www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say...
> >
> >EncName   ::=   [A-Za-z] ([A-Za-z0-9._] | '-')*
> >
> > ...and I assume that that's why the regex in XmlStreamReader was changed.
>
> That would explain the somewhat convoluted RE.
>
> > However, the same part of the XML spec also says...
> >
> > > It is recommended that character encodings registered (as charsets) with 
> > > the
> > > Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just
> > > listed, be referred to using their registered names; other encodings 
> > > should
> > > use names starting with an "x-" prefix.
> >
> > ...and "IANA-CHARSETS" links to
> > http://www.iana.org/assignments/character-sets/character-sets.xhtml
> > which has a table of character sets. In that table there are three columns 
> > of
> > possible codes for referring to charsets: "Preferred MIME", "Name", and
> > "Aliases". The "Aliases" column includes "437" (that's also where "cp437" 
> > can
> > be found).
> >
> > This seems like a bit of an inconsistency in the XML spec, as there are a
> > number of "Name" and "Alias" values that don't quite match the definition of
> > EncName.
>
> Indeed.
> There are also examples such as ISO_8859-1:1987 and ebcdic-de-273+euro
>
> > Given this inconsistency, and the fact that there are XML documents "in the
> > wild" that use these encoding names, would it be reasonable to relax the 
> > regex
> > just enough so that it'll work with these other names and aliases?
>
> I would say yes.
>
> I'm currently working on a test to check that all charsets *and
> aliases* can be matched, and on a fix to the RE.
>
> >
> > On Tue, Oct 3, 2023 at 12:24 PM sebb  wrote:
> > >
> > > Just had another look at the class: in 2.13, the regex for matching
> > > the encoding string was
> > > Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
> > > Pattern.MULTILINE);
> > >
> > > In 2.14, the pattern includes the following matching for the encoding:
> > > "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))",
> > >
> > > This does not allow for an encoding that starts with a digit; i.e. it
> > > won't match encoding='437'
> > >
> > > AFAICT, no supported encodings start with a digit.
> > >
> > > The '437' encoding is actually known as 'Cp437':
> > > https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
> > > https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html
> > >
> > > Try using 'Cp437' as the encoding.
> > >
> > > On Tue, 3 Oct 2023 at 20:01, sebb  wrote:
> > > >
> > > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
> > > >  wrote:
> > > > >
> > > > > On Tue, Oct 3, 2023 at 1:39 AM sebb  wrote:
> > > > > >
> > > > > > The byte input stream does not carry any encoding information, so 
> > > > > > the
> > > > > > XmlStreamReader has to guess what encoding was used.
> > > > >
> > > > > Determining what encoding to use when reading XML from a byte stream
> > > > > is the purpose of XmlStreamReader. From its documentation: "Character
> > > > > stream that handles all the necessary Voodoo to figure out the charset
> > > > > encoding of the XML document within the stream."
> > > > >
> > > > > What it's supposed to do in this case is use the "encoding='437'" from
> > > > > the input to determine that the Charset to use when decoding the byte
> > > > > stream is "437" (aka "code page 437").
> > > >
> > > > Sorry, I completely overlooked that.
> > > >
> > > > > -
> > > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > > > For additional commands, e-mail: user-h...@commons.apache.org
> > > > >
> > >
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: user-h...@commons.apache.org
> > >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > For additional commands, e-mail: user-h...@commons.apache.org
> >

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves
 wrote:
>
> Thank you. My git bisect just found this change too. :-)
>
> We are processing documents that we have no control over, and some may use
> these numeric encodings, so we can't update the documents.
>
> Looking at the XML spec
> (https: //www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say...
>
>EncName   ::=   [A-Za-z] ([A-Za-z0-9._] | '-')*
>
> ...and I assume that that's why the regex in XmlStreamReader was changed.

That would explain the somewhat convoluted RE.

> However, the same part of the XML spec also says...
>
> > It is recommended that character encodings registered (as charsets) with the
> > Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just
> > listed, be referred to using their registered names; other encodings should
> > use names starting with an "x-" prefix.
>
> ...and "IANA-CHARSETS" links to
> http://www.iana.org/assignments/character-sets/character-sets.xhtml
> which has a table of character sets. In that table there are three columns of
> possible codes for referring to charsets: "Preferred MIME", "Name", and
> "Aliases". The "Aliases" column includes "437" (that's also where "cp437" can
> be found).
>
> This seems like a bit of an inconsistency in the XML spec, as there are a
> number of "Name" and "Alias" values that don't quite match the definition of
> EncName.

Indeed.
There are also examples such as ISO_8859-1:1987 and ebcdic-de-273+euro

> Given this inconsistency, and the fact that there are XML documents "in the
> wild" that use these encoding names, would it be reasonable to relax the regex
> just enough so that it'll work with these other names and aliases?

I would say yes.

I'm currently working on a test to check that all charsets *and
aliases* can be matched, and on a fix to the RE.

>
> On Tue, Oct 3, 2023 at 12:24 PM sebb  wrote:
> >
> > Just had another look at the class: in 2.13, the regex for matching
> > the encoding string was
> > Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
> > Pattern.MULTILINE);
> >
> > In 2.14, the pattern includes the following matching for the encoding:
> > "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))",
> >
> > This does not allow for an encoding that starts with a digit; i.e. it
> > won't match encoding='437'
> >
> > AFAICT, no supported encodings start with a digit.
> >
> > The '437' encoding is actually known as 'Cp437':
> > https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
> > https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html
> >
> > Try using 'Cp437' as the encoding.
> >
> > On Tue, 3 Oct 2023 at 20:01, sebb  wrote:
> > >
> > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
> > >  wrote:
> > > >
> > > > On Tue, Oct 3, 2023 at 1:39 AM sebb  wrote:
> > > > >
> > > > > The byte input stream does not carry any encoding information, so the
> > > > > XmlStreamReader has to guess what encoding was used.
> > > >
> > > > Determining what encoding to use when reading XML from a byte stream
> > > > is the purpose of XmlStreamReader. From its documentation: "Character
> > > > stream that handles all the necessary Voodoo to figure out the charset
> > > > encoding of the XML document within the stream."
> > > >
> > > > What it's supposed to do in this case is use the "encoding='437'" from
> > > > the input to determine that the Charset to use when decoding the byte
> > > > stream is "437" (aka "code page 437").
> > >
> > > Sorry, I completely overlooked that.
> > >
> > > > -
> > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > > For additional commands, e-mail: user-h...@commons.apache.org
> > > >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > For additional commands, e-mail: user-h...@commons.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> For additional commands, e-mail: user-h...@commons.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Laurence Gonsalves
Thank you. My git bisect just found this change too. :-)

We are processing documents that we have no control over, and some may use
these numeric encodings, so we can't update the documents.

Looking at the XML spec
(https: //www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say...

   EncName   ::=   [A-Za-z] ([A-Za-z0-9._] | '-')*

...and I assume that that's why the regex in XmlStreamReader was changed.

However, the same part of the XML spec also says...

> It is recommended that character encodings registered (as charsets) with the
> Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just
> listed, be referred to using their registered names; other encodings should
> use names starting with an "x-" prefix.

...and "IANA-CHARSETS" links to
http://www.iana.org/assignments/character-sets/character-sets.xhtml
which has a table of character sets. In that table there are three columns of
possible codes for referring to charsets: "Preferred MIME", "Name", and
"Aliases". The "Aliases" column includes "437" (that's also where "cp437" can
be found).

This seems like a bit of an inconsistency in the XML spec, as there are a
number of "Name" and "Alias" values that don't quite match the definition of
EncName.

Given this inconsistency, and the fact that there are XML documents "in the
wild" that use these encoding names, would it be reasonable to relax the regex
just enough so that it'll work with these other names and aliases?


On Tue, Oct 3, 2023 at 12:24 PM sebb  wrote:
>
> Just had another look at the class: in 2.13, the regex for matching
> the encoding string was
> Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
> Pattern.MULTILINE);
>
> In 2.14, the pattern includes the following matching for the encoding:
> "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))",
>
> This does not allow for an encoding that starts with a digit; i.e. it
> won't match encoding='437'
>
> AFAICT, no supported encodings start with a digit.
>
> The '437' encoding is actually known as 'Cp437':
> https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
> https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html
>
> Try using 'Cp437' as the encoding.
>
> On Tue, 3 Oct 2023 at 20:01, sebb  wrote:
> >
> > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
> >  wrote:
> > >
> > > On Tue, Oct 3, 2023 at 1:39 AM sebb  wrote:
> > > >
> > > > The byte input stream does not carry any encoding information, so the
> > > > XmlStreamReader has to guess what encoding was used.
> > >
> > > Determining what encoding to use when reading XML from a byte stream
> > > is the purpose of XmlStreamReader. From its documentation: "Character
> > > stream that handles all the necessary Voodoo to figure out the charset
> > > encoding of the XML document within the stream."
> > >
> > > What it's supposed to do in this case is use the "encoding='437'" from
> > > the input to determine that the Charset to use when decoding the byte
> > > stream is "437" (aka "code page 437").
> >
> > Sorry, I completely overlooked that.
> >
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: user-h...@commons.apache.org
> > >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> For additional commands, e-mail: user-h...@commons.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
Although the pages I linked don't mention them, it turns out that
there is actually an alias '437', also many other numeric ones.

Indeed there are other aliases that start with a letter but otherwise
don't match the RE.
e.g. ISO_8859-1:1987
So it seems the updated RE is indeed too restrictive.

Sorry for the confusion.

On Tue, 3 Oct 2023 at 20:22, sebb  wrote:
>
> Just had another look at the class: in 2.13, the regex for matching
> the encoding string was
> Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
> Pattern.MULTILINE);
>
> In 2.14, the pattern includes the following matching for the encoding:
> "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))",
>
> This does not allow for an encoding that starts with a digit; i.e. it
> won't match encoding='437'
>
> AFAICT, no supported encodings start with a digit.
>
> The '437' encoding is actually known as 'Cp437':
> https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
> https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html
>
> Try using 'Cp437' as the encoding.
>
> On Tue, 3 Oct 2023 at 20:01, sebb  wrote:
> >
> > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
> >  wrote:
> > >
> > > On Tue, Oct 3, 2023 at 1:39 AM sebb  wrote:
> > > >
> > > > The byte input stream does not carry any encoding information, so the
> > > > XmlStreamReader has to guess what encoding was used.
> > >
> > > Determining what encoding to use when reading XML from a byte stream
> > > is the purpose of XmlStreamReader. From its documentation: "Character
> > > stream that handles all the necessary Voodoo to figure out the charset
> > > encoding of the XML document within the stream."
> > >
> > > What it's supposed to do in this case is use the "encoding='437'" from
> > > the input to determine that the Charset to use when decoding the byte
> > > stream is "437" (aka "code page 437").
> >
> > Sorry, I completely overlooked that.
> >
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: user-h...@commons.apache.org
> > >

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
Just had another look at the class: in 2.13, the regex for matching
the encoding string was
Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))",
Pattern.MULTILINE);

In 2.14, the pattern includes the following matching for the encoding:
"encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))",

This does not allow for an encoding that starts with a digit; i.e. it
won't match encoding='437'

AFAICT, no supported encodings start with a digit.

The '437' encoding is actually known as 'Cp437':
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html

Try using 'Cp437' as the encoding.

On Tue, 3 Oct 2023 at 20:01, sebb  wrote:
>
> On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
>  wrote:
> >
> > On Tue, Oct 3, 2023 at 1:39 AM sebb  wrote:
> > >
> > > The byte input stream does not carry any encoding information, so the
> > > XmlStreamReader has to guess what encoding was used.
> >
> > Determining what encoding to use when reading XML from a byte stream
> > is the purpose of XmlStreamReader. From its documentation: "Character
> > stream that handles all the necessary Voodoo to figure out the charset
> > encoding of the XML document within the stream."
> >
> > What it's supposed to do in this case is use the "encoding='437'" from
> > the input to determine that the Charset to use when decoding the byte
> > stream is "437" (aka "code page 437").
>
> Sorry, I completely overlooked that.
>
> > -
> > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> > For additional commands, e-mail: user-h...@commons.apache.org
> >

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves
 wrote:
>
> On Tue, Oct 3, 2023 at 1:39 AM sebb  wrote:
> >
> > The byte input stream does not carry any encoding information, so the
> > XmlStreamReader has to guess what encoding was used.
>
> Determining what encoding to use when reading XML from a byte stream
> is the purpose of XmlStreamReader. From its documentation: "Character
> stream that handles all the necessary Voodoo to figure out the charset
> encoding of the XML document within the stream."
>
> What it's supposed to do in this case is use the "encoding='437'" from
> the input to determine that the Charset to use when decoding the byte
> stream is "437" (aka "code page 437").

Sorry, I completely overlooked that.

> -
> To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
> For additional commands, e-mail: user-h...@commons.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Laurence Gonsalves
On Tue, Oct 3, 2023 at 1:39 AM sebb  wrote:
>
> The byte input stream does not carry any encoding information, so the
> XmlStreamReader has to guess what encoding was used.

Determining what encoding to use when reading XML from a byte stream
is the purpose of XmlStreamReader. From its documentation: "Character
stream that handles all the necessary Voodoo to figure out the charset
encoding of the XML document within the stream."

What it's supposed to do in this case is use the "encoding='437'" from
the input to determine that the Charset to use when decoding the byte
stream is "437" (aka "code page 437").

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org



Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread Gary Gregory
Feel free to provide a PR on GitHub where the unit test must fail if main
changes are not applied. You can also provide a PR that only contains a
unit test.

Gary


On Tue, Oct 3, 2023, 4:13 AM Laurence Gonsalves  wrote:

> Hello,
>
> It looks like XmlStreamReader is not correctly handling several encodings
> in Commons IO 2.14.0 that previously worked in version 2.13.0.
>
> Here's a self-contained snippet (Kotlin) that demonstrates the problem:
>
> val xml = "Ç"
>
> val stream = xml.byteInputStream(Charset.forName("437"))
>
> val reader = XmlStreamReader.builder()
> .setInputStream(stream)
> .setLenient(false)
> .get()
>
> reader.readText() shouldBe xml
>
> With 2.13.0 this code works fine, but in 2.14.0 the "Ç" (C-cedilla) becomes
> a "�" (Unicode replacement character).
>
> We're seeing similar issues with all of the other code page encodings we've
> tried (850, 852, 855, 857, 860, 861, 862, 863, 865, and 866).
>


Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?

2023-10-03 Thread sebb
The byte input stream does not carry any encoding information, so the
XmlStreamReader has to guess what encoding was used.

I'm surprised that it ever worked reliably.

On Tue, 3 Oct 2023 at 09:13, Laurence Gonsalves  wrote:
>
> Hello,
>
> It looks like XmlStreamReader is not correctly handling several encodings
> in Commons IO 2.14.0 that previously worked in version 2.13.0.
>
> Here's a self-contained snippet (Kotlin) that demonstrates the problem:
>
> val xml = "Ç"
>
> val stream = xml.byteInputStream(Charset.forName("437"))
>
> val reader = XmlStreamReader.builder()
> .setInputStream(stream)
> .setLenient(false)
> .get()
>
> reader.readText() shouldBe xml
>
> With 2.13.0 this code works fine, but in 2.14.0 the "Ç" (C-cedilla) becomes
> a "�" (Unicode replacement character).
>
> We're seeing similar issues with all of the other code page encodings we've
> tried (850, 852, 855, 857, 860, 861, 862, 863, 865, and 866).

-
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org