Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
On Tue, 3 Oct 2023 at 22:17, Laurence Gonsalves wrote: > > On Tue, Oct 3, 2023 at 1:50 PM sebb wrote: > > > Given this inconsistency, and the fact that there are XML documents "in > > > the > > > wild" that use these encoding names, would it be reasonable to relax the > > > regex > > > just enough so that it'll work with these other names and aliases? > > > > I would say yes. > > > > I'm currently working on a test to check that all charsets *and > > aliases* can be matched, and on a fix to the RE. Test and fix have been committed. > Excellent! Thank you! Thank you for raising the issue. > - > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > For additional commands, e-mail: user-h...@commons.apache.org > - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
On Tue, Oct 3, 2023 at 1:50 PM sebb wrote: > > Given this inconsistency, and the fact that there are XML documents "in the > > wild" that use these encoding names, would it be reasonable to relax the > > regex > > just enough so that it'll work with these other names and aliases? > > I would say yes. > > I'm currently working on a test to check that all charsets *and > aliases* can be matched, and on a fix to the RE. Excellent! Thank you! - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
Created https://issues.apache.org/jira/browse/IO-815 On Tue, 3 Oct 2023 at 21:49, sebb wrote: > > On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves > wrote: > > > > Thank you. My git bisect just found this change too. :-) > > > > We are processing documents that we have no control over, and some may use > > these numeric encodings, so we can't update the documents. > > > > Looking at the XML spec > > (https: //www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say... > > > >EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* > > > > ...and I assume that that's why the regex in XmlStreamReader was changed. > > That would explain the somewhat convoluted RE. > > > However, the same part of the XML spec also says... > > > > > It is recommended that character encodings registered (as charsets) with > > > the > > > Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just > > > listed, be referred to using their registered names; other encodings > > > should > > > use names starting with an "x-" prefix. > > > > ...and "IANA-CHARSETS" links to > > http://www.iana.org/assignments/character-sets/character-sets.xhtml > > which has a table of character sets. In that table there are three columns > > of > > possible codes for referring to charsets: "Preferred MIME", "Name", and > > "Aliases". The "Aliases" column includes "437" (that's also where "cp437" > > can > > be found). > > > > This seems like a bit of an inconsistency in the XML spec, as there are a > > number of "Name" and "Alias" values that don't quite match the definition of > > EncName. > > Indeed. > There are also examples such as ISO_8859-1:1987 and ebcdic-de-273+euro > > > Given this inconsistency, and the fact that there are XML documents "in the > > wild" that use these encoding names, would it be reasonable to relax the > > regex > > just enough so that it'll work with these other names and aliases? > > I would say yes. > > I'm currently working on a test to check that all charsets *and > aliases* can be matched, and on a fix to the RE. > > > > > On Tue, Oct 3, 2023 at 12:24 PM sebb wrote: > > > > > > Just had another look at the class: in 2.13, the regex for matching > > > the encoding string was > > > Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))", > > > Pattern.MULTILINE); > > > > > > In 2.14, the pattern includes the following matching for the encoding: > > > "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))", > > > > > > This does not allow for an encoding that starts with a digit; i.e. it > > > won't match encoding='437' > > > > > > AFAICT, no supported encodings start with a digit. > > > > > > The '437' encoding is actually known as 'Cp437': > > > https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html > > > https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html > > > > > > Try using 'Cp437' as the encoding. > > > > > > On Tue, 3 Oct 2023 at 20:01, sebb wrote: > > > > > > > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves > > > > wrote: > > > > > > > > > > On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > > > > > > > > > > > The byte input stream does not carry any encoding information, so > > > > > > the > > > > > > XmlStreamReader has to guess what encoding was used. > > > > > > > > > > Determining what encoding to use when reading XML from a byte stream > > > > > is the purpose of XmlStreamReader. From its documentation: "Character > > > > > stream that handles all the necessary Voodoo to figure out the charset > > > > > encoding of the XML document within the stream." > > > > > > > > > > What it's supposed to do in this case is use the "encoding='437'" from > > > > > the input to determine that the Charset to use when decoding the byte > > > > > stream is "437" (aka "code page 437"). > > > > > > > > Sorry, I completely overlooked that. > > > > > > > > > - > > > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > > > > For additional commands, e-mail: user-h...@commons.apache.org > > > > > > > > > > > - > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > > For additional commands, e-mail: user-h...@commons.apache.org > > > > > > > - > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > For additional commands, e-mail: user-h...@commons.apache.org > > - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
On Tue, 3 Oct 2023 at 21:35, Laurence Gonsalves wrote: > > Thank you. My git bisect just found this change too. :-) > > We are processing documents that we have no control over, and some may use > these numeric encodings, so we can't update the documents. > > Looking at the XML spec > (https: //www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say... > >EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* > > ...and I assume that that's why the regex in XmlStreamReader was changed. That would explain the somewhat convoluted RE. > However, the same part of the XML spec also says... > > > It is recommended that character encodings registered (as charsets) with the > > Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just > > listed, be referred to using their registered names; other encodings should > > use names starting with an "x-" prefix. > > ...and "IANA-CHARSETS" links to > http://www.iana.org/assignments/character-sets/character-sets.xhtml > which has a table of character sets. In that table there are three columns of > possible codes for referring to charsets: "Preferred MIME", "Name", and > "Aliases". The "Aliases" column includes "437" (that's also where "cp437" can > be found). > > This seems like a bit of an inconsistency in the XML spec, as there are a > number of "Name" and "Alias" values that don't quite match the definition of > EncName. Indeed. There are also examples such as ISO_8859-1:1987 and ebcdic-de-273+euro > Given this inconsistency, and the fact that there are XML documents "in the > wild" that use these encoding names, would it be reasonable to relax the regex > just enough so that it'll work with these other names and aliases? I would say yes. I'm currently working on a test to check that all charsets *and aliases* can be matched, and on a fix to the RE. > > On Tue, Oct 3, 2023 at 12:24 PM sebb wrote: > > > > Just had another look at the class: in 2.13, the regex for matching > > the encoding string was > > Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))", > > Pattern.MULTILINE); > > > > In 2.14, the pattern includes the following matching for the encoding: > > "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))", > > > > This does not allow for an encoding that starts with a digit; i.e. it > > won't match encoding='437' > > > > AFAICT, no supported encodings start with a digit. > > > > The '437' encoding is actually known as 'Cp437': > > https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html > > https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html > > > > Try using 'Cp437' as the encoding. > > > > On Tue, 3 Oct 2023 at 20:01, sebb wrote: > > > > > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves > > > wrote: > > > > > > > > On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > > > > > > > > > The byte input stream does not carry any encoding information, so the > > > > > XmlStreamReader has to guess what encoding was used. > > > > > > > > Determining what encoding to use when reading XML from a byte stream > > > > is the purpose of XmlStreamReader. From its documentation: "Character > > > > stream that handles all the necessary Voodoo to figure out the charset > > > > encoding of the XML document within the stream." > > > > > > > > What it's supposed to do in this case is use the "encoding='437'" from > > > > the input to determine that the Charset to use when decoding the byte > > > > stream is "437" (aka "code page 437"). > > > > > > Sorry, I completely overlooked that. > > > > > > > - > > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > > > For additional commands, e-mail: user-h...@commons.apache.org > > > > > > > > - > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > For additional commands, e-mail: user-h...@commons.apache.org > > > > - > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > For additional commands, e-mail: user-h...@commons.apache.org > - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
Thank you. My git bisect just found this change too. :-) We are processing documents that we have no control over, and some may use these numeric encodings, so we can't update the documents. Looking at the XML spec (https: //www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName), it does say... EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* ...and I assume that that's why the regex in XmlStreamReader was changed. However, the same part of the XML spec also says... > It is recommended that character encodings registered (as charsets) with the > Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just > listed, be referred to using their registered names; other encodings should > use names starting with an "x-" prefix. ...and "IANA-CHARSETS" links to http://www.iana.org/assignments/character-sets/character-sets.xhtml which has a table of character sets. In that table there are three columns of possible codes for referring to charsets: "Preferred MIME", "Name", and "Aliases". The "Aliases" column includes "437" (that's also where "cp437" can be found). This seems like a bit of an inconsistency in the XML spec, as there are a number of "Name" and "Alias" values that don't quite match the definition of EncName. Given this inconsistency, and the fact that there are XML documents "in the wild" that use these encoding names, would it be reasonable to relax the regex just enough so that it'll work with these other names and aliases? On Tue, Oct 3, 2023 at 12:24 PM sebb wrote: > > Just had another look at the class: in 2.13, the regex for matching > the encoding string was > Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))", > Pattern.MULTILINE); > > In 2.14, the pattern includes the following matching for the encoding: > "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))", > > This does not allow for an encoding that starts with a digit; i.e. it > won't match encoding='437' > > AFAICT, no supported encodings start with a digit. > > The '437' encoding is actually known as 'Cp437': > https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html > https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html > > Try using 'Cp437' as the encoding. > > On Tue, 3 Oct 2023 at 20:01, sebb wrote: > > > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves > > wrote: > > > > > > On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > > > > > > > The byte input stream does not carry any encoding information, so the > > > > XmlStreamReader has to guess what encoding was used. > > > > > > Determining what encoding to use when reading XML from a byte stream > > > is the purpose of XmlStreamReader. From its documentation: "Character > > > stream that handles all the necessary Voodoo to figure out the charset > > > encoding of the XML document within the stream." > > > > > > What it's supposed to do in this case is use the "encoding='437'" from > > > the input to determine that the Charset to use when decoding the byte > > > stream is "437" (aka "code page 437"). > > > > Sorry, I completely overlooked that. > > > > > - > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > > For additional commands, e-mail: user-h...@commons.apache.org > > > > > - > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > For additional commands, e-mail: user-h...@commons.apache.org > - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
Although the pages I linked don't mention them, it turns out that there is actually an alias '437', also many other numeric ones. Indeed there are other aliases that start with a letter but otherwise don't match the RE. e.g. ISO_8859-1:1987 So it seems the updated RE is indeed too restrictive. Sorry for the confusion. On Tue, 3 Oct 2023 at 20:22, sebb wrote: > > Just had another look at the class: in 2.13, the regex for matching > the encoding string was > Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))", > Pattern.MULTILINE); > > In 2.14, the pattern includes the following matching for the encoding: > "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))", > > This does not allow for an encoding that starts with a digit; i.e. it > won't match encoding='437' > > AFAICT, no supported encodings start with a digit. > > The '437' encoding is actually known as 'Cp437': > https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html > https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html > > Try using 'Cp437' as the encoding. > > On Tue, 3 Oct 2023 at 20:01, sebb wrote: > > > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves > > wrote: > > > > > > On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > > > > > > > The byte input stream does not carry any encoding information, so the > > > > XmlStreamReader has to guess what encoding was used. > > > > > > Determining what encoding to use when reading XML from a byte stream > > > is the purpose of XmlStreamReader. From its documentation: "Character > > > stream that handles all the necessary Voodoo to figure out the charset > > > encoding of the XML document within the stream." > > > > > > What it's supposed to do in this case is use the "encoding='437'" from > > > the input to determine that the Charset to use when decoding the byte > > > stream is "437" (aka "code page 437"). > > > > Sorry, I completely overlooked that. > > > > > - > > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > > For additional commands, e-mail: user-h...@commons.apache.org > > > - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
Just had another look at the class: in 2.13, the regex for matching the encoding string was Pattern.compile("<\\?xml.*encoding[\\s]*=[\\s]*((?:\".[^\"]*\")|(?:'.[^']*'))", Pattern.MULTILINE); In 2.14, the pattern includes the following matching for the encoding: "encoding\\s*=\\s*((?:\"[A-Za-z]([A-Za-z0-9\\._]|-)*\")|(?:'[A-Za-z]([A-Za-z0-9._]|-)*'))", This does not allow for an encoding that starts with a digit; i.e. it won't match encoding='437' AFAICT, no supported encodings start with a digit. The '437' encoding is actually known as 'Cp437': https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html https://docs.oracle.com/en/java/javase/17/intl/supported-encodings.html Try using 'Cp437' as the encoding. On Tue, 3 Oct 2023 at 20:01, sebb wrote: > > On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves > wrote: > > > > On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > > > > > The byte input stream does not carry any encoding information, so the > > > XmlStreamReader has to guess what encoding was used. > > > > Determining what encoding to use when reading XML from a byte stream > > is the purpose of XmlStreamReader. From its documentation: "Character > > stream that handles all the necessary Voodoo to figure out the charset > > encoding of the XML document within the stream." > > > > What it's supposed to do in this case is use the "encoding='437'" from > > the input to determine that the Charset to use when decoding the byte > > stream is "437" (aka "code page 437"). > > Sorry, I completely overlooked that. > > > - > > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > > For additional commands, e-mail: user-h...@commons.apache.org > > - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
On Tue, 3 Oct 2023 at 18:05, Laurence Gonsalves wrote: > > On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > > > The byte input stream does not carry any encoding information, so the > > XmlStreamReader has to guess what encoding was used. > > Determining what encoding to use when reading XML from a byte stream > is the purpose of XmlStreamReader. From its documentation: "Character > stream that handles all the necessary Voodoo to figure out the charset > encoding of the XML document within the stream." > > What it's supposed to do in this case is use the "encoding='437'" from > the input to determine that the Charset to use when decoding the byte > stream is "437" (aka "code page 437"). Sorry, I completely overlooked that. > - > To unsubscribe, e-mail: user-unsubscr...@commons.apache.org > For additional commands, e-mail: user-h...@commons.apache.org > - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
On Tue, Oct 3, 2023 at 1:39 AM sebb wrote: > > The byte input stream does not carry any encoding information, so the > XmlStreamReader has to guess what encoding was used. Determining what encoding to use when reading XML from a byte stream is the purpose of XmlStreamReader. From its documentation: "Character stream that handles all the necessary Voodoo to figure out the charset encoding of the XML document within the stream." What it's supposed to do in this case is use the "encoding='437'" from the input to determine that the Charset to use when decoding the byte stream is "437" (aka "code page 437"). - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
Feel free to provide a PR on GitHub where the unit test must fail if main changes are not applied. You can also provide a PR that only contains a unit test. Gary On Tue, Oct 3, 2023, 4:13 AM Laurence Gonsalves wrote: > Hello, > > It looks like XmlStreamReader is not correctly handling several encodings > in Commons IO 2.14.0 that previously worked in version 2.13.0. > > Here's a self-contained snippet (Kotlin) that demonstrates the problem: > > val xml = "Ç" > > val stream = xml.byteInputStream(Charset.forName("437")) > > val reader = XmlStreamReader.builder() > .setInputStream(stream) > .setLenient(false) > .get() > > reader.readText() shouldBe xml > > With 2.13.0 this code works fine, but in 2.14.0 the "Ç" (C-cedilla) becomes > a "�" (Unicode replacement character). > > We're seeing similar issues with all of the other code page encodings we've > tried (850, 852, 855, 857, 860, 861, 862, 863, 865, and 866). >
Re: [io] Encoding bug in XmlStreamReader in Commons IO 2.14.0?
The byte input stream does not carry any encoding information, so the XmlStreamReader has to guess what encoding was used. I'm surprised that it ever worked reliably. On Tue, 3 Oct 2023 at 09:13, Laurence Gonsalves wrote: > > Hello, > > It looks like XmlStreamReader is not correctly handling several encodings > in Commons IO 2.14.0 that previously worked in version 2.13.0. > > Here's a self-contained snippet (Kotlin) that demonstrates the problem: > > val xml = "Ç" > > val stream = xml.byteInputStream(Charset.forName("437")) > > val reader = XmlStreamReader.builder() > .setInputStream(stream) > .setLenient(false) > .get() > > reader.readText() shouldBe xml > > With 2.13.0 this code works fine, but in 2.14.0 the "Ç" (C-cedilla) becomes > a "�" (Unicode replacement character). > > We're seeing similar issues with all of the other code page encodings we've > tried (850, 852, 855, 857, 860, 861, 862, 863, 865, and 866). - To unsubscribe, e-mail: user-unsubscr...@commons.apache.org For additional commands, e-mail: user-h...@commons.apache.org