[
https://issues.apache.org/jira/browse/XERCESJ-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Berkel updated XERCESJ-1668:
--------------------------------
Description:
There's a bug in the surrogate handling when the reader buffer is exhausted and
only the high surrogate can be written. On the next read the low surrogate gets
added, but the buffer space calculation is off by one.
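For context, here is a small sketch of why a single character can straddle the end of a char buffer. It is not taken from the attached patch or from Xerces; the class name SurrogateFacts and the sample code point U+1F600 are only illustrative assumptions. A code point above U+FFFF takes four bytes in UTF-8 but two Java chars, so a reader with one free slot left can store only the first of them.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: SurrogateFacts is a hypothetical helper, not Xerces code.
class SurrogateFacts {
    /** Number of bytes the given code point occupies when encoded as UTF-8. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** First char (high surrogate) of a supplementary code point. */
    static char highPart(int codePoint) {
        return Character.highSurrogate(codePoint);
    }

    /** Second char (low surrogate) of a supplementary code point. */
    static char lowPart(int codePoint) {
        return Character.lowSurrogate(codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // arbitrary code point outside the Basic Multilingual Plane
        System.out.println(utf8Length(cp) + " UTF-8 bytes, "
                + Character.charCount(cp) + " chars");
        System.out.printf("high=%04X low=%04X%n",
                (int) highPart(cp), (int) lowPart(cp));
    }
}
```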
This gets triggered when parsing the current [enwiktionary dump
file|http://dumps.wikimedia.org/enwiktionary/20151102/enwiktionary-20151102-pages-articles.xml.bz2].
{noformat}
org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid
byte 2 of 4-byte UTF-8 sequence.
{noformat}
In the attached patch I added a fix and a testcase for this bug. Another related
issue is that when the low surrogate is written as the last char of the stream,
-1 is returned instead of 1.
Is UTF8Reader still necessary? It might be safer to just use a plain
InputStreamReader.
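To make the InputStreamReader suggestion concrete, here is a minimal sketch, assuming the input is a well-formed UTF-8 byte array. The class SurrogateBoundaryDemo and its methods are hypothetical names invented for this illustration and are not part of Xerces or of the attached patch. It reads through a deliberately tiny char buffer, so a surrogate pair has to cross a read boundary, and checks that the text still round-trips.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical harness, not Xerces code.
class SurrogateBoundaryDemo {
    /** Decodes UTF-8 bytes with a plain InputStreamReader, chunkSize chars per read. */
    static String decode(byte[] utf8, int chunkSize) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int n;
            // read() returns the number of chars stored, or -1 only at end of stream
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** True if the text survives encode + chunked decode unchanged. */
    static boolean roundTrips(String text, int chunkSize) {
        return decode(text.getBytes(StandardCharsets.UTF_8), chunkSize).equals(text);
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600: with small buffers the pair cannot fit
        // next to the 'a' in a single read.
        String text = "a\uD83D\uDE00";
        for (int chunk = 1; chunk <= 3; chunk++) {
            System.out.println("chunk=" + chunk + " ok=" + roundTrips(text, chunk));
        }
    }
}
```

The same harness could be pointed at any other Reader implementation to compare behaviour at buffer boundaries and at end of stream.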
was:
There's a bug in the surrogate handling when the reader buffer is exhausted and
only the high-part can be written. On the next run the low-part gets added but
the buffer space calculation is off by one.
This gets triggered when parsing the current [enwiktionary dump
file|http://dumps.wikimedia.org/enwiktionary/20151102/enwiktionary-20151102-pages-articles.xml.bz2].
{noformat}
org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid
byte 2 of 4-byte UTF-8 sequence.
{noformat}
In the attached patch I added a fix + testcase for this bug. Another related
issue is that when the low-part is written as last part of the stream -1 is
returned instead of 1.
Is UTF8 reader still necessary? It might be safer to just use a plain
InputStreamReader.
> Off-by-one bug w/ surrogates in UTF8Reader
> ------------------------------------------
>
> Key: XERCESJ-1668
> URL: https://issues.apache.org/jira/browse/XERCESJ-1668
> Project: Xerces2-J
> Issue Type: Bug
> Components: Other
> Reporter: Jan Berkel
> Attachments: surrogate.patch