[
https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241487#comment-13241487
]
Michael McCandless commented on XERCESJ-1257:
---------------------------------------------
Note that this (more recent) Wikipedia export also hits this bug:
enwiki-20110115-pages-articles.xml.bz2
We are still struggling with this nasty Xerces UTF8 bug in Lucene, this time
because we (Lucene committers) want/need to stop shipping the custom Xerces
Java JAR (compiled with the patch on this issue) in our Lucene source
releases.
At first, we explored ant automation, to pull the Xerces Java 2.9.1 source
release, apply the patch here, and build the custom JAR... that seems to work
but:
In LUCENE-3937 we found a new approach: we can instead work around this bug by
using the JVM, not Xerces, to (correctly) decode UTF8, by passing a Reader
instead of an InputStream to Xerces (I now see that this was already suggested
by Michael as a workaround: doh!).
Then we can use the stock (but buggy) Xerces releases... no patches / custom
Xerces JARs needed in Lucene.
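
A minimal sketch of the Reader workaround, assuming a plain JAXP SAX parse.
The class name ReaderWorkaround and the helper parseText are hypothetical and
are not taken from LUCENE-3937. Which parser SAXParserFactory returns depends
on the classpath. The point is only that the JVM's InputStreamReader does the
UTF-8 decoding, so the parser never sees raw bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderWorkaround {

    /** Collects character data so the caller can inspect what the parser delivered. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical helper: parses UTF-8 XML bytes and returns the concatenated text content. */
    static String parseText(InputStream in) {
        try {
            // The JVM decodes UTF-8 here; the parser receives chars, not bytes.
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(reader), handler);
            return handler.text.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the BMP and takes 4 bytes in UTF-8.
        String xml = "<doc>a\uD834\uDD1Eb</doc>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String text = parseText(new ByteArrayInputStream(bytes));
        System.out.println(text.codePointCount(0, text.length()) + " code points");
    }
}
```

Because the InputSource carries a character stream, the parser has no byte
stream to decode and so does not need its own UTF-8 reader.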
Still, it would be best if the Xerces committers could commit the current patch
(if there are no problems with it) and finally resolve this longstanding issue.
Or maybe disable Xerces's custom UTF8 decoding (just use the JVM's)?
> buffer overflow in UTF8Reader for characters out of BMP
> -------------------------------------------------------
>
> Key: XERCESJ-1257
> URL: https://issues.apache.org/jira/browse/XERCESJ-1257
> Project: Xerces2-J
> Issue Type: Bug
> Components: JAXP (javax.xml.parsers)
> Affects Versions: 2.9.0
> Environment: Any
> Reporter: Robert Stojnic
> Assignee: Michael Glavassevich
> Priority: Minor
> Attachments: TestXerces.java, UTF8Reader.patch
>
>
> There is an ArrayIndexOutOfBoundsException in org.apache.xerces.impl.io.UTF8Reader,
> in read(char[],int,int) for 4-byte UTF-8 chars.
> Imagine the following scenario. read() has a buffer of size N, and it reads N-1
> ASCII chars, and stores them in the output buffer. Let the Nth char be the
> first byte of a 4-byte UTF-8 char. The other 3 bytes are fetched by invoking
> read() on the input stream. From these a surrogate pair of Java chars is
> made; however, the method does not check whether both chars can fit into the
> output buffer ... In most cases, they would fit into the output buffer (e.g. if
> there are some other multi-byte chars in the fetched text), so the bug is very
> rare, but it still happens.
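
The scenario in the description can be illustrated with a small sketch of the
missing check. This is not the Xerces code and not the attached
UTF8Reader.patch. The class SurrogateFitCheck and the method appendCodePoint
are hypothetical names, and returning -1 to signal "no room" is an assumption
made for the example. The idea is that a code point outside the BMP needs two
Java chars, so the writer must confirm that two slots remain before storing it:

```java
public class SurrogateFitCheck {

    /**
     * Hypothetical helper: writes codePoint into out at pos.
     * Returns the new position, or -1 if the char(s) would not fit.
     */
    static int appendCodePoint(int codePoint, char[] out, int pos) {
        // 1 for BMP code points, 2 for supplementary ones (a surrogate pair).
        int needed = Character.charCount(codePoint);
        if (out.length - pos < needed) {
            return -1; // caller has to hold the code point back for the next read
        }
        Character.toChars(codePoint, out, pos);
        return pos + needed;
    }

    public static void main(String[] args) {
        char[] out = new char[4];
        int pos = 0;
        // Fill N-1 slots with ASCII, as in the scenario described above.
        for (char c : new char[] {'a', 'b', 'c'}) {
            pos = appendCodePoint(c, out, pos);
        }
        // Only one slot is left, but U+1D11E needs a surrogate pair.
        int result = appendCodePoint(0x1D11E, out, pos);
        System.out.println(result == -1 ? "no room for surrogate pair" : "written");
    }
}
```

Without the length check, the second char of the pair would be stored one
position past the end of the buffer, which is the overflow reported here.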