[
https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110954#comment-15110954
]
Ian Upright commented on XERCESJ-1257:
--------------------------------------
For those using mwdumper to load Wikipedia or other dumps and hitting this
issue, the change below seemed to fix it (it also serves as an example of how
to work around the bug). It wraps the input stream in an InputStreamReader, so
the JVM decodes the UTF-8 and the parser receives characters rather than raw
bytes. However, it would be good to have the real issue addressed. I would
vote to modify Xerces to simply use the JVM to decode UTF-8, as Michael
suggested.
    public void readDump() throws IOException {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            // Hand the parser a Reader so the JVM, not Xerces, decodes UTF-8.
            Reader reader = new InputStreamReader(input, "UTF-8");
            InputSource is = new InputSource(reader);
            is.setEncoding("UTF-8");
            parser.parse(is, this);
        } catch (ParserConfigurationException e) {
            throw (IOException) new IOException(e.getMessage()).initCause(e);
        } catch (SAXException e) {
            throw (IOException) new IOException(e.getMessage()).initCause(e);
        }
        writer.close();
    }
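
For anyone who wants to try the workaround outside mwdumper, here is a
minimal standalone sketch. The class name ReaderWorkaroundDemo, the helper
parseText, and the sample document are made up for illustration and are not
part of mwdumper or Xerces. It only shows the Reader-based parse succeeding
on a character outside the BMP (U+1D11E); it does not try to reproduce the
overflow itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderWorkaroundDemo {

    // Hypothetical helper: parses the stream through a JVM-decoded Reader
    // and returns all character data reported by the parser.
    static String parseText(InputStream in) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The JVM decodes UTF-8 here; the parser only ever sees chars.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        parser.parse(new InputSource(reader), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1D11E (musical G clef) takes 4 bytes in UTF-8 and 2 Java chars.
        String xml = "<page>a\uD834\uDD1Eb</page>";
        InputStream in =
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        String text = parseText(in);
        System.out.println(text.length());                          // 4
        System.out.println(text.codePointCount(0, text.length()));  // 3
    }
}
```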
> buffer overflow in UTF8Reader for characters out of BMP
> -------------------------------------------------------
>
> Key: XERCESJ-1257
> URL: https://issues.apache.org/jira/browse/XERCESJ-1257
> Project: Xerces2-J
> Issue Type: Bug
> Components: JAXP (javax.xml.parsers)
> Affects Versions: 2.9.0
> Environment: Any
> Reporter: Robert Stojnic
> Assignee: Michael Glavassevich
> Priority: Minor
> Attachments: TestXerces.java, UTF8Reader.patch,
> XERCESJ-1257_tests.patch
>
>
> There is an ArrayIndexOutOfBoundsException in
> org.apache.xerces.impl.io.UTF8Reader, in read(char[],int,int), for 4-byte
> UTF-8 chars.
> Imagine the following scenario. read() has a buffer of size N, and it reads
> N-1 ASCII chars and stores them in the output buffer. Let the Nth char be the
> first byte of a 4-byte UTF-8 char. The other 3 bytes are fetched by invoking
> read() on the input stream. From these a surrogate pair of Java chars is
> made; however, the method does not check whether both chars can fit into the
> output buffer ... In most cases they would fit into the output buffer (e.g.
> if there are some other multi-byte chars in the fetched text), so the bug is
> very rare, but it still happens.
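
To make the scenario in the description concrete, here is a small sketch of
the arithmetic involved. It is not the Xerces code or the attached patch;
SurrogatePairDemo and putCodePoint are hypothetical names, and the bounds
check is just one way such a guard could look.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {

    // Hypothetical helper: stores the code point at out[offset..] only if
    // every char it needs fits. Returns the number of chars written, or 0
    // when there is not enough room left in the buffer.
    static int putCodePoint(int codePoint, char[] out, int offset) {
        int needed = Character.charCount(codePoint); // 1 for BMP, else 2
        if (offset + needed > out.length) {
            return 0; // the check the report says is missing
        }
        return Character.toChars(codePoint, out, offset);
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // a character outside the BMP
        byte[] utf8 = new String(Character.toChars(clef))
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);               // 4 bytes in UTF-8
        System.out.println(Character.charCount(clef)); // 2 Java chars

        // Buffer of size N = 4 with N-1 ASCII chars already stored.
        char[] out = {'a', 'b', 'c', '\0'};
        // Only one slot is left, but the surrogate pair needs two.
        System.out.println(putCodePoint(clef, out, 3)); // 0
        // With two slots left the pair fits.
        System.out.println(putCodePoint(clef, out, 2)); // 2
    }
}
```

Without the length check, writing the low surrogate at index N would be the
out-of-bounds store described above.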