[
https://issues.apache.org/jira/browse/XERCESJ-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Glavassevich resolved XERCESJ-1614.
-------------------------------------------
Resolution: Duplicate
This is the same issue as XERCESJ-1257 (which has yet to be resolved).
> ArrayIndexOutOfBoundsException: 2048 and Invalid byte 2 of 4-byte UTF-8
> sequence.
> ----------------------------------------------------------------------------------
>
> Key: XERCESJ-1614
> URL: https://issues.apache.org/jira/browse/XERCESJ-1614
> Project: Xerces2-J
> Issue Type: Bug
> Affects Versions: 2.7.0, 2.7.1, 2.8.0, 2.8.1, 2.9.0, 2.9.1, 2.10.0, 2.11.0
> Environment: Ubuntu 10.04, openjdk6
> Reporter: Michael Tsikerdekis
> Priority: Critical
> Labels: patch
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When importing Wikipedia dump files with mwdumper, the import fails on
> several files. This happens across multiple dumps (I tried the May and
> January dumps). One file that reproduces it, about 200 MB in size, is
> enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z, found at
> http://dumps.wikimedia.org/enwiki/20130503/
> With the mwdumper build that uses Xerces 2.7.1, the error is the following:
> 7za e -so enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z | java -server -jar mwdumper-1.16.jar --format=sql:1.5 | gzip -vc > temp.sql.gz
> 7-Zip (A) 9.04 beta Copyright (c) 1999-2009 Igor Pavlov 2009-05-30
> p7zip Version 9.04 (locale=en_US.ISO-8859-15,Utf16=on,HugeFiles=on,8 CPUs)
> Processing archive:
> enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z
> Extracting enwiki-20130503-pages-meta-history1.xml-p000006887p000009316
> 3 pages (1.165/sec), 1,000 revs (388.35/sec)
> 3 pages (0.356/sec), 2,000 revs (237.164/sec)
> 8 pages (0.677/sec), 3,000 revs (253.807/sec)
> 13 pages (1.058/sec), 4,000 revs (325.627/sec)
> 13 pages (0.992/sec), 5,000 revs (381.505/sec)
> 16 pages (1.169/sec), 6,000 revs (438.436/sec)
> 16 pages (1.016/sec), 7,000 revs (444.501/sec)
> 17 pages (0.854/sec), 8,000 revs (401.849/sec)
> 17 pages (0.695/sec), 9,000 revs (367.752/sec)
> 18 pages (0.675/sec), 10,000 revs (374.967/sec)
> 18 pages (0.653/sec), 11,000 revs (399.332/sec)
> 18 pages (0.626/sec), 12,000 revs (417.043/sec)
> 18 pages (0.6/sec), 13,000 revs (433.117/sec)
> 18 pages (0.555/sec), 14,000 revs (431.766/sec)
> 18 pages (0.499/sec), 15,000 revs (416.17/sec)
> 19 pages (0.509/sec), 16,000 revs (428.483/sec)
> 22 pages (0.58/sec), 17,000 revs (448.43/sec)
> 22 pages (0.571/sec), 18,000 revs (467.302/sec)
> 23 pages (0.546/sec), 19,000 revs (450.835/sec)
> 24 pages (0.564/sec), 20,000 revs (469.649/sec)
> 26 pages (0.587/sec), 21,000 revs (473.912/sec)
> 28 pages (0.623/sec), 22,000 revs (489.182/sec)
> 31 pages (0.684/sec), 23,000 revs (507.469/sec)
> 31 pages (0.647/sec), 24,000 revs (500.584/sec)
> 33 pages (0.655/sec), 25,000 revs (495.835/sec)
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
> at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
> at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
> 77.4%
> With an mwdumper build that uses Xerces 2.11.0, run against a different
> dump, the error is the following (only the final lines are pasted):
> $ cat enwiki-20130102-pages-meta-history1.xml-p000004284p000005735 | java -server -jar mwdumper-1.16-2.11.0.jar --format=sql:1.5 > temp.sql
> 289 pages (0.233/sec), 360,000 revs (290.012/sec)
> 289 pages (0.229/sec), 361,000 revs (286.432/sec)
> 289 pages (0.226/sec), 362,000 revs (283.608/sec)
> 289 pages (0.225/sec), 363,000 revs (282.209/sec)
> 289 pages (0.222/sec), 364,000 revs (280.006/sec)
> 289 pages (0.22/sec), 365,000 revs (277.282/sec)
> Exception in thread "main" java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
> at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
> at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 128484149; columnNumber: 94; Invalid byte 2 of 4-byte UTF-8 sequence.
> at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
> at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
> ... 1 more
> Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
> at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
> at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> ... 11 more
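
The second trace reports "Invalid byte 2 of 4-byte UTF-8 sequence", which on its own does not say whether the dump is malformed or the parser misread valid input. One way to narrow that down is to decode the same byte stream with the JDK's strict UTF-8 decoder, independent of Xerces. The following is only a minimal sketch under that assumption: the class name Utf8Check and the method firstInvalidOffset are hypothetical names introduced here, not part of mwdumper or Xerces, and the JDK decoder's notion of "malformed" is taken as the reference.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of mwdumper or Xerces): checks a byte
// stream for well-formed UTF-8 using only the JDK's own decoder.
final class Utf8Check {

    /**
     * Returns the byte offset at which the first malformed UTF-8 sequence
     * starts, or -1 if the whole stream decodes cleanly.
     */
    static long firstInvalidOffset(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(8192);
        CharBuffer chars = CharBuffer.allocate(8192);
        long consumed = 0;      // bytes decoded successfully so far
        boolean eof = false;
        while (!eof) {
            // Append new input after any bytes left over from the last pass
            // (at most an incomplete trailing multi-byte sequence).
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            if (n < 0) {
                eof = true;
            } else {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            while (true) {
                int start = bytes.position();
                CoderResult result = decoder.decode(bytes, chars, eof);
                consumed += bytes.position() - start;
                chars.clear();  // decoded text is discarded; only validity matters
                if (result.isError()) {
                    // On error the buffer position rests at the start of the bad sequence.
                    return consumed;
                }
                if (result.isUnderflow()) {
                    break;      // need more input (or finished at end of stream)
                }
            }
            bytes.compact();    // keep an incomplete trailing sequence for the next read
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        long offset = firstInvalidOffset(System.in);
        System.out.println(offset < 0
                ? "stream is well-formed UTF-8"
                : "first malformed sequence starts at byte offset " + offset);
    }
}
```

Piping the extracted dump into this class instead of into mwdumper (for example, 7za e -so <file>.7z | java Utf8Check) would show whether the JDK decoder also rejects the input. A clean result would point toward the parser rather than the data; a reported offset would point toward the dump itself.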