Michael Tsikerdekis created XERCESJ-1614:
--------------------------------------------

             Summary: ArrayIndexOutOfBoundsException: 2048 and Invalid byte 2 of 4-byte UTF-8 sequence.
                 Key: XERCESJ-1614
                 URL: https://issues.apache.org/jira/browse/XERCESJ-1614
             Project: Xerces2-J
          Issue Type: Bug
    Affects Versions: 2.11.0, 2.10.0, 2.9.1, 2.9.0, 2.8.1, 2.8.0, 2.7.1, 2.7.0
         Environment: Ubuntu 10.04, openjdk6
            Reporter: Michael Tsikerdekis
            Priority: Critical


Importing Wikipedia dumps with mwdumper fails on several files. This happens 
with more than one dump (I tried the May and January dumps). One file that 
reproduces the problem, about 200 MB in size, is 
enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z, available at 
http://dumps.wikimedia.org/enwiki/20130503/


With the mwdumper build that uses Xerces 2.7.1, the error is the following:
 7za e -so enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z 
|java -server -jar mwdumper-1.16.jar --format=sql:1.5  | gzip -vc > temp.sql.gz

7-Zip (A) 9.04 beta  Copyright (c) 1999-2009 Igor Pavlov  2009-05-30
p7zip Version 9.04 (locale=en_US.ISO-8859-15,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: 
enwiki-20130503-pages-meta-history1.xml-p000006887p000009316.7z

Extracting  enwiki-20130503-pages-meta-history1.xml-p000006887p000009316
3 pages (1.165/sec), 1,000 revs (388.35/sec)
3 pages (0.356/sec), 2,000 revs (237.164/sec)
8 pages (0.677/sec), 3,000 revs (253.807/sec)
13 pages (1.058/sec), 4,000 revs (325.627/sec)
13 pages (0.992/sec), 5,000 revs (381.505/sec)
16 pages (1.169/sec), 6,000 revs (438.436/sec)
16 pages (1.016/sec), 7,000 revs (444.501/sec)
17 pages (0.854/sec), 8,000 revs (401.849/sec)
17 pages (0.695/sec), 9,000 revs (367.752/sec)
18 pages (0.675/sec), 10,000 revs (374.967/sec)
18 pages (0.653/sec), 11,000 revs (399.332/sec)
18 pages (0.626/sec), 12,000 revs (417.043/sec)
18 pages (0.6/sec), 13,000 revs (433.117/sec)
18 pages (0.555/sec), 14,000 revs (431.766/sec)
18 pages (0.499/sec), 15,000 revs (416.17/sec)
19 pages (0.509/sec), 16,000 revs (428.483/sec)
22 pages (0.58/sec), 17,000 revs (448.43/sec)
22 pages (0.571/sec), 18,000 revs (467.302/sec)
23 pages (0.546/sec), 19,000 revs (450.835/sec)
24 pages (0.564/sec), 20,000 revs (469.649/sec)
26 pages (0.587/sec), 21,000 revs (473.912/sec)
28 pages (0.623/sec), 22,000 revs (489.182/sec)
31 pages (0.684/sec), 23,000 revs (507.469/sec)
31 pages (0.647/sec), 24,000 revs (500.584/sec)
33 pages (0.655/sec), 25,000 revs (495.835/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
 77.4%
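
To check whether this failure also occurs without mwdumper in the pipeline, the 
extracted XML can be fed straight to a SAX parser. The following is only a 
minimal sketch under assumptions: the class name SaxRepro and the method 
countElements are made up for illustration, the handler does nothing except 
count start tags, and which parser implementation is used depends on what is on 
the classpath (for example a xercesImpl jar) rather than on anything in the 
code.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of mwdumper or Xerces.
final class SaxRepro {
    /** Parses the stream with a handler that only counts start tags. */
    static long countElements(InputStream in) throws Exception {
        final long[] count = {0};
        // The factory picks whichever JAXP implementation is available.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Reads the document from standard input, like mwdumper does.
        System.err.println(countElements(System.in) + " elements parsed");
    }
}
```

It could be run the same way as the command above, with the 7za output piped 
into "java SaxRepro" instead of mwdumper. If the exception still appears, 
mwdumper itself is less likely to be involved.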

With an mwdumper build that uses Xerces 2.11.0, run against another dump, the 
error is the following (only the final lines are pasted):

$ cat enwiki-20130102-pages-meta-history1.xml-p000004284p000005735 | java 
-server -jar mwdumper-1.16-2.11.0.jar --format=sql:1.5 > temp.sql
289 pages (0.233/sec), 360,000 revs (290.012/sec)
289 pages (0.229/sec), 361,000 revs (286.432/sec)
289 pages (0.226/sec), 362,000 revs (283.608/sec)
289 pages (0.225/sec), 363,000 revs (282.209/sec)
289 pages (0.222/sec), 364,000 revs (280.006/sec)
289 pages (0.22/sec), 365,000 revs (277.282/sec)
Exception in thread "main" java.io.IOException: Invalid byte 2 of 4-byte UTF-8 sequence.
      at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
      at org.mediawiki.dumper.Dumper.main(Dumper.java:142)
Caused by: org.xml.sax.SAXParseException; lineNumber: 128484149; columnNumber: 94; Invalid byte 2 of 4-byte UTF-8 sequence.
      at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
      at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
      at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
      at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
      ... 1 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
      at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
      at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
      ... 11 more
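
Since the message reports an invalid byte inside a 4-byte UTF-8 sequence, it 
may help to check independently whether the input bytes are well-formed UTF-8. 
The sketch below is an assumption-laden illustration, not anything taken from 
Xerces or mwdumper: the class Utf8Check and the method firstInvalidOffset are 
hypothetical names, and it relies on the JDK's own strict UTF-8 decoder 
(java.nio.charset.CharsetDecoder with CodingErrorAction.REPORT) instead of the 
Xerces UTF8Reader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of mwdumper or Xerces.
final class Utf8Check {
    /** Returns the byte offset of the first malformed sequence, or -1 if none. */
    static long firstInvalidOffset(InputStream in) throws IOException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(8192);
        CharBuffer chars = CharBuffer.allocate(8192);
        long consumed = 0;   // bytes successfully decoded so far
        boolean eof = false;
        while (true) {
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            if (n < 0) {
                eof = true;
            } else {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            CoderResult r;
            do {
                chars.clear();                    // decoded text is discarded
                r = dec.decode(bytes, chars, eof);
            } while (r.isOverflow());
            consumed += bytes.position();         // buffer always starts at 0
            if (r.isError()) {
                return consumed;                  // start of the bad sequence
            }
            if (eof) {
                return -1;
            }
            bytes.compact();  // keep a sequence split across two reads
        }
    }

    public static void main(String[] args) throws IOException {
        System.err.println("first invalid offset: " + firstInvalidOffset(System.in));
    }
}
```

Piping the same dump into "java Utf8Check" would print -1 if the JDK decoder 
accepts the whole input. A result of -1 would suggest the data itself is 
well-formed and that the failure arises somewhere in the reading path; a 
non-negative offset would point to the bytes worth inspecting in the dump.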




