System.err prints from XmlRootExtractor
---------------------------------------
Key: TIKA-239
URL: https://issues.apache.org/jira/browse/TIKA-239
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Jukka Zitting
The XmlRootExtractor is often given non-XML files to look at, which causes the
XML parser to fail with error messages. It looks like the default behaviour is
to print some of these error messages to System.err, as shown below:
$ java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.2.0-src.zip > /dev/null
java.io.EOFException
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.endEntity(XMLDTDScannerImpl.java:662)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.endEntity(XMLEntityManager.java:1393)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1763)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipSpaces(XMLEntityScanner.java:1543)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.skipSeparator(XMLDTDScannerImpl.java:2055)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanEntityDecl(XMLDTDScannerImpl.java:1533)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:1986)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDInternalSubset(XMLDTDScannerImpl.java:377)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1089)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:976)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:55)
    at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:219)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:514)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:76)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:83)
    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:65)
    at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:83)
    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:65)
    at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:85)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:116)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)
Tika should never print anything to System.out or System.err unless explicitly
instructed to do so.
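
For illustration only, here is a minimal sketch of a root-element probe that
reports failure through its return value rather than through console output.
It is not Tika's actual XmlRootExtractor code: the class name
QuietRootExtractor and the method extractRoot are hypothetical. It assumes the
standard JAXP SAX API, where the DefaultHandler passed to SAXParser.parse also
acts as the error handler. Whether this alone silences the trace above is
uncertain, since that trace appears to be printed from inside the JDK's bundled
Xerces classes rather than by an error handler.

```java
import java.io.InputStream;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
public class QuietRootExtractor {

    // Records the first element seen and then aborts the parse.
    private static final class RootHandler extends DefaultHandler {
        QName root;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes)
                throws SAXException {
            root = new QName(uri, localName);
            // Only the root element is needed, so stop parsing here.
            throw new SAXException("root element found, aborting parse");
        }

        @Override
        public void warning(SAXParseException e) {
            // Deliberately silent.
        }

        @Override
        public void error(SAXParseException e) {
            // Deliberately silent; fatal errors still abort the parse
            // through the inherited fatalError(), which rethrows.
        }
    }

    // Returns the root element name, or null if the stream is not
    // parseable as XML. Never writes to System.out or System.err itself.
    public static QName extractRoot(InputStream stream) {
        RootHandler handler = new RootHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            SAXParser parser = factory.newSAXParser();
            parser.parse(stream, handler);
        } catch (Exception e) {
            // Expected for non-XML input and for the deliberate abort
            // above; the outcome is carried by handler.root instead.
        }
        return handler.root;
    }
}
```

Calling extractRoot on a stream containing "<root/>" should yield a QName with
local part "root", while a stream of non-XML bytes (such as a zip entry) should
yield null. Swallowing the exception is a deliberate choice in this sketch: a
detector is expected to see non-XML input routinely, so a parse failure is a
normal "not XML" answer rather than something to report.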