Lakatos Gyula created TIKA-3839:
-----------------------------------

             Summary: Property com.ctc.wstx.maxEntityCount is not supported
                 Key: TIKA-3839
                 URL: https://issues.apache.org/jira/browse/TIKA-3839
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.4.1
            Reporter: Lakatos Gyula
         Attachments: 8a4b2154-b6c1-4e0e-b8be-8ce4e68c454f.pdf

First of all, this might not even be a bug, just a slight annoyance. Whenever I try to parse the attached PDF, I get the following warning:

{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
	at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
	at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
	at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
	at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
	at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
{code}

After a couple of hours of Googling, I realized that there is an XML parser implementation called woodstox. If I put that dependency on the classpath, the exception no longer appears, because woodstox understands the _com.ctc.wstx.maxEntityCount_ property, whereas the JDK's built-in XMLInputFactoryImpl does not.

As far as I can see, version 1.28.4 of tika-parsers included this library as a compile-time dependency, but 2.0.0+ doesn't. I'm not sure why this is the case, but there must be a good reason for it.

However, I think it would be a good idea to change the log message from:

{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes]
{code}

to something more meaningful. Something that mentions woodstox would be good (especially if the only property that Tika tries to set is woodstox specific).

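To illustrate the kind of behavior suggested here, the following is a minimal sketch and not Tika's actual code. The class `StaxPropertyCheck`, the method `trySetEntityLimit`, and the wording of the message are hypothetical; only the property name comes from the stack trace above. It asks the factory whether it supports the property before setting it, and prints a single warning that names woodstox when it does not. It uses `XMLInputFactory.newDefaultFactory()` (Java 9+) in `main` so that the JDK's built-in implementation is exercised.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import javax.xml.stream.XMLInputFactory;

public class StaxPropertyCheck {
    // Property name taken from the stack trace in this report.
    static final String WSTX_MAX_ENTITY_COUNT = "com.ctc.wstx.maxEntityCount";

    // Ensures the warning is printed only once per JVM (illustrative choice).
    private static final AtomicBoolean WARNED = new AtomicBoolean(false);

    /**
     * Hypothetical helper: tries to apply an entity-count limit and
     * returns true only if the factory accepted the property.
     */
    static boolean trySetEntityLimit(XMLInputFactory factory, int maxEntityCount) {
        if (factory.isPropertySupported(WSTX_MAX_ENTITY_COUNT)) {
            try {
                factory.setProperty(WSTX_MAX_ENTITY_COUNT, maxEntityCount);
                return true;
            } catch (IllegalArgumentException e) {
                // Fall through to the warning below.
            }
        }
        if (WARNED.compareAndSet(false, true)) {
            System.err.println("StAX factory " + factory.getClass().getName()
                    + " does not support " + WSTX_MAX_ENTITY_COUNT
                    + "; is woodstox on the classpath?");
        }
        return false;
    }

    public static void main(String[] args) {
        // newDefaultFactory() returns the JDK's built-in implementation.
        XMLInputFactory factory = XMLInputFactory.newDefaultFactory();
        System.out.println("limit applied: " + trySetEntityLimit(factory, 100_000));
    }
}
```

With a factory that does support the property, the helper would return `true` and print nothing.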
Also, printing the message every 5 minutes is pointless in my opinion. If woodstox is not on the classpath, it will fail every time anyway.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)