[ 
https://issues.apache.org/jira/browse/TIKA-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3839.
-------------------------------
    Fix Version/s: 2.4.2
       Resolution: Fixed

> Property com.ctc.wstx.maxEntityCount is not supported
> -----------------------------------------------------
>
>                 Key: TIKA-3839
>                 URL: https://issues.apache.org/jira/browse/TIKA-3839
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Lakatos Gyula
>            Priority: Minor
>             Fix For: 2.4.2
>
>         Attachments: 8a4b2154-b6c1-4e0e-b8be-8ce4e68c454f.pdf
>
>
> First of all, this might not even be a bug, just a slight annoyance.
> Whenever I try to parse the attached PDF, I get the following error:
>  
> {code:java}
> [main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
> java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
>     at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
>     at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
>     at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
>     at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
>     at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
> {code}
>  
> After a couple of hours of Googling, I realized that there is an XML parser 
> implementation called Woodstox. If I put that dependency on the classpath, 
> the exception no longer appears, because Woodstox understands the 
> _com.ctc.wstx.maxEntityCount_ property.
> As far as I can see, version 1.28.4 of tika-parsers included this library 
> as a compile-time dependency, but 2.0.0+ doesn't. I'm not sure why that is, 
> but there must be a good reason for it.
> However, I think it would be a good idea to change the exception's message 
> from:
> {code:java}
> SAX Security Manager could not be setup [log suppressed for 5 minutes] {code}
> to something more meaningful.
> Something that mentions Woodstox would be good (especially if the only 
> property that Tika tries to set is Woodstox-specific). Also, printing the 
> message every 5 minutes is pointless in my opinion. If Woodstox is not on 
> the classpath, it will fail every time anyway. The log suppression is also 
> not synchronized, so if you are parsing a lot of documents in parallel, the 
> message can still be printed more than once.
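
The two points raised in the description (a Woodstox-specific property that the JDK's built-in StAX factory rejects, and a warning that can be printed more than once under parallel parsing) can be illustrated with a minimal sketch. This is not Tika's actual code or the fix shipped in 2.4.2. The class name StaxEntityLimit, the helper names, the warning text, and the limit value are hypothetical and chosen only for illustration. The property name is the one from the stack trace above. The sketch asks the factory whether it supports the property before setting it, and uses an AtomicBoolean so the warning is emitted at most once per JVM.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper, not part of Tika.
final class StaxEntityLimit {

    // Woodstox-specific property name, as it appears in the reported stack trace.
    static final String WSTX_MAX_ENTITY_COUNT = "com.ctc.wstx.maxEntityCount";

    // Flipped from false to true exactly once, even if many threads race on it.
    private static final AtomicBoolean WARNED = new AtomicBoolean(false);

    /**
     * Tries to apply an entity-count limit to the given factory.
     * Returns true if the property was set, false if the factory
     * does not support it or rejects the value.
     */
    static boolean trySetMaxEntityCount(XMLInputFactory factory, int limit) {
        if (!factory.isPropertySupported(WSTX_MAX_ENTITY_COUNT)) {
            warnOnce("StAX factory " + factory.getClass().getName()
                    + " does not support " + WSTX_MAX_ENTITY_COUNT
                    + "; is Woodstox on the classpath?");
            return false;
        }
        try {
            factory.setProperty(WSTX_MAX_ENTITY_COUNT, limit);
            return true;
        } catch (IllegalArgumentException e) {
            warnOnce("Could not set " + WSTX_MAX_ENTITY_COUNT + ": " + e.getMessage());
            return false;
        }
    }

    /** Prints the message only on the first call; returns whether it was printed. */
    static boolean warnOnce(String message) {
        if (WARNED.compareAndSet(false, true)) {
            System.err.println("WARN " + message);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // 100_000 is an arbitrary illustrative limit.
        boolean applied = trySetMaxEntityCount(factory, 100_000);
        System.out.println("entity limit applied: " + applied);
    }
}
```

Checking isPropertySupported first avoids the IllegalArgumentException entirely when only the JDK's built-in implementation is available, and compareAndSet guarantees a single winner when several parser threads reach the warning at the same time.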



