Hi all,

Just a heads-up that we tracked down a serious issue we hit while parsing about 100M docs.

A handful of these documents caused Tika's parsing to hang. We've got a FutureTask that we use to detect and (try to) terminate hung parses.
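
For anyone unfamiliar with the pattern, here is a minimal sketch of that kind of FutureTask watchdog. It is an illustration only, not the actual bixo code: the ParseWatchdog class, the runWithTimeout method, the "parse-worker" thread name, and the choice to return null on timeout are all assumptions made for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or bixo.
final class ParseWatchdog {

    /**
     * Runs the task on its own thread and waits up to timeoutMs for a result.
     * Returns null if the task times out or the caller is interrupted.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMs) {
        FutureTask<T> future = new FutureTask<>(task);
        Thread worker = new Thread(future, "parse-worker");
        worker.setDaemon(true); // a stuck worker should not keep the JVM alive
        worker.start();
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: this only sets the worker's interrupt flag.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The task itself threw; surface the underlying cause.
            throw new RuntimeException("task failed", e.getCause());
        }
    }
}
```

Note that cancel(true) only interrupts the worker thread. Code that never checks the interrupt flag and never blocks on an interruptible call keeps running after the timeout, which is why a watchdog like this can only try to terminate a hung parse.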

But for some of these parse attempts, we'd get lingering threads that were chewing up huge amounts of CPU. Their stack traces all look like:

"Thread-14524974" prio=10 tid=0x00002aab18650800 nid=0x6c36 runnable [0x0000000043c2a000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:47)
        at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:236)
        at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:536)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:128)
        at bixo.parser.TikaCallable.call(TikaCallable.java:62)
        at bixo.parser.TikaCallable.call(TikaCallable.java:23)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:619)

We traced it to a bug in XercesImpl (we're using 2.9.1), where the scanExternalID() method can loop forever if it reaches the end of the document without finding the matching closing quote character for a literal.

Unfortunately we don't know the exact documents that caused these problems, but my guess is that they're not really XML docs, or they're badly broken.

This looks like it's been fixed in Xerces 2.10.0, potentially as a side effect of the fix for https://issues.apache.org/jira/browse/XERCESJ-1357

-- Ken

PS - Unfortunately 2.10.0 hasn't been pushed to Maven Central yet.

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






