[ https://issues.apache.org/jira/browse/TIKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-374.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.7
Assignee: Jukka Zitting
Thanks for the accurate analysis of the problem. I fixed this in revision
903775 by making each call to XmlRootExtractor.extractRootElement() use a new
SAXParser instance.
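
For readers of the archive, a minimal sketch of the per-call-parser idea follows. It is not the code committed in revision 903775: the class name RootElementSketch and the early-abort handler are assumptions made for illustration. Only the principle comes from the fix described above, namely that no SAXParser is kept in a field and shared between threads. The factory is also created per call here, since JAXP does not promise that a SAXParserFactory is thread-safe either.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.namespace.QName;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, not Tika's XmlRootExtractor.
public class RootElementSketch {

    // Records the first element seen, then aborts the parse.
    private static final class RootHandler extends DefaultHandler {
        QName root;

        @Override
        public void startElement(String uri, String local, String name,
                                 Attributes attrs) throws SAXException {
            root = new QName(uri, local);
            throw new SAXException("root element found, stopping parse");
        }
    }

    // Returns the root element name, or null if the stream is not
    // well-formed XML up to its root element.
    public static QName extractRootElement(InputStream stream) {
        RootHandler handler = new RootHandler();
        try {
            // New factory and parser on every call: no parser state is
            // shared between threads.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(stream, handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Either the deliberate abort above or a genuine parse
            // failure; handler.root tells the two apart.
        }
        return handler.root;
    }

    public static void main(String[] args) {
        byte[] xml = "<a xmlns='urn:x'><b/></a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractRootElement(new ByteArrayInputStream(xml)));
    }
}
```

Creating a parser per call costs more than reusing one, but the method no longer touches any mutable state shared with other callers, so it can be invoked from several threads without synchronization.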
> AutoDetectParser not thread-safe?
> ---------------------------------
>
> Key: TIKA-374
> URL: https://issues.apache.org/jira/browse/TIKA-374
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 0.5
> Environment: Dell E6400 (dual-core) running 64-bit Windows 7. Also
> reproduced on an 8-processor Mac OS X server.
> Reporter: Adam Rauch
> Assignee: Jukka Zitting
> Fix For: 0.7
>
>
> We are using Tika 0.5 to parse files that are added to a Lucene index. If we
> assign multiple threads to the parsing task we find that the
> AutoDetectParser.parse() method occasionally fails to return. In our case,
> it appears that a HashMap inside Xerces gets corrupted, causing an infinite
> loop inside HashMap.get(). This seems to be a concurrency problem; we have
> not seen the issue when running single threaded.
>
> Other posts have stated that AutoDetectParser is thread-safe. A quick look
> at the source code shows that an AutoDetectParser holds a MimeTypes which
> holds an XmlRootExtractor which holds a SAXParser. As a result, a single
> SAXParser instance can end up simultaneously parsing documents in multiple
> threads. The Java 1.4 SAXParser JavaDoc clearly states that "An
> implementation of SAXParser is NOT guaranteed to behave as per the
> specification if it is used concurrently by two or more threads." More
> recent versions of the JavaDoc have removed the warning, though the presence
> of "setProperty()" certainly means that a SAXParser is not immutable. As you
> can see from the stack trace below, properties seem to be the issue in this
> case.
>
> We've tried to work around the issue by constructing a new AutoDetectParser
> for each file we parse, but this doesn't solve the problem. Multiple
> AutoDetectParsers can still end up sharing a single instance of MimeTypes,
> because TikaConfig holds a MimeTypes instance statically (??) and updates it
> without synchronization (??).
>
> java.lang.Thread.State: RUNNABLE
>     at java.util.HashMap.get(HashMap.java:303)
>     at org.apache.xerces.util.ParserConfigurationSettings.getProperty(ParserConfigurationSettings.java:224)
>     at org.apache.xerces.impl.dtd.XMLDTDProcessor.reset(XMLDTDProcessor.java:344)
>     at org.apache.xerces.parsers.XML11Configuration.reset(XML11Configuration.java:984)
>     at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:806)
>     at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
>     at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1196)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:555)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(SAXParserImpl.java:289)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:63)
>     at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:237)
>     at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:534)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:92)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
>     at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:170)
>     at org.labkey.search.model.AbstractSearchService.preprocess(AbstractSearchService.java:664)
>     at org.labkey.search.model.AbstractSearchService.getPreprocessedItem(AbstractSearchService.java:737)
>     at org.labkey.search.model.AbstractSearchService$7.run(AbstractSearchService.java:773)
>     at java.lang.Thread.run(Thread.java:637)