[ https://issues.apache.org/jira/browse/TIKA-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353224#comment-14353224 ]
Nick Burch commented on TIKA-1568:
----------------------------------

Maybe we could look at putting the EncodingDetector on the TikaConfig object? Perhaps indirectly? It could potentially work like DefaultParser / DefaultDetector does, where in the default case it finds suitable classes only once, but can handle dynamic loading as well

I'm not sure about having the parsers cache the EncodingDetector - I'm not sure we've got anything like that happening anywhere currently, do we?

> AutoDetectReader performance problem
> ------------------------------------
>
>                 Key: TIKA-1568
>                 URL: https://issues.apache.org/jira/browse/TIKA-1568
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Andrzej Bialecki
>
> Parsing performance of many text files suffers from repeated calls to
> ServiceLoader.loadServiceProviders(EncodingDetector.class). This happens in
> TXTParser, HTMLParser and SourceCodeParser. In most cases, when Tika is using
> the default ServiceLoader instance created in the Parser's static section
> this cost can be avoided by caching the resulting List<EncodingDetector>
> either at a higher level in the Parser (as a static property). If using
> custom ServiceLoader-s this can be achieved by putting this list in
> ParsingContext, or caching these lists at a lower level in the ServiceLoader
> component.
> Relevant part of the stacktrace follows:
> {code}
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
>         - locked <0x00000007909d2e48> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:227)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
>         at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
>         at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
>         at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
>         at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
>         at java.util.Collections.list(Collections.java:3687)
>         at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
>         at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
>         at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
>         at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
>         at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
>         at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
>         at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
>         at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>         at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>         ...
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
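
For illustration only, the "find suitable classes only once" idea discussed in this thread could be sketched roughly as below. This is not Tika code: `Detector`, `CachedProviders`, `expensiveLookup` and `LOOKUPS` are hypothetical names standing in for EncodingDetector and the ServiceLoader lookup, and the sketch assumes the set of providers does not change after the first lookup (so it does not cover the dynamic loading case).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class DetectorCacheDemo {

    /** Hypothetical stand-in for an encoding detector; not the Tika interface. */
    interface Detector {
        String name();
    }

    /** Runs an expensive provider lookup at most once and reuses the result. */
    static final class CachedProviders<T> {
        private final Supplier<List<T>> loader;
        private volatile List<T> cached; // null until the first successful lookup

        CachedProviders(Supplier<List<T>> loader) {
            this.loader = loader;
        }

        List<T> get() {
            List<T> result = cached;
            if (result == null) {
                synchronized (this) {
                    result = cached;
                    if (result == null) {
                        // Copy and wrap so callers cannot modify the shared list.
                        result = Collections.unmodifiableList(new ArrayList<>(loader.get()));
                        cached = result;
                    }
                }
            }
            return result;
        }
    }

    /** Counts how often the (simulated) classpath scan actually runs. */
    static final AtomicInteger LOOKUPS = new AtomicInteger();

    /** Simulates the costly per-call lookup seen in the stack trace. */
    static List<Detector> expensiveLookup() {
        LOOKUPS.incrementAndGet();
        List<Detector> found = new ArrayList<>();
        found.add(() -> "html");
        found.add(() -> "universal");
        return found;
    }

    public static void main(String[] args) {
        CachedProviders<Detector> detectors =
                new CachedProviders<>(DetectorCacheDemo::expensiveLookup);
        // Simulate many parse calls; only the first one triggers the lookup.
        for (int i = 0; i < 1000; i++) {
            detectors.get();
        }
        System.out.println("lookups performed: " + LOOKUPS.get());
    }
}
```

Whether such a cache would live in a parser's static field, on TikaConfig, or inside the ServiceLoader component is exactly the open question above; the sketch only shows the caching itself.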