[ https://issues.apache.org/jira/browse/TIKA-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-719.
--------------------------------

    Resolution: Duplicate

Resolving as a duplicate.

> Concurrent usage of HtmlParser causes infinite loop in HashMap
> --------------------------------------------------------------
>
>                 Key: TIKA-719
>                 URL: https://issues.apache.org/jira/browse/TIKA-719
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>         Environment: SLES 10, JBoss 4.2
>            Reporter: Christian Goeller
>            Assignee: Ken Krugler
>
> When using Tika in a multithreaded environment I sometimes encounter 2
> different types of problems with the HtmlParser:
>
> 1. NullPointerException in HtmlParser
> java.lang.NullPointerException
> null
> org.ccil.cowan.tagsoup.Element.<init>(Element.java:39)
> org.ccil.cowan.tagsoup.Parser.setup(Parser.java:467)
> org.ccil.cowan.tagsoup.Parser.parse(Parser.java:439)
> org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
>
> 2. Infinite loop in HashMap
> java.util.HashMap.get(HashMap.java:303)
> org.ccil.cowan.tagsoup.Schema.getElementType(Schema.java:122)
> org.ccil.cowan.tagsoup.Parser.gi(Parser.java:959)
> org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:505)
> org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
>
> Having a closer look at the code of HtmlParser and several tagsoup classes, I
> assume that both problems come from the concurrent usage of a HashMap.
>
> The class org.apache.tika.parser.html.HtmlParser has a static field
> HTML_SCHEMA.
> {code}
>     /**
>      * HTML schema singleton used to amortize the heavy instantiation time.
>      */
>     private static final Schema HTML_SCHEMA = new HTMLSchema();
> {code}
>
> The class HTMLSchema has a field theElementTypes of type HashMap.
> {code}
>     private HashMap theElementTypes = new HashMap(); // String -> ElementType
> {code}
>
> As the HTMLSchema is held in a static field, the HashMap is accessed
> concurrently by every thread that parses HTML.
> See here for a similar problem:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6423457
>
> Maybe the HTML_SCHEMA should not be static.
> Unfortunately the bug cannot be reproduced easily.
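
The suggestion in the report ("maybe the HTML_SCHEMA should not be static") can be illustrated with a small sketch. It is not Tika's or TagSoup's code and not necessarily the fix that was applied for the duplicate issue. `SchemaStandIn`, `schemaForCurrentThread` and `countDistinctSchemas` are hypothetical names, and the sketch assumes the shared map may be written to while parsing. Each thread gets its own instance through a `ThreadLocal`, so the construction cost is still paid only once per thread:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadSchemaSketch {

    /** Hypothetical stand-in for a schema backed by a plain, unsynchronized HashMap. */
    public static final class SchemaStandIn {
        private final Map<String, String> elementTypes = new HashMap<>();

        public SchemaStandIn() {
            elementTypes.put("html", "root");
            elementTypes.put("body", "block");
        }

        /** Looks up a type and registers unknown names on the fly (assumed write path). */
        public String getElementType(String name) {
            return elementTypes.computeIfAbsent(name, n -> "unknown");
        }
    }

    // One instance per thread instead of one static instance shared by all threads.
    private static final ThreadLocal<SchemaStandIn> SCHEMA =
            ThreadLocal.withInitial(SchemaStandIn::new);

    public static SchemaStandIn schemaForCurrentThread() {
        return SCHEMA.get();
    }

    /** Starts worker threads that each mutate their own schema; returns how many instances were used. */
    public static int countDistinctSchemas(int threadCount) {
        Set<SchemaStandIn> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                SchemaStandIn schema = schemaForCurrentThread();
                for (int j = 0; j < 1000; j++) {
                    schema.getElementType("custom-" + id + "-" + j);
                }
                seen.add(schema); // identity-based membership: no equals/hashCode override
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinctSchemas(4)); // prints 4
    }
}
```

Because no `HashMap` is ever touched by more than one thread, neither a corrupted bucket chain nor a half-initialized entry can be observed. The trade-off is one schema instance per worker thread, which may matter in a container with a large thread pool.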