[ https://issues.apache.org/jira/browse/TIKA-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cristian Vat updated TIKA-2837:
-------------------------------

Description:

I got a StackOverflowError while parsing a large PDF file using ToHTMLContentHandler. Trace:
{noformat}
java.lang.StackOverflowError: null
    at java.base/java.util.HashMap.hash(HashMap.java:339) ~[na:na]
    at java.base/java.util.HashMap.get(HashMap.java:552) ~[na:na]
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:54) ~[tika-core-1.20.jar:1.20]
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
    ....about 1000 recursive calls...
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
{noformat}

The error occurred in a Spring Boot command-line app that also does other processing.
I couldn't reproduce it with a standalone example; possibly the standalone run doesn't fill up the stack completely.
There was also no error in the standalone Tika app, whether run with the GUI or from the command line.

PDF File: "10.1007-s00268-016-3727-3.pdf" can be downloaded from
[https://www.researchgate.net/publication/309385633_Safety_of_Nonsteroidal_Anti-inflammatory_Drugs_in_Major_Gastrointestinal_Surgery_A_Prospective_Multicenter_Cohort_Study]
Generated output has 4681 <meta> tags
Maximum tag depth of generated (X)HTML is 6

I then timed parsing with ToHTMLContentHandler versus directly with ToXMLContentHandler.
After a warmup of a few hundred parse calls, the times were:
 - ToHTMLContentHandler: avg 500 ms
 - ToXMLContentHandler: avg 80-90 ms

Profiling with YourKit showed a hotspot and a very deep stack of recursive calls in ToXMLContentHandler$ElementInfo.getPrefix(String) at ToXMLContentHandler.java:58, the same frame as in the StackOverflowError.

Checking the code, I found that ToXMLContentHandler.endElement mentions and fixes an old, similar issue, TIKA-1070:
{code:java}
// Reset the position in the tree, to avoid endless stack overflow
// chains (see TIKA-1070)
currentElement = currentElement.parent;
{code}
But ToHTMLContentHandler.endElement doesn't call super.endElement for empty elements, including the <meta> tag. So the chain of currentElement parents keeps growing in this case?

I created my own version of ToHTMLContentHandler that calls super.endElement inside the EMPTY_ELEMENTS if branch, with these results:
 - no more StackOverflowError in the Spring Boot app
 - parse times dropped to those of the XML version, so at least a 5x speed improvement
 - output is identical except for an additional "</meta>" closing tag.

Questions:
 - Should anybody be using ToHTMLContentHandler instead of ToXMLContentHandler? I'm not sure of the exact use case, since the information seems to be the same and there are unaffected XML and XHTML content handlers.
 - Is there any way ToHTMLContentHandler could be improved without emitting the extra "</meta>" closing tag?

> Performance/Stability problem in ToHTMLContentHandler
> -----------------------------------------------------
>
>                 Key: TIKA-2837
>                 URL: https://issues.apache.org/jira/browse/TIKA-2837
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Cristian Vat
>            Priority: Major
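
To illustrate the suspected mechanism from the description (empty elements such as <meta> are opened but never popped, so every later prefix lookup walks a longer parent chain), here is a minimal, self-contained sketch. It is a simplified model and not Tika's actual code: the class ParentChainDemo, the method chainDepthAfter, and this cut-down ElementInfo are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the handler's element tracking; NOT Tika's real implementation. */
class ParentChainDemo {

    /** Hypothetical stand-in for an element record that links to its parent. */
    static final class ElementInfo {
        final ElementInfo parent;
        final Map<String, String> namespaces = new HashMap<>();

        ElementInfo(ElementInfo parent) {
            this.parent = parent;
        }

        /** Looks the prefix up locally, then recurses into the parent chain. */
        String getPrefix(String uri) {
            String prefix = namespaces.get(uri);
            if (prefix != null) {
                return prefix;
            }
            return parent != null ? parent.getPrefix(uri) : null;
        }

        /** Counts the links up to the root iteratively, so it cannot overflow the stack. */
        int depth() {
            int d = 0;
            for (ElementInfo e = this; e != null; e = e.parent) {
                d++;
            }
            return d;
        }
    }

    /**
     * Simulates emitting a number of empty sibling elements.
     * popOnEnd = false models an endElement that skips the parent reset;
     * popOnEnd = true models the reset "currentElement = currentElement.parent".
     */
    static int chainDepthAfter(int emptyElements, boolean popOnEnd) {
        ElementInfo current = new ElementInfo(null); // root element
        for (int i = 0; i < emptyElements; i++) {
            current = new ElementInfo(current);      // startElement pushes a record
            if (popOnEnd) {
                current = current.parent;            // endElement restores the parent
            }
        }
        return current.depth();
    }

    public static void main(String[] args) {
        // 4681 is the number of <meta> tags reported for the PDF in this issue.
        System.out.println("without reset: " + chainDepthAfter(4681, false));
        System.out.println("with reset:    " + chainDepthAfter(4681, true));
    }
}
```

In this model the chain grows by one record per empty element when the reset is skipped, so each recursive getPrefix call that misses has to descend through all of them. That would be consistent with both the slowdown and the deep stack described above, although the model says nothing about how the real handler should avoid writing the closing tag.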