[jira] [Updated] (TIKA-2837) Performance/Stability problem in ToHTMLContentHandler

Cristian Vat (JIRA) Sat, 02 Mar 2019 08:55:17 -0800


     [ https://issues.apache.org/jira/browse/TIKA-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Cristian Vat updated TIKA-2837:
-------------------------------
    Description: 
I got a StackOverflowError while parsing a large PDF file using
 ToHTMLContentHandler. Trace:
{noformat}
java.lang.StackOverflowError: null
    at java.base/java.util.HashMap.hash(HashMap.java:339) ~[na:na]
    at java.base/java.util.HashMap.get(HashMap.java:552) ~[na:na]
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:54) ~[tika-core-1.20.jar:1.20]
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
....about 1000 recursive calls...
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58) ~[tika-core-1.20.jar:1.20]
{noformat}
 

The error occurred in a Spring Boot command-line app that also does other
 processing.
 I couldn't reproduce it with a standalone example, possibly because a
 standalone run doesn't fill up the stack completely.
 The standalone Tika app shows no error either, whether run with the GUI or
 from the command line.

 

PDF file: "10.1007-s00268-016-3727-3.pdf", which can be downloaded from
 [https://www.researchgate.net/publication/309385633_Safety_of_Nonsteroidal_Anti-inflammatory_Drugs_in_Major_Gastrointestinal_Surgery_A_Prospective_Multicenter_Cohort_Study]
 The generated output has 4681 <meta> tags.
 The maximum tag depth of the generated (X)HTML is only 6, so real nesting
 alone should not lead to about 1000 recursive calls.

 

I then timed parsing with ToHTMLContentHandler against parsing directly with
 ToXMLContentHandler. After a warm-up of a few hundred parse calls, the
 times were:
 - ToHTMLContentHandler: avg 500 ms
 - ToXMLContentHandler: avg 80-90 ms
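
A comparison like the one above could be reproduced with a small harness
 along these lines. This is only a sketch, not the code used for the numbers
 above: ParseTimer and averageMillis are hypothetical names, and the task
 passed in is assumed to perform one complete parse with a fresh handler.

```java
import java.util.concurrent.Callable;

// Hypothetical timing helper; not part of Tika.
class ParseTimer {

    // Runs the task warmupRuns times untimed, then returns the mean
    // wall-clock time in milliseconds over measuredRuns timed runs.
    static double averageMillis(Callable<?> task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            runOnce(task);
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            runOnce(task);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / measuredRuns;
    }

    // Wraps checked exceptions so callers can pass parse calls directly.
    private static void runOnce(Callable<?> task) {
        try {
            task.call();
        } catch (Exception e) {
            throw new IllegalStateException("timed task failed", e);
        }
    }
}
```

 Each handler would get its own task, each creating a new handler instance
 per run, so that state from one parse cannot leak into the next measurement.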

 

Profiling with YourKit showed a hotspot and a very deep stack of
 recursive calls to ToXMLContentHandler$ElementInfo.getPrefix(String)
 at ToXMLContentHandler.java:58, the same location as in the StackOverflowError.

Checking the code, I found that ToXMLContentHandler.endElement mentions,
 and contains a fix for, the similar older issue TIKA-1070:
{code:java}
// Reset the position in the tree, to avoid endless stack overflow
// chains (see TIKA-1070)
currentElement = currentElement.parent;
{code}
But ToHTMLContentHandler.endElement doesn't call super.endElement for
 empty elements, which include the <meta> tag. So the chain of
 currentElement parents keeps growing in this case?
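
To illustrate the suspected mechanism, here is a minimal sketch on a
 simplified model. It is not Tika's actual source: ParentChainDemo,
 chainLength and the popEmptyElements switch are made-up names, and only
 <meta> is treated as an empty element.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the element bookkeeping; hypothetical, not Tika's source.
class ParentChainDemo {

    // One node per started element, linked to the element that was current before it.
    static final class ElementInfo {
        final String name;
        final ElementInfo parent;
        final Map<String, String> namespaces = new HashMap<>();

        ElementInfo(String name, ElementInfo parent) {
            this.name = name;
            this.parent = parent;
        }

        // Recursive lookup: an unmapped URI costs one stack frame per ancestor.
        String getPrefix(String uri) {
            String prefix = namespaces.get(uri);
            if (prefix != null) {
                return prefix;
            }
            return parent != null ? parent.getPrefix(uri) : "";
        }
    }

    private final boolean popEmptyElements;
    private ElementInfo currentElement;

    ParentChainDemo(boolean popEmptyElements) {
        this.popEmptyElements = popEmptyElements;
    }

    void startElement(String name) {
        currentElement = new ElementInfo(name, currentElement);
    }

    void endElement(String name) {
        if ("meta".equals(name) && !popEmptyElements) {
            return; // models an empty-element branch that never resets currentElement
        }
        currentElement = currentElement.parent;
    }

    // Counts the nodes a getPrefix miss would walk; iterative, so it cannot overflow itself.
    int chainLength() {
        int length = 0;
        for (ElementInfo e = currentElement; e != null; e = e.parent) {
            length++;
        }
        return length;
    }

    // Simulates <html><head> followed by metaCount <meta> elements.
    static int chainLengthAfterMetaTags(int metaCount, boolean popEmptyElements) {
        ParentChainDemo handler = new ParentChainDemo(popEmptyElements);
        handler.startElement("html");
        handler.startElement("head");
        for (int i = 0; i < metaCount; i++) {
            handler.startElement("meta");
            handler.endElement("meta");
        }
        return handler.chainLength();
    }
}
```

 In this model, chainLengthAfterMetaTags(4681, false) gives a chain of 4683
 nodes, while chainLengthAfterMetaTags(4681, true) stays at 2. Every
 getPrefix miss has to recurse through the whole chain, which would fit both
 the slowdown and the deep stack.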

 

I created my own version of ToHTMLContentHandler that calls
 super.endElement inside the EMPTY_ELEMENTS if branch. With that change:
 - no more StackOverflowError in the Spring Boot app
 - parse times dropped to those of the XML version, a speed-up of at least 5x
 - output is identical except for an additional "</meta>" closing tag.
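
The trade-off of this workaround can be sketched on a toy pair of
 serializers. All names here (XmlLikeWriter, HtmlLikeWriter, Mode) are
 hypothetical and the classes only mimic the relationship between the two
 handlers. SKIP stands for the behaviour described above, CALL_SUPER for my
 workaround, and POP_ONLY for one conceivable variant that restores the
 state without writing a closing tag; I have not tried that variant
 against Tika itself.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy serializers with hypothetical names; not Tika code.
class XmlLikeWriter {
    protected final StringBuilder out = new StringBuilder();
    protected final Deque<String> openElements = new ArrayDeque<>();

    void startElement(String name) {
        out.append('<').append(name).append('>');
        openElements.push(name);
    }

    void endElement(String name) {
        out.append("</").append(name).append('>');
        openElements.pop(); // state reset that an empty-element branch must not skip
    }

    int openDepth() {
        return openElements.size();
    }

    String output() {
        return out.toString();
    }
}

class HtmlLikeWriter extends XmlLikeWriter {
    enum Mode { SKIP, CALL_SUPER, POP_ONLY }

    private static final Set<String> EMPTY_ELEMENTS =
            new HashSet<>(Arrays.asList("meta", "br", "img"));

    private final Mode mode;

    HtmlLikeWriter(Mode mode) {
        this.mode = mode;
    }

    @Override
    void endElement(String name) {
        if (!EMPTY_ELEMENTS.contains(name)) {
            super.endElement(name);
            return;
        }
        switch (mode) {
            case SKIP:
                break; // no closing tag, but also no state reset
            case CALL_SUPER:
                super.endElement(name); // resets state, writes the extra closing tag
                break;
            case POP_ONLY:
                openElements.pop(); // resets state, writes nothing
                break;
        }
    }

    // Writes <html><head> plus metaCount <meta> elements and leaves both parents open.
    static HtmlLikeWriter render(Mode mode, int metaCount) {
        HtmlLikeWriter writer = new HtmlLikeWriter(mode);
        writer.startElement("html");
        writer.startElement("head");
        for (int i = 0; i < metaCount; i++) {
            writer.startElement("meta");
            writer.endElement("meta");
        }
        return writer;
    }
}
```

 In this toy model, CALL_SUPER and POP_ONLY both keep the open depth at 2,
 and only CALL_SUPER changes the output by adding "</meta>".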

 

Questions:
 - Should anybody be using ToHTMLContentHandler instead of
 ToXMLContentHandler? I'm not sure of the exact use case, since the
 information seems to be the same and there are unaffected XML and XHTML
 content handlers.
 - Is there any way ToHTMLContentHandler could be improved without emitting
 the extra "</meta>" closing tag?


> Performance/Stability problem in ToHTMLContentHandler
> -----------------------------------------------------
>
>                 Key: TIKA-2837
>                 URL: https://issues.apache.org/jira/browse/TIKA-2837
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Cristian Vat
>            Priority: Major



