Hello,

I am attempting to process some .msg files that contain a considerable
amount of nested content. Parsing the email bodies with the out-of-the-box
HtmlParser + EmbeddedContentHandler/BodyContentHandler throws a "Suspected
zip bomb: 100 levels of XML element nesting" error. Sadly, I am not able to
provide these emails for review.

I have been trying to work out how to raise the maximum allowed depth for
the content handlers that HtmlParser uses, but I'm a bit lost in the weeds
as to how the ContentHandlerDecorator pattern works and how the
SecureContentHandler can be loaded with a different config.

I looked over TIKA-2091, but it seems to be specific to a different
project. All other googling turned up Solr-specific StackOverflow threads
for this zip bomb error. Any chance someone could point me to documentation
on what changes are needed to increase this limit, or tell me whether I can
set it in tika-config.xml instead?
