Robert Kaulbach created TIKA-3157:
-------------------------------------

             Summary: Missing content from .docx file with hyperlinked shape
                 Key: TIKA-3157
                 URL: https://issues.apache.org/jira/browse/TIKA-3157
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.24.1
            Reporter: Robert Kaulbach


The attached .docx file was created in MS Office: I simply drew a rectangle and 
then added a hyperlink to it. While the hyperlink doesn't show inside 
LibreOffice, it's still there and clickable when opened with MS Office.

When parsing with Tika, the hyperlink attached to the shape is nowhere to be 
found in the output. Enabling all Office/OOXML parse options in the context has 
not helped.

When debugging, I can see that the linked shape is skipped in the startElement 
method of 
org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java 
because "inACChoiceDepth" is greater than 0.

For my use case I'd like to extract as much information as possible from the 
document. It would be helpful if the parser config could either disable this 
check on "inACChoiceDepth" or increase the allowed limit before skipping 
content.
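
To illustrate the behaviour described above, here is a minimal, self-contained 
sketch using only the JDK's SAX parser. It is not Tika's actual code: the class 
name ChoiceDepthSketch, the skipChoice flag and the sample XML are hypothetical, 
and a real hyperlink on a shape is carried in attributes and relationships 
rather than in plain character data. It only shows how a depth counter on 
mc:Choice can hide content, and how a config switch of the kind requested here 
could expose it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ChoiceDepthSketch {

    // Markup Compatibility namespace used by mc:AlternateContent / mc:Choice.
    static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /** Collects character data, optionally skipping text nested in mc:Choice. */
    static class Handler extends DefaultHandler {
        private final boolean skipChoice;   // hypothetical config option
        private final StringBuilder text = new StringBuilder();
        private int inACChoiceDepth = 0;    // how many mc:Choice elements we are inside

        Handler(boolean skipChoice) {
            this.skipChoice = skipChoice;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                inACChoiceDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                inACChoiceDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // This is the kind of check that drops the linked shape's content.
            if (skipChoice && inACChoiceDepth > 0) {
                return;
            }
            text.append(ch, start, length);
        }
    }

    static String extract(String xml, boolean skipChoice) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri/localName are populated
        Handler handler = new Handler(skipChoice);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc xmlns:mc='" + MC_NS + "'>"
                + "<mc:AlternateContent>"
                + "<mc:Choice><link>http://example.com/</link></mc:Choice>"
                + "<mc:Fallback><t>shape</t></mc:Fallback>"
                + "</mc:AlternateContent></doc>";
        System.out.println(extract(xml, true));   // prints "shape"
        System.out.println(extract(xml, false));  // prints "http://example.com/shape"
    }
}
```

With the flag set, only the fallback text survives; with it cleared, the 
content inside mc:Choice is emitted as well, which is the outcome I am after.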



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
