[ 
https://issues.apache.org/jira/browse/TIKA-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kaulbach updated TIKA-3157:
----------------------------------
    Description: 
The attached .docx file was created in MS Office: I simply drew a rectangle 
and then added a hyperlink to it. While the hyperlink doesn't show inside 
LibreOffice, it's still there and clickable when opened with MS Office.

When parsing with Tika, the hyperlink attached to the shape is nowhere to be 
found in the output. Enabling all Office/OOXML parse options in the context has 
not helped.

 

When debugging, I can see that the "a:hlinkClick" tag with the link inside is 
being skipped at 
org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java 
in the startElement method, because "inACChoiceDepth" is greater than 0.
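
To illustrate the skipping behaviour described above, here is a minimal 
sketch using only the JDK's SAX parser. It is not Tika's actual code: the 
class name ChoiceDepthSketch, the LinkCollector handler, the skipInsideChoice 
flag and the sample XML are all hypothetical, and the XML is a hand-written 
fragment rather than the content of the attached file. It only shows how a 
depth counter on mc:Choice can cause an a:hlinkClick element to be dropped.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ChoiceDepthSketch {

    static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";
    static final String A_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String R_NS =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Hand-written fragment (assumption): a hyperlink that only appears
    // inside mc:AlternateContent/mc:Choice.
    static final String SAMPLE =
            "<root xmlns:mc='" + MC_NS + "' xmlns:a='" + A_NS + "' xmlns:r='" + R_NS + "'>"
            + "<mc:AlternateContent>"
            + "<mc:Choice Requires='wps'><a:hlinkClick r:id='rId4'/></mc:Choice>"
            + "<mc:Fallback/>"
            + "</mc:AlternateContent>"
            + "</root>";

    /** Collects the r:id of every a:hlinkClick element it is allowed to see. */
    static class LinkCollector extends DefaultHandler {
        final boolean skipInsideChoice; // hypothetical switch, not a Tika option
        final List<String> relIds = new ArrayList<>();
        int inACChoiceDepth = 0;

        LinkCollector(boolean skipInsideChoice) {
            this.skipInsideChoice = skipInsideChoice;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                inACChoiceDepth++;
                return;
            }
            // The check the report refers to: ignore everything under mc:Choice.
            if (skipInsideChoice && inACChoiceDepth > 0) {
                return;
            }
            if (A_NS.equals(uri) && "hlinkClick".equals(localName)) {
                String id = atts.getValue(R_NS, "id");
                if (id != null) {
                    relIds.add(id);
                }
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                inACChoiceDepth--;
            }
        }
    }

    static List<String> collect(String xml, boolean skipInsideChoice) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace URIs
        LinkCollector handler = new LinkCollector(skipInsideChoice);
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.relIds;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect(SAMPLE, true));  // prints []
        System.out.println(collect(SAMPLE, false)); // prints [rId4]
    }
}
```

With the check enabled the relationship id is never seen; with it disabled 
the id is collected and could then be resolved to the target URL through the 
part's relationships.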

And then the fallback tag, which also has the link inside a

  was:
The attached .docx file was created in MS Office: I simply drew a rectangle 
and then added a hyperlink to it. While the hyperlink doesn't show inside 
LibreOffice, it's still there and clickable when opened with MS Office.

When parsing with Tika, the hyperlink attached to the shape is nowhere to be 
found in the output. Enabling all Office/OOXML parse options in the context has 
not helped.

 

When debugging, I can see that the linked shape is being skipped at 
org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java 
in the startElement method, because "inACChoiceDepth" is greater than 0.

For my use case I'd like to extract as much information as possible from the 
document. It would be helpful if the parser config could either disable this 
check on "inACChoiceDepth" or increase the allowed limit before skipping 
content.


> Missing content from .docx file with hyperlinked shape
> ------------------------------------------------------
>
>                 Key: TIKA-3157
>                 URL: https://issues.apache.org/jira/browse/TIKA-3157
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Robert Kaulbach
>            Priority: Minor
>
> The attached .docx file was created in MS Office: I simply drew a rectangle 
> and then added a hyperlink to it. While the hyperlink doesn't show inside 
> LibreOffice, it's still there and clickable when opened with MS Office.
> When parsing with Tika, the hyperlink attached to the shape is nowhere to be 
> found in the output. Enabling all Office/OOXML parse options in the context 
> has not helped.
>  
> When debugging, I can see that the "a:hlinkClick" tag with the link inside 
> is being skipped at 
> org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java 
> in the startElement method, because "inACChoiceDepth" is greater than 0.
> And then the fallback tag, which also has the link inside a



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
