[ 
https://issues.apache.org/jira/browse/TIKA-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kaulbach updated TIKA-3157:
----------------------------------
    Description: 
The attached .docx file was created in MS Office by drawing a rectangle and 
then adding a hyperlink to it. Although the hyperlink does not show in 
LibreOffice, it is still there and clickable when opened in MS Office.

When parsing with Tika, the hyperlink attached to the shape is nowhere to be 
found in the output. Enabling all Office/OOXML parse options in the context has 
not helped.

 

When debugging, I can see that the "a:hlinkClick" tag containing the link is 
skipped in the startElement method of 
org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java 
because "inACChoiceDepth" is greater than 0.

The fallback tag, which separately has the link inside a "v:rect" tag, 
doesn't seem to get processed either, so the link content is never saved.
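
To make the suspected behaviour concrete, here is a standalone sketch that uses only the JDK's SAX parser. It is not Tika's handler: the class name, the `skipChoice` flag, the counter `inChoiceDepth`, and the XML fragment (including the `rId4` relationship id and the `href` on "v:rect") are all hypothetical simplifications, not taken from the attached file. It only shows how a depth counter for "mc:Choice", combined with a fallback branch that is never inspected for links, would leave the output without the hyperlink.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Standalone illustration only; this is NOT Tika's handler. */
class ChoiceSkipDemo {

    static final String MC_NS =
        "http://schemas.openxmlformats.org/markup-compatibility/2006";
    static final String R_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Hypothetical, heavily simplified fragment (not the attached document). */
    static final String SAMPLE =
        "<mc:AlternateContent"
      + " xmlns:mc='" + MC_NS + "'"
      + " xmlns:r='" + R_NS + "'"
      + " xmlns:a='http://schemas.openxmlformats.org/drawingml/2006/main'"
      + " xmlns:v='urn:schemas-microsoft-com:vml'>"
      + "<mc:Choice Requires='wps'><a:hlinkClick r:id='rId4'/></mc:Choice>"
      + "<mc:Fallback><v:rect href='https://example.org/'/></mc:Fallback>"
      + "</mc:AlternateContent>";

    /** Collects the r:id of every a:hlinkClick that the handler looks at. */
    static List<String> collectLinks(String xml, boolean skipChoice) {
        List<String> links = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int inChoiceDepth = 0; // assumed analogue of inACChoiceDepth

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                    inChoiceDepth++;
                    return;
                }
                if (skipChoice && inChoiceDepth > 0) {
                    return; // everything inside mc:Choice is ignored
                }
                if ("hlinkClick".equals(localName)) {
                    links.add(atts.getValue(R_NS, "id"));
                }
                // v:rect inside mc:Fallback is never checked for a link here
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                    inChoiceDepth--;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println("skipping Choice: " + collectLinks(SAMPLE, true));
        System.out.println("reading Choice:  " + collectLinks(SAMPLE, false));
    }
}
```

With `skipChoice` set to true the sketch finds no link at all, which matches what I see in Tika's output. With it set to false the relationship id from the "mc:Choice" branch is collected. Whether the real fix belongs in the choice branch or in the "v:rect" fallback handling is an open question.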

  was:
The attached .docx file was created in MS Office by drawing a rectangle and 
then adding a hyperlink to it. Although the hyperlink does not show in 
LibreOffice, it is still there and clickable when opened in MS Office.

When parsing with Tika, the hyperlink attached to the shape is nowhere to be 
found in the output. Enabling all Office/OOXML parse options in the context has 
not helped.

 

When debugging, I can see that the "a:hlinkClick" tag containing the link is 
skipped in the startElement method of 
org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java 
because "inACChoiceDepth" is greater than 0.

And then the fallback tag, which also has the link inside a


> Missing content from .docx file with hyperlinked shape
> ------------------------------------------------------
>
>                 Key: TIKA-3157
>                 URL: https://issues.apache.org/jira/browse/TIKA-3157
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Robert Kaulbach
>            Priority: Minor
>
> The attached .docx file was created in MS Office by drawing a rectangle and 
> then adding a hyperlink to it. Although the hyperlink does not show in 
> LibreOffice, it is still there and clickable when opened in MS Office.
> When parsing with Tika, the hyperlink attached to the shape is nowhere to be 
> found in the output. Enabling all Office/OOXML parse options in the context 
> has not helped.
>  
> When debugging, I can see that the "a:hlinkClick" tag containing the link is 
> skipped in the startElement method of 
> org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java 
> because "inACChoiceDepth" is greater than 0.
> The fallback tag, which separately has the link inside a "v:rect" tag, 
> doesn't seem to get processed either, so the link content is never saved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
