[jira] [Commented] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

Tim Allison (JIRA) Thu, 12 May 2016 06:44:12 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281524#comment-15281524
 ]


Tim Allison commented on TIKA-1454:
-----------------------------------

Thank you for opening this issue and supplying test docs.  For ppt and pptx, I 
have a reasonable patch.  We'll need to add some things into POI to make the 
extraction cleaner, but this should be good to go soonish.

For xlsx, it looks like we'll have to dump hyperlinks at the bottom of each 
sheet; otherwise we'd have to do a double pass to cache the hyperlinks and 
insert them in the proper cells.  Not great, but at least we should be able 
to get the hyperlinks for your purposes.
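
To make the xlsx trade-off concrete, here is a rough sketch of the two
options.  It is not Tika or POI code: the function names, the (cell_ref, text)
and (cell_ref, url) tuples, and the HTML layout are all hypothetical, and it
assumes the hyperlinks only become available after the cells have been read.

```python
from html import escape


def render_two_pass(cells, hyperlinks):
    """Double pass: cache hyperlinks first, then emit cells with anchors.

    cells      -- list of (cell_ref, text) in sheet order (hypothetical shape)
    hyperlinks -- list of (cell_ref, url) for the same sheet (hypothetical)
    """
    # Pass 1: cache every hyperlink by the cell it belongs to.
    link_by_ref = dict(hyperlinks)
    # Pass 2: walk the cells and wrap the linked ones in an anchor tag.
    parts = []
    for ref, text in cells:
        body = escape(text)
        url = link_by_ref.get(ref)
        if url is not None:
            body = '<a href="{}">{}</a>'.format(escape(url), body)
        parts.append("<td>{}</td>".format(body))
    return "<tr>" + "".join(parts) + "</tr>"


def render_links_at_bottom(cells, hyperlinks):
    """Single pass: emit cells as plain text, then list the links afterwards."""
    row = "".join("<td>{}</td>".format(escape(text)) for _, text in cells)
    links = "".join(
        '<div><a href="{}">{}</a></div>'.format(escape(url), escape(url))
        for _, url in hyperlinks
    )
    return "<tr>" + row + "</tr>" + links
```

The second variant loses the association between a link and its cell, but it
still puts every URL into an anchor tag in the output.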

> Extracting as HTML loses links in xlsx, ppt, and pptx files
> -----------------------------------------------------------
>
>                 Key: TIKA-1454
>                 URL: https://issues.apache.org/jira/browse/TIKA-1454
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
>         Environment: RedHat EL5, EL6, EL7
>            Reporter: Chris Bryant
>            Assignee: Tim Allison
>         Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, 
> urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for 
> anchor tags to find links to external URLs.  This works fine when looking at 
> some document types, including PDFs, Open Document formats, Microsoft Word 
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it 
> does not work for any Microsoft PowerPoint formats (.ppt or .pptx) and it 
> does not work for the newer Excel .xlsx format.  For the .ppt, .pptx, and 
> .xlsx formats, the text is extracted properly and formatted into HTML, but 
> the link is not converted to an anchor tag.
> I am running tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly 
> extract links, and also included samples of .ods and .odp files that do 
> extract links properly.
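
For reference, the anchor-tag scan described in the report could look roughly
like the sketch below.  It uses only the Python standard library and is an
illustration, not the reporter's actual code: the AnchorCollector class and
external_links function are hypothetical names, and "external" is assumed to
mean an http or https URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def external_links(html_text):
    """Return the http/https hrefs found in anchor tags, in document order."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return [u for u in collector.links
            if urlparse(u).scheme in ("http", "https")]
```

With the behavior reported here, the HTML produced for the .ppt, .pptx, and
.xlsx samples has no anchor tags, so a scan like this finds no links in them
even though the link text itself is present.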



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
