[ https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-1454:
------------------------------------
    Fix Version/s:     (was: 1.15)
                       1.16

> Extracting as HTML loses links in xlsx, ppt, and pptx files
> -----------------------------------------------------------
>
>                 Key: TIKA-1454
>                 URL: https://issues.apache.org/jira/browse/TIKA-1454
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
>         Environment: RedHat EL5, EL6, EL7
>            Reporter: Chris Bryant
>            Assignee: Tim Allison
>             Fix For: 1.16
>
>         Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for
> anchor tags to find links to external URLs. This works fine when looking at
> some document types, including PDFs, Open Document formats, Microsoft Word
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it
> does not work for any Microsoft Powerpoint formats (.ppt or .pptx) and it
> does not work for the newer Excel .xlsx format. For the .ppt, .pptx, and
> .xlsx formats, the text is extracted properly and formatted into HTML, but
> the link is not converted to an anchor tag.
> I am running tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly
> extract links, and also included samples of .ods and .odp files that do
> extract links properly.
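
The check described in the report (scan the extracted HTML for anchor tags that point at external URLs) could look roughly like the sketch below. It is not the reporter's code: the names AnchorCollector and external_links are hypothetical, and the two sample strings are hand-written stand-ins for extractor output, not actual Tika output. It uses only the Python standard library and assumes "external" means an http or https href.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def external_links(html_text):
    """Return hrefs from anchor tags whose scheme is http or https."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return [url for url in collector.links
            if urlparse(url).scheme in ("http", "https")]


# Hand-written samples (assumptions, not real extractor output):
# one where the link survived as an anchor, one where only the text did.
with_anchor = '<html><body><p><a href="http://example.com/">example</a></p></body></html>'
without_anchor = '<html><body><p>http://example.com/</p></body></html>'

print(external_links(with_anchor))
print(external_links(without_anchor))
```

With input shaped like the second sample, the URL is present as text but no anchor tag is found, which matches the symptom reported for the .ppt, .pptx, and .xlsx files.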