[ https://issues.apache.org/jira/browse/TIKA-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080449#comment-13080449 ]

Cristian Vat commented on TIKA-632:
-----------------------------------

Tika uses RTFEditorKit from javax.swing.text.rtf for the actual RTF parsing, and that doesn't seem to support links.

In the example you provided, links are actually marked using two methods:
- \htmlrtf tags, which are "Control Words Introduced by Specific/Other Microsoft Products"
- \field instances of type hyperlink, which seem to be the normal RTF way of adding links

However, the RTF parser in Swing ignores a lot of "unknown" control words, and it ignores \field completely.

For reference, there is a bug opened in 1999 and closed as "Will Not Fix" to enhance RTF parsing ( http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4261277 )

To quote Jukka from another issue: "there's little we can do about this as long as we're stuck with the Swing RTF parser".

> Rtf parsing ignores links
> -------------------------
>
>                 Key: TIKA-632
>                 URL: https://issues.apache.org/jira/browse/TIKA-632
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>         Attachments: test.rtf
>
>
> I spotted this while working on TIKA-631 - an RTF file containing links has
> the link skipped over - neither the link text nor the link href are output.
> In the attached sample file (which is the RTF contents of
> /test-documents/test-outlook2003.msg), we should see things like:
> <a href="http://r.office.microsoft.com/r/rlidOutlookWelcomeMail1?clid=1033">Streamlined Mail Experience</a> - Outlook
> Instead, all we get is " - Outlook"
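
For anyone who wants to check the Swing behaviour outside of Tika, here is a minimal sketch that feeds RTF straight to RTFEditorKit and prints whatever text ends up in the document. The class name RtfFieldProbe, the helper extractText, and both RTF snippets (including the example.com URL) are made up for illustration; this is not Tika's parser code. Compare the two printed lines to see how a \field hyperlink is treated by your JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical probe class, not part of Tika.
final class RtfFieldProbe {

    // Parse an RTF string with the Swing RTF parser and return the plain text it kept.
    static String extractText(String rtf) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument doc = new DefaultStyledDocument();
        try (InputStream in =
                 new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII))) {
            kit.read(in, doc, 0);
        }
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        String header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0 ";

        // Plain run of text, no links.
        String plain = header + "Plain text\\par}";

        // Illustrative hyperlink written as a \field with \fldinst and \fldrslt groups.
        String linked = header
            + "{\\field{\\*\\fldinst HYPERLINK \"http://example.com/\"}"
            + "{\\fldrslt Link text}} - Outlook\\par}";

        System.out.println("plain:  [" + extractText(plain).trim() + "]");
        System.out.println("linked: [" + extractText(linked).trim() + "]");
    }
}
```

If the second line comes back without "Link text", that matches what the issue describes, and it would point at the Swing parser rather than at anything Tika does afterwards.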