[ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-861.
-----------------------------

    Resolution: Fixed

Thanks, patches committed in r1331434.

One thing to note is that, for now, links are extracted at the end of the page. Further work may be wanted in future to match them to the text they apply to.

> Parse links in PDF
> ------------------
>
>                 Key: TIKA-861
>                 URL: https://issues.apache.org/jira/browse/TIKA-861
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Sasha Goodman
>            Priority: Minor
>              Labels: links, pdfbox
>             Fix For: 1.2
>
>         Attachments: TIKA-861-test.patch, TIKA-861.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Currently the XHTML doesn't contain links, although PDFBox parses them. I'm
> new to Tika and haven't done Java for 6 years, but someone more experienced
> could probably do this in a few hours.
> The PDF2XHTML method loops through the annotations.
> See:
> {code:java}
> for (Object o : page.getAnnotations()) {
> {code}
> I found some code for dealing with links in annotations:
> http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link
> It involves checking the class.
> {code:java}
> if (annotation instanceof PDAnnotationLink) {
>     PDAnnotationLink link = (PDAnnotationLink) annotation;
>     // ... handle the link here
> }
> {code}
> I hope this helps someone.
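
The approach discussed in this issue (loop over a page's annotations, keep those whose class marks them as links, and emit them at the end of the page) can be sketched roughly as below. This is only an illustration and not the committed patch: `Annotation`, `LinkAnnotation`, `NoteAnnotation` and `PageLinkSketch` are hypothetical stand-ins for the PDFBox annotation types, so that the example compiles without PDFBox on the classpath, and the XHTML markup it produces is an assumption rather than Tika's actual output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for PDFBox's annotation types; not the real API.
interface Annotation {}

final class LinkAnnotation implements Annotation {
    final String uri;
    LinkAnnotation(String uri) { this.uri = uri; }
}

final class NoteAnnotation implements Annotation {
    final String contents;
    NoteAnnotation(String contents) { this.contents = contents; }
}

final class PageLinkSketch {

    // Collect the URIs of link annotations, skipping every other annotation type.
    static List<String> linkUris(List<Annotation> annotations) {
        List<String> uris = new ArrayList<String>();
        for (Annotation annotation : annotations) {
            if (annotation instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) annotation;
                if (link.uri != null) {
                    uris.add(link.uri);
                }
            }
        }
        return uris;
    }

    // Escape the characters that are unsafe in XHTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Render one page: the text first, then all links at the end of the page.
    // The element names used here are illustrative only.
    static String renderPage(String text, List<Annotation> annotations) {
        StringBuilder out = new StringBuilder();
        out.append("<div class=\"page\"><p>").append(escape(text)).append("</p>");
        for (String uri : linkUris(annotations)) {
            String safe = escape(uri);
            out.append("<a href=\"").append(safe).append("\">").append(safe).append("</a>");
        }
        return out.append("</div>").toString();
    }

    public static void main(String[] args) {
        List<Annotation> annotations = Arrays.<Annotation>asList(
                new NoteAnnotation("reviewer comment"),
                new LinkAnnotation("https://tika.apache.org/"));
        System.out.println(renderPage("See the Tika site.", annotations));
    }
}
```

Because the links are appended after the page text, nothing here ties a link to the words it covers. Matching links to their text, as mentioned in the resolution note, would presumably need the position of each annotation on the page, which this sketch does not model.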