[ 
https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-861.
-----------------------------

    Resolution: Fixed

Thanks, patches committed in r1331434.

Note that, for now, links are emitted at the end of each page. Further work
may be needed in future to match them to the text they apply to.
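
To make the end-of-page behaviour concrete, here is a minimal sketch, not the
committed patch. It filters a page's annotations with the same instanceof
check quoted in the issue description, then writes the page text first and
the link targets after it. Annotation, LinkAnnotation, PageLinkSketch and the
exact markup are hypothetical stand-ins so the example runs without PDFBox;
in PDFBox the class to test for is PDAnnotationLink.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT the PDFBox classes. With PDFBox the
// equivalent check is `annotation instanceof PDAnnotationLink`.
class Annotation { }

class LinkAnnotation extends Annotation {
    final String uri;
    LinkAnnotation(String uri) { this.uri = uri; }
}

class PageLinkSketch {
    // Escape characters that are unsafe in XHTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Keep only link annotations and return their targets in page order.
    static List<String> collectLinks(List<Annotation> annotations) {
        List<String> uris = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a instanceof LinkAnnotation) {
                uris.add(((LinkAnnotation) a).uri);
            }
        }
        return uris;
    }

    // Emit the page text first, then one anchor per link at the end of
    // the page. The markup is illustrative only.
    static String renderPage(String text, List<Annotation> annotations) {
        StringBuilder out = new StringBuilder("<div class=\"page\">");
        out.append("<p>").append(escape(text)).append("</p>");
        for (String uri : collectLinks(annotations)) {
            String safe = escape(uri);
            out.append("<a href=\"").append(safe).append("\">")
               .append(safe).append("</a>");
        }
        return out.append("</div>").toString();
    }
}
```

Because the links are collected separately from the text, nothing in this
sketch ties a link to the words it covers. Doing that would need the
annotation's rectangle on the page, which is the further work mentioned above.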
                
> Parse links in PDF
> ------------------
>
>                 Key: TIKA-861
>                 URL: https://issues.apache.org/jira/browse/TIKA-861
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Sasha Goodman
>            Priority: Minor
>              Labels: links, pdfbox
>             Fix For: 1.2
>
>         Attachments: TIKA-861-test.patch, TIKA-861.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Currently the XHTML output doesn't contain links, although PDFBox parses
> them. I'm new to Tika and haven't written Java for 6 years, but someone more
> experienced could probably do this in a few hours.
> The PDF2XHTML class loops through the annotations (around line 136):
> {code:java}
> for (Object o : page.getAnnotations()) {
> {code}
> I found some code for dealing with links in annotations:
> http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link
> It involves checking the annotation's class:
> {code:java}
> if (annotation instanceof PDAnnotationLink) {
>     PDAnnotationLink link = (PDAnnotationLink) annotation;
> {code}
> I hope this helps someone.

