Lior created TIKA-2702:
--------------------------

             Summary: Different behavior between TIKA and pdfbox
                 Key: TIKA-2702
                 URL: https://issues.apache.org/jira/browse/TIKA-2702
             Project: Tika
          Issue Type: Bug
          Components: app
    Affects Versions: 1.18
            Reporter: Lior


As far as I understand, TIKA uses PDFBox to extract text from PDF files.

During a side benchmark, I noticed that the text I get from PDFBox 2.0.9 and 
the text I get from TIKA are not 100% identical. In most cases, when there is a 
hyperlink inside the PDF file, PDFBox ignores the link itself, while TIKA 
extracts it as text, for example:

https://www.linkedin.com/in/jhonDo
mailto:[[email protected] |mailto:[email protected]]
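
To narrow down exactly which lines differ, the two extracted strings can be 
compared line by line. The following is only a minimal sketch, assuming both 
outputs are already available as strings; the class name, the method name, and 
the sample strings are made up for illustration and are not part of TIKA or 
PDFBox:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExtractionDiff {

    /**
     * Returns the non-blank lines of candidate that do not occur
     * anywhere in reference. Lines are compared after trimming.
     */
    public static List<String> extraLines(String reference, String candidate) {
        Set<String> seen = new HashSet<>();
        for (String line : reference.split("\\R")) {
            seen.add(line.trim());
        }
        List<String> extra = new ArrayList<>();
        for (String line : candidate.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !seen.contains(trimmed)) {
                extra.add(trimmed);
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder outputs, not taken from a real PDF.
        String pdfboxText = "Some heading\nSome body text\n";
        String tikaText = "Some heading\nSome body text\n"
                + "https://www.linkedin.com/in/jhonDo\n";
        // Prints the lines that only the second extraction produced.
        System.out.println(extraLines(pdfboxText, tikaText));
    }
}
```

Running this on the real outputs of both tools should list only the additional 
lines, such as the link text shown above.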

 

This is a deal breaker for me: I use PDFBox in another process and need the 
extracted text to be identical, so I can't use TIKA at the moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
