Hong-Thai Nguyen created TIKA-1201:
--------------------------------------

             Summary: Add option for switching to pdfbox NonSequentialPDFParser
                 Key: TIKA-1201
                 URL: https://issues.apache.org/jira/browse/TIKA-1201
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.4
         Environment: all
            Reporter: Hong-Thai Nguyen
            Priority: Critical


As discussed, the new NonSequentialPDFParser can improve PDF extraction by 45%
and conforms more closely to the PDF specification. This parser will be
the default in PDFBox 2.0.

References:
https://issues.apache.org/jira/browse/PDFBOX-1104
http://pdfbox.apache.org/ideas.html

We should either provide an extended parser or add a parameter to the current
PDFParser so that it calls:
{code}
PDDocument.loadNonSeq(file, scratchFile);
{code}
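
One possible shape for the parameter approach is sketched below. This is only an illustration, not an existing Tika API: the class PdfLoaderSwitch, the PdfParserConfig holder and its useNonSequentialParser flag are hypothetical names. The two loaders are passed in as plain functions so the sketch runs without PDFBox on the classpath; in the real parser they would presumably wrap PDDocument.load(file) and PDDocument.loadNonSeq(file, scratchFile).

```java
import java.util.function.Function;

public class PdfLoaderSwitch {

    /** Hypothetical config holder; the flag name is an assumption, not a Tika setting. */
    public static final class PdfParserConfig {
        // Default to false so that existing behaviour is unchanged.
        private boolean useNonSequentialParser = false;

        public boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        public void setUseNonSequentialParser(boolean value) {
            this.useNonSequentialParser = value;
        }
    }

    /** Runs the non-sequential loader only when the flag is set. */
    public static <I, O> O load(PdfParserConfig config, I input,
                                Function<I, O> sequentialLoader,
                                Function<I, O> nonSequentialLoader) {
        return config.isUseNonSequentialParser()
                ? nonSequentialLoader.apply(input)
                : sequentialLoader.apply(input);
    }

    public static void main(String[] args) {
        PdfParserConfig config = new PdfParserConfig();

        // Stand-ins for the real loaders; they only tag the input name.
        Function<String, String> sequential = name -> "sequential:" + name;
        Function<String, String> nonSequential = name -> "nonsequential:" + name;

        System.out.println(load(config, "a.pdf", sequential, nonSequential));
        config.setUseNonSequentialParser(true);
        System.out.println(load(config, "a.pdf", sequential, nonSequential));
    }
}
```

Keeping the flag off by default would let users opt in per parse without changing results for anyone else.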



--
This message was sent by Atlassian JIRA
(v6.1#6144)
