Hong-Thai Nguyen created TIKA-1201:
--------------------------------------

             Summary: Add option for switching to pdfbox NonSequentialPDFParser
                 Key: TIKA-1201
                 URL: https://issues.apache.org/jira/browse/TIKA-1201
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.4
         Environment: all
            Reporter: Hong-Thai Nguyen
            Priority: Critical

As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be integrated by default in PDFBox 2.0.

ref.:
https://issues.apache.org/jira/browse/PDFBOX-1104
http://pdfbox.apache.org/ideas.html

We should provide an extended parser, or add a parameter to the current PDFParser, so that it calls:
{code}
PDDocument.loadNonSeq(file, scratchFile);
{code}
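
For discussion, here is a rough sketch of how the "parameter" variant could look. It is only an illustration: the class name PdfParserSwitchSketch, the PdfParserConfig holder, the useNonSequentialParser flag and the DocumentLoader interface are all hypothetical and are not existing Tika or PDFBox API. The two loaders are stand-ins that return a label instead of opening a file, so the sketch runs without PDFBox on the classpath. In a real patch they would presumably wrap the existing load call and PDDocument.loadNonSeq(file, scratchFile).

```java
/** Hypothetical sketch of the proposed switch; not Tika's or PDFBox's actual API. */
public class PdfParserSwitchSketch {

    /** Stand-in for "something that opens a PDF"; assumed for illustration only. */
    interface DocumentLoader {
        String load(String fileName);
    }

    /** Hypothetical config holder; the flag name is an assumption. */
    static class PdfParserConfig {
        // Off by default so that existing behaviour does not change.
        private boolean useNonSequentialParser = false;

        boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        void setUseNonSequentialParser(boolean useNonSequentialParser) {
            this.useNonSequentialParser = useNonSequentialParser;
        }
    }

    // Placeholder for the current (sequential) loading path.
    static final DocumentLoader SEQUENTIAL = name -> "sequential:" + name;

    // Placeholder for the path that would call PDDocument.loadNonSeq(file, scratchFile).
    static final DocumentLoader NON_SEQUENTIAL = name -> "nonSequential:" + name;

    /** Picks the loading path from the flag. */
    static DocumentLoader chooseLoader(PdfParserConfig config) {
        return config.isUseNonSequentialParser() ? NON_SEQUENTIAL : SEQUENTIAL;
    }

    public static void main(String[] args) {
        PdfParserConfig config = new PdfParserConfig();
        // Default: the current parser is used.
        System.out.println(chooseLoader(config).load("sample.pdf"));

        // Opt in to the non-sequential parser.
        config.setUseNonSequentialParser(true);
        System.out.println(chooseLoader(config).load("sample.pdf"));
    }
}
```

Keeping the flag off by default would let users opt in per parser instance while the new parser is still being evaluated, and the default could be flipped once PDFBox 2.0 makes it standard.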