[ https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088143#comment-13088143 ]
Andreas Lehmkühler commented on PDFBOX-1104:
--------------------------------------------
I haven't looked at the sources, but the description sounds like Adam's
approach to implementing a conforming parser (PDFBOX-1000).
> Improves parsing speed of a pdf by an average of 45% when extracting text
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1104
> URL: https://issues.apache.org/jira/browse/PDFBOX-1104
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing, Utilities
> Affects Versions: 1.6.0
> Reporter: Jeremy Villalobos
> Priority: Minor
> Fix For: 1.6.0
>
> Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java,
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file
> according to the PDF specification. A random page can be parsed without
> having to parse the entire document first. Existing parsing code was reused
> to carry existing bug fixes and compliance fixes over to this parser.
> The parser has been tested with the text extraction tool, but not with the
> viewer or other PDF tools. Some tools may need to be recoded to use the
> parser to prevent null pointer exceptions, since the COSDocument will
> contain null pointers for COSObjects that have not been parsed. For
> example, the current text extractor assumes the entire document is loaded.
> This code submission also includes a modified text extractor named
> OnePagePDFTextStripper. The class has a function that extracts the text
> from a PDPage supplied by the programmer.
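
To make the null pointer concern in the quoted description concrete, here is a
minimal Java sketch of on-demand object parsing. It does not use any PDFBox
classes and is not taken from the attached patch: LazyObjectPool, peek, resolve
and the parser callback are hypothetical names chosen for illustration only.
Code that assumes everything is already loaded behaves like peek (and may see
null), while code adapted to a lazy parser would go through something like
resolve.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a document whose objects are parsed on demand. */
final class LazyObjectPool {
    // Objects parsed so far, keyed by object number.
    private final Map<Integer, String> parsed = new HashMap<>();
    // Assumed callback that parses a single object from the file.
    private final IntFunction<String> parser;
    private int parseCount = 0;

    LazyObjectPool(IntFunction<String> parser) {
        this.parser = parser;
    }

    /** Returns the object only if it was already parsed, otherwise null. */
    String peek(int objectNumber) {
        return parsed.get(objectNumber);
    }

    /** Parses the object on first access and caches it for later calls. */
    String resolve(int objectNumber) {
        String value = parsed.get(objectNumber);
        if (value == null) {
            value = parser.apply(objectNumber);
            parseCount++;
            parsed.put(objectNumber, value);
        }
        return value;
    }

    /** Number of times the parser callback has been invoked. */
    int parseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        LazyObjectPool pool = new LazyObjectPool(n -> "object " + n);
        // A caller that assumes a fully loaded document sees null here.
        System.out.println(pool.peek(7));
        // A caller written for lazy parsing triggers the parse instead.
        System.out.println(pool.resolve(7));
        System.out.println(pool.parseCount());
    }
}
```

With this shape, only the objects a caller actually resolves are ever parsed,
which is the kind of saving the description reports for single-page text
extraction.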