[ 
https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Villalobos updated PDFBOX-1104:
--------------------------------------

    Attachment: fast_parser.diff
                OnePagePDFTextStripper.java
                PagesNotExpectedHere.java
                ParseTester.java
                QuickParser.java

Attached are the files and the patch to use the partial parser.

> Improves parsing speed of a pdf by an average of 45% when extracting text 
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1104
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1104
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Utilities
>    Affects Versions: 1.6.0
>            Reporter: Jeremy Villalobos
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java, 
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file 
> according to the PDF specification.  A random page can be parsed without 
> having to parse the entire document first.  Existing parsing code was reused 
> so that existing bugfixes and compliance fixes carry over to this parser.
> The parser has been tested with the text extraction tool, but not with the 
> viewer or other PDF tools.  Some tools may need to be recoded to use the 
> parser without null pointer exceptions, since the COSDocument will contain 
> null pointers for COSObjects that have not been parsed.  For example, the 
> current text extractor assumes the entire document is loaded.  This 
> submission therefore also includes a modified text extractor named 
> OnePagePDFTextStripper.  The class has a method that extracts the text from 
> a PDPage supplied by the programmer.
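
The description above notes that, with partial parsing, the document holds null references for objects that have not been parsed yet, and that tools assuming a fully loaded document may fail. The following is a minimal sketch of that idea in plain Java. It is not the code from the attached patch and does not use the PDFBox API; LazyDocument, register, peek and resolve are hypothetical names chosen for illustration, and the "parsing" step is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a partially parsed document; an illustration, not PDFBox code. */
public class LazyDocumentSketch {

    /** Hypothetical stand-in for a document whose objects are parsed on demand. */
    static class LazyDocument {
        // Object number -> raw, unparsed source (stands in for a location in the file).
        private final Map<Integer, String> rawObjects = new HashMap<>();
        // Object number -> parsed value; absent until the object is first requested.
        private final Map<Integer, String> parsed = new HashMap<>();
        private int parseCount = 0;

        void register(int objectNumber, String raw) {
            rawObjects.put(objectNumber, raw);
        }

        /** Returns the parsed object, or null if it has not been parsed yet. */
        String peek(int objectNumber) {
            return parsed.get(objectNumber);
        }

        /** Parses the object on first access and caches the result. */
        String resolve(int objectNumber) {
            String value = parsed.get(objectNumber);
            if (value == null) {
                String raw = rawObjects.get(objectNumber);
                if (raw == null) {
                    throw new IllegalArgumentException("Unknown object " + objectNumber);
                }
                value = raw.trim(); // placeholder for the real parsing work
                parsed.put(objectNumber, value);
                parseCount++;
            }
            return value;
        }

        int parseCount() {
            return parseCount;
        }
    }

    public static void main(String[] args) {
        LazyDocument doc = new LazyDocument();
        doc.register(1, " page one text ");
        doc.register(2, " page two text ");
        doc.register(3, " page three text ");

        // A caller that assumes everything is loaded would see null here.
        System.out.println(doc.peek(2) == null); // prints "true"

        // Resolving one page parses only that page.
        System.out.println(doc.resolve(2));      // prints "page two text"
        System.out.println(doc.parseCount());    // prints "1"
    }
}
```

In this sketch, a caller that wants a single page goes through resolve, which parses on demand, while code written against a fully loaded document behaves like peek and has to be adapted to handle the null case.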

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
