[jira] [Commented] (PDFBOX-1104) Improves parsing speed of a pdf by an average of 45% when extracting text from one random page in the document.

Mel Martinez (JIRA) Thu, 18 Aug 2011 14:10:53 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087287#comment-13087287
 ]


Mel Martinez commented on PDFBOX-1104:
--------------------------------------

How does this compare in performance with, say, using the 
PDFTextStripper.setStartPage(int) and PDFTextStripper.setEndPage(int) methods?

Those already skip text extraction for every page outside the specified range.  
They don't avoid some preamble object creation for each page (i.e. a 'PDPage' 
object is still created), but if I recall the profiling I did last year, the 
vast bulk of the processing time is within the scope of the 
PDFTextStripper.processPage() method, which is basically skipped over for each 
page outside the start/end range.
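
To make the point above concrete, here is a toy model of that loop. It is not 
PDFBox code: PageRangeSketch, Page, Result and run() are hypothetical names 
invented for illustration, and the 1-based, inclusive page bounds are an 
assumption. It only shows that every page is still visited (the cheap preamble 
work) while the expensive step runs for in-range pages only.

```java
import java.util.ArrayList;
import java.util.List;

public class PageRangeSketch {

    /** Hypothetical stand-in for a lightweight per-page object (not PDFBox's PDPage). */
    static final class Page {
        final int number;

        Page(int number) {
            this.number = number;
        }
    }

    /** Records how many pages were visited and which ones were fully processed. */
    static final class Result {
        final int pagesVisited;
        final List<Integer> pagesProcessed;

        Result(int pagesVisited, List<Integer> pagesProcessed) {
            this.pagesVisited = pagesVisited;
            this.pagesProcessed = pagesProcessed;
        }
    }

    /**
     * Walks all pages, but only "processes" those inside [startPage, endPage].
     * Page numbers are assumed to be 1-based and the bounds inclusive.
     */
    static Result run(int pageCount, int startPage, int endPage) {
        int visited = 0;
        List<Integer> processed = new ArrayList<>();
        for (int n = 1; n <= pageCount; n++) {
            Page page = new Page(n); // cheap preamble work, done for every page
            visited++;
            if (n >= startPage && n <= endPage) {
                // Stands in for the expensive per-page text extraction step.
                processed.add(page.number);
            }
        }
        return new Result(visited, processed);
    }

    public static void main(String[] args) {
        // Extract a single page (42) from a 100-page document.
        Result r = run(100, 42, 42);
        System.out.println("visited=" + r.pagesVisited
                + " processed=" + r.pagesProcessed);
    }
}
```

In this model the per-page object is still created for all 100 pages, which is 
the overhead a minimal parser could remove; whether that overhead is 
significant in practice is exactly the question for the proposal.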

So I'd be interested to know what your proposal is adding here. 

> Improves parsing speed of a pdf by an average of 45% when extracting text 
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1104
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1104
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Utilities
>    Affects Versions: 1.6.0
>            Reporter: Jeremy Villalobos
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java, 
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file 
> according to the PDF specification.  A random page can be parsed without 
> having to parse the entire document first.  Existing parsing code was reused 
> to carry existing bugfixes and compliance fixes over to this parser.
> The parser has been tested with the text extraction tool, but has not been 
> tested with the viewer or other PDF tools.  Some tools may need to be recoded 
> to use the parser to prevent null pointer exceptions, since the COSDocument 
> will contain null pointers for COSObjects that have not been parsed.  For 
> example, the current text extractor assumes the entire document is loaded.  
> This code submission also includes a modified text extractor named 
> OnePagePDFTextStripper.  The class has a function that extracts the text 
> from a PDPage supplied by the programmer.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
