[jira] [Commented] (PDFBOX-1104) Improves parsing speed of a pdf by an average of 45% when extracting text from one random page in the document.

JIRA Fri, 19 Aug 2011 22:52:13 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088143#comment-13088143
 ]


Andreas Lehmkühler commented on PDFBOX-1104:
--------------------------------------------

I didn't have a look at the sources, but the description sounds like Adam's 
approach to implementing a conforming parser (PDFBOX-1000).

> Improves parsing speed of a pdf by an average of 45% when extracting text 
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1104
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1104
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Utilities
>    Affects Versions: 1.6.0
>            Reporter: Jeremy Villalobos
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java, 
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file 
> according to the PDF specification.  A random page can be parsed without 
> having to parse the entire document first.  Existing parsing code was reused 
> to carry existing bugfixes and compliance fixes over to this parser.
> The parser has been tested with the text extraction tool, but it has not 
> been tested with the viewer or other PDF tools.  Some tools may need to be 
> recoded to use the parser to prevent null pointer exceptions, since the 
> COSDocument will contain null pointers for COSObjects that have not been 
> parsed.  For example, the current text extractor assumes the entire document 
> is loaded.  This code submission also includes a modified text extractor 
> named OnePagePDFTextStripper.  The class has a function that extracts the 
> text from a PDPage supplied by the programmer.
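
The null-pointer caveat in the description can be illustrated with a small,
self-contained sketch. This is not PDFBox code and not the attached
QuickParser: LazyObjectTable, peek, resolve and parseCount are hypothetical
names made up for illustration, and objects are modelled as plain strings.
It only shows the general idea that an object table filled on demand holds
null for entries that have not been parsed yet, so callers must either guard
against null or go through an accessor that parses on first use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical stand-in for a lazily populated object table (not PDFBox API).
class LazyObjectTable {
    // Objects that have already been parsed, keyed by object number.
    private final Map<Integer, String> loaded = new HashMap<>();
    // Assumed callback that parses one object from the file on demand.
    private final IntFunction<String> parser;
    private int parseCount = 0;

    LazyObjectTable(IntFunction<String> parser) {
        this.parser = parser;
    }

    // Returns the object only if it was parsed before; otherwise null.
    // Code that assumes a fully loaded document would fail on this null.
    String peek(int objectNumber) {
        return loaded.get(objectNumber);
    }

    // Parses the object on first access and caches it for later calls.
    String resolve(int objectNumber) {
        String obj = loaded.get(objectNumber);
        if (obj == null) {
            obj = parser.apply(objectNumber);
            parseCount++;
            loaded.put(objectNumber, obj);
        }
        return obj;
    }

    // Number of times the parser callback was actually invoked.
    int parseCount() {
        return parseCount;
    }
}
```

A tool written against a fully loaded document corresponds to calling peek
and dereferencing the result directly; a tool adapted to a partial parse
would call something like resolve, or check for null, before using an object.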

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
