[jira] [Commented] (PDFBOX-1104) Improves parsing speed of a pdf by an average of 45% when extracting text from one random page in the document.

Jeremy Villalobos (JIRA) Sat, 20 Aug 2011 12:47:54 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088253#comment-13088253
 ]


Jeremy Villalobos commented on PDFBOX-1104:
-------------------------------------------

Yes, the main objective of this improvement is to load only the COSObjects 
needed, which is possible because the PDF specification was written with this 
in mind.  PDFBOX-1000 differs in that it rewrites the PDFParser, which can 
bring a cleaner, better designed improvement, but judging by the comments on 
that thread it is also encountering parsing issues that were likely already 
fixed in the first PDFParser implementation.
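
To illustrate the general idea of loading objects on demand, here is a minimal 
sketch.  It is not the code in fast_parser.diff and it does not use PDFBox 
classes: LazyObjectTable, its methods, and the use of String as a stand-in for 
a parsed COSObject are all hypothetical names chosen for this example.  It 
assumes the cross-reference table has already been read into a map from object 
number to byte offset, and that some function can parse one object at a given 
offset.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical illustration of on-demand object loading; not a PDFBox class. */
class LazyObjectTable {
    // Object number -> byte offset, as it would be read from the xref table.
    private final Map<Integer, Long> offsets;
    // Object number -> parsed object, filled in only when requested.
    private final Map<Integer, String> parsed = new HashMap<>();
    // Stand-in for "seek to this offset and parse one object".
    private final LongFunction<String> parseAtOffset;
    private int parseCount = 0;

    LazyObjectTable(Map<Integer, Long> xref, LongFunction<String> parseAtOffset) {
        this.offsets = new HashMap<>(xref);
        this.parseAtOffset = parseAtOffset;
    }

    /** Returns the object, parsing it on first access; null if it is not in the xref. */
    String get(int objectNumber) {
        String cached = parsed.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            return null;
        }
        parseCount++;
        String object = parseAtOffset.apply(offset);
        parsed.put(objectNumber, object);
        return object;
    }

    /** True once the object has been parsed and cached. */
    boolean isLoaded(int objectNumber) {
        return parsed.containsKey(objectNumber);
    }

    /** Number of objects actually parsed so far. */
    int getParseCount() {
        return parseCount;
    }
}

class LazyObjectTableDemo {
    public static void main(String[] args) {
        Map<Integer, Long> xref = new HashMap<>();
        xref.put(1, 15L);
        xref.put(2, 120L);
        xref.put(3, 480L);
        // The "parser" here just records which offset it was asked to read.
        LazyObjectTable table = new LazyObjectTable(xref, offset -> "object@" + offset);
        System.out.println(table.get(2));
        System.out.println("parsed so far: " + table.getParseCount());
    }
}
```

In this sketch only object 2 is parsed; objects 1 and 3 stay unloaded until 
something asks for them, which is the behaviour a caller has to be prepared 
for when the whole document is not read up front.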

This improvement is more pragmatic: it simply adds dynamic loading to the 
current PDFParser, in a dirtier but practical way.  As a result the feature, 
based on my testing, works on all PDFs that can be read by the current parser.  
As I mentioned, I would like some guidance on how to run an "official" battery 
of tests to validate this claim fully.  One example is the public domain 
"Quijote de la mancha":
http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBoQFjAA&url=http%3A%2F%2Fwww.donquijote.org%2Fspanishlanguage%2Fliterature%2Flibrary%2Fquijote%2Fquijote1.pdf&ei=DAxQToD8M8zAtgeV7_mcBw&usg=AFQjCNGyui0WAmnJurwduE43dhujyTaq0g&sig2=bxArxe7Ucl92fr-oFPukjQ
For this example I get a "java.lang.RuntimeException: Not yet implemented" on 
PDFBOX-1000, while the improvement suggested here reads it fine.  Here is the 
code I used, in case there is a step I skipped when using the Conforming 
PDFParser:

ConformingPDDocument doc = (ConformingPDDocument) ConformingPDDocument.load(
        new File( test[k] ) );
PDFTextStripper stripper = new PDFTextStripper();
// Restrict extraction to a single page.
stripper.setStartPage( page );
stripper.setEndPage( page );
OutputStreamWriter output = new OutputStreamWriter(
        new FileOutputStream( "/path/conformaing_pdf_box_parser.txt" ) );
stripper.writeText( doc, output );
output.close();
doc.close();

Maybe I should submit this as a bug.  

I would like to know how much interest the community has in this improvement, 
because I may add parallel features to take advantage of dual- and quad-core 
ARM processors (for Android).  If Adam's implementation works well, I would 
switch to it and add the parallelization there instead.

I think this improvement works "out of the box", while PDFBOX-1000 is a 
cleaner, better designed approach that may not be ready yet.

> Improves parsing speed of a pdf by an average of 45% when extracting text 
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1104
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1104
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Utilities
>    Affects Versions: 1.6.0
>            Reporter: Jeremy Villalobos
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java, 
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file 
> according to the PDF specification.  A random page can be parsed without 
> having to parse the entire document first.  Existing parsing code was reused 
> to carry existing bugfixes and compliance fixes over to this parser.
> The parser has been tested with the text extraction tool, but has not been 
> tested with the viewer or other PDF tools.  Some tools may need to be recoded 
> to use the parser to prevent null pointer exceptions, since the COSDocument 
> will contain null pointers for COSObjects that have not been parsed.  For 
> example, the current text extractor assumes the entire document is loaded.  
> This code submission also includes a modified text extractor named 
> OnePagePDFTextStripper.  The class has a method that extracts the text from 
> a PDPage supplied by the programmer.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
