[
https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088253#comment-13088253
]
Jeremy Villalobos commented on PDFBOX-1104:
-------------------------------------------
Yes, the main objective of this improvement is to load only the COSObjects
that are needed, which is possible because the PDF specification was written
with this in mind. PDFBOX-1000 differs in that it rewrites the PDFParser, which
can yield a cleaner, better-designed result, but judging by the comments on
that thread it is also running into parsing issues that were likely already
fixed in the first PDFParser implementation.
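To make the "load only what is needed" idea concrete, here is a minimal sketch, not the code in the attached patch. It assumes the cross-reference table has already been read into a map from object number to byte offset. The class LazyObjectTable and the parseAt callback are hypothetical names for illustration and are not part of PDFBox.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical illustration; none of these names are PDFBox classes.
final class LazyObjectTable {
    // Object number -> byte offset, as it would come from the xref table.
    private final Map<Integer, Long> offsets;
    // Assumed callback that parses the object stored at a given offset.
    private final LongFunction<String> parseAt;
    // Objects that have been parsed so far.
    private final Map<Integer, String> loaded = new HashMap<>();
    private int parseCount = 0;

    LazyObjectTable(Map<Integer, Long> offsets, LongFunction<String> parseAt) {
        this.offsets = offsets;
        this.parseAt = parseAt;
    }

    /** Returns the object, parsing it on first access only. */
    String get(int objectNumber) {
        String cached = loaded.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            throw new IllegalArgumentException("No xref entry for object " + objectNumber);
        }
        String parsed = parseAt.apply(offset);
        parseCount++;
        loaded.put(objectNumber, parsed);
        return parsed;
    }

    int parseCount() {
        return parseCount;
    }

    boolean isLoaded(int objectNumber) {
        return loaded.containsKey(objectNumber);
    }

    public static void main(String[] args) {
        Map<Integer, Long> xref = new HashMap<>();
        xref.put(1, 15L);
        xref.put(2, 74L);
        xref.put(3, 120L);
        // The lambda stands in for a real parser that would seek to the offset.
        LazyObjectTable table = new LazyObjectTable(xref, offset -> "object@" + offset);
        System.out.println(table.get(2));        // prints "object@74"
        System.out.println(table.get(2));        // served from the cache
        System.out.println(table.parseCount());  // prints 1
        System.out.println(table.isLoaded(1));   // prints false
    }
}
```

Objects 1 and 3 are never parsed in this run, which is the saving when only one page is requested.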
This improvement is more pragmatic: it simply adds dynamic loading to the
current PDFParser, in a dirtier but working way. As a result the feature,
based on my testing, works on all PDFs that can be read by the current parser.
As I mentioned, I would like some guidance on how to run an "official" battery
of tests to validate this claim fully. One example is the public domain
"Quijote de la mancha"
http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBoQFjAA&url=http%3A%2F%2Fwww.donquijote.org%2Fspanishlanguage%2Fliterature%2Flibrary%2Fquijote%2Fquijote1.pdf&ei=DAxQToD8M8zAtgeV7_mcBw&usg=AFQjCNGyui0WAmnJurwduE43dhujyTaq0g&sig2=bxArxe7Ucl92fr-oFPukjQ.
For this example I get a "java.lang.RuntimeException: Not yet implemented" on
PDFBOX-1000, while the improvement suggested here reads it fine. Here is the
code I used, in case I skipped a step when using the Conforming PDFParser:
// Load the document through the conforming parser from PDFBOX-1000.
ConformingPDDocument doc = (ConformingPDDocument) ConformingPDDocument.load(
        new File( test[k] ) );
// Extract the text of a single page.
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage( page );
stripper.setEndPage( page );
OutputStreamWriter output = new OutputStreamWriter(
        new FileOutputStream( "/path/conformaing_pdf_box_parser.txt" ) );
stripper.writeText( doc, output );
output.close();
doc.close();
Maybe I should submit this as a bug.
I am interested in gauging the community's interest in this improvement because
I may add parallel features to take advantage of dual- and quad-core ARM
processors (for Android). If Adam's implementation works well, I would switch
to it and add the parallelization there.
I think this improvement works "out-of-the-box", while PDFBOX-1000 is a
cleaner, better-designed approach that may not be ready yet.
> Improves parsing speed of a pdf by an average of 45% when extracting text
> from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1104
> URL: https://issues.apache.org/jira/browse/PDFBOX-1104
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing, Utilities
> Affects Versions: 1.6.0
> Reporter: Jeremy Villalobos
> Priority: Minor
> Fix For: 1.6.0
>
> Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java,
> ParseTester.java, QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file
> according to the PDF specification. A random page can be parsed without having
> to parse the entire document first. Existing parsing code was reused to carry
> existing bugfixes and compliance fixes over to this parser.
> The parser has been tested with the text extraction tool, but it has not been
> tested with the viewer or other PDF tools. Some tools may need to be recoded
> to use the parser without null pointer exceptions, since the COSDocument
> will contain null pointers for COSObjects that have not been parsed. For
> example, the current text extractor assumes the entire document is loaded.
> This code submission also includes a modified text extractor named
> OnePagePDFTextStripper. The class has a function that extracts the
> text from a PDPage supplied by the programmer.
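
The null pointer concern in the description can be illustrated with a small sketch. It assumes a tool reads objects from a document in which unparsed slots are null. PartialDocument, markParsed, and find are hypothetical names for illustration and do not correspond to the COSDocument API.

```java
import java.util.Optional;

// Hypothetical stand-in for a partially parsed document; not a PDFBox class.
final class PartialDocument {
    // Slot i holds object i, or null if that object has not been parsed.
    private final String[] objects;

    PartialDocument(int objectCount) {
        this.objects = new String[objectCount];
    }

    void markParsed(int objectNumber, String value) {
        objects[objectNumber] = value;
    }

    /** Empty when the object is out of range or has not been parsed yet. */
    Optional<String> find(int objectNumber) {
        if (objectNumber < 0 || objectNumber >= objects.length) {
            return Optional.empty();
        }
        return Optional.ofNullable(objects[objectNumber]);
    }

    public static void main(String[] args) {
        PartialDocument doc = new PartialDocument(4);
        doc.markParsed(2, "page contents");
        // A tool can skip or report unparsed objects instead of dereferencing null.
        for (int i = 0; i < 4; i++) {
            System.out.println(i + ": " + doc.find(i).orElse("<not parsed>"));
        }
    }
}
```

A tool written against such an accessor has to decide explicitly what to do with an unparsed object, rather than failing later with a NullPointerException.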