[ https://issues.apache.org/jira/browse/PDFBOX-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668860#action_12668860 ]
Adrian Romano commented on PDFBOX-413:
--------------------------------------
If you are just trying to extract text, I don't see any reason to use the
LucenePDFDocument class. The following code will get the text from the google
PDF:
PDDocument document = PDDocument.load( "c:\\google.pdf");
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition( false );
stripper.setStartPage( 1 );
stripper.setEndPage( Integer.MAX_VALUE );
OutputStreamWriter writer = new OutputStreamWriter( new FileOutputStream( "c:\\google.txt" ) );
stripper.writeText( document, writer );
writer.close();
document.close();
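One caveat about the writer in the snippet above: an OutputStreamWriter created without a charset uses the platform default encoding, and the writer is never closed if writeText throws. Below is a minimal sketch of one way to handle both, using only the JDK and written to compile on Java 6. The class Utf8Output and its methods are hypothetical names made up for this illustration and are not part of PDFBox; the writer.write( text ) call stands in for stripper.writeText( document, writer ).

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical helper for illustration only; not part of PDFBox.
class Utf8Output {

    // Opens a Writer that always encodes as UTF-8 instead of the platform default.
    static Writer open(File target) throws IOException {
        return new OutputStreamWriter(new FileOutputStream(target), "UTF-8");
    }

    // Writes the text and closes the writer even if the write fails.
    // In the real program, writer.write(text) would be replaced by
    // stripper.writeText(document, writer).
    static void writeAndClose(File target, String text) throws IOException {
        Writer writer = open(target);
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("extracted", ".txt");
        writeAndClose(out, "Sergey Brin and Lawrence Page\n");
        System.out.println("Wrote " + out.length() + " bytes to " + out);
    }
}
```

The same try/finally pattern would apply to document.close(), so that the PDF is released even when extraction fails.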
You can also look at ExtractText.java. If you are actually trying to use
the Lucene integration, then I can't help you.
> Text Extraction Does Not Extract Content Beyond First Page
> ----------------------------------------------------------
>
> Key: PDFBOX-413
> URL: https://issues.apache.org/jira/browse/PDFBOX-413
> Project: PDFBox
> Issue Type: Bug
> Environment: Ubuntu, OpenJDK 6
> Reporter: alvin
> Attachments: google.pdf
>
>
> This is my attempt to extract plain text from a PDF using PDFBox:
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.setStartPage( 1);
> stripper.setEndPage( 5 );
> LucenePDFDocument document = new LucenePDFDocument();
> Document luceneDocument = document.convertDocument(file);
> System.out.println("CONTENTS: "+luceneDocument.get("contents"));
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
> This is the result I get, and it never goes beyond page 1:
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Document<stored/uncompressed<path:/home/alvin/Desktop/google.pdf>
> stored/uncompressed<url:/home/alvin/Desktop/google.pdf>
> stored/uncompressed,indexed<modified:20090130112759> indexed<uid:
> Web Search Engine
> Sergey Brin and Lawrence Page
> Computer Science Department,
> Stanford University, Stanford, CA 94305, USA
> [email protected] and [email protected]
> Abstract
> In this paper, we present Google, a prototype of a large-scale search engine
> which makes heavy
> use of the structure present in hypertext. Google is designed to crawl and
> index the Web efficiently
> and produce much more satisfying search results than existing systems. The
> proto>>
> Is this a bug?