[jira] Commented: (PDFBOX-413) Text Extraction Does Not Extract Content Beyond First Page

Adrian Romano (JIRA) Fri, 30 Jan 2009 05:24:24 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668860#action_12668860
 ]


Adrian Romano commented on PDFBOX-413:
--------------------------------------

If you are just trying to extract text, I don't see any reason to use the 
LucenePDFDocument class. The following code will get the text from the 
attached google.pdf:

// Load the PDF and configure the stripper to cover every page.
PDDocument document = PDDocument.load( "c:\\google.pdf" );
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition( false );
stripper.setStartPage( 1 );
stripper.setEndPage( Integer.MAX_VALUE );

// Write the extracted text to a file, then release both resources.
OutputStreamWriter writer = new OutputStreamWriter(
    new FileOutputStream( "c:\\google.txt" ) );
stripper.writeText( document, writer );
writer.close();
document.close();

You can also look at ExtractText.java. If you are actually trying to use 
the Lucene integration, then I can't help you.

> Text Extraction Does Not Extract Content Beyond First Page
> ----------------------------------------------------------
>
>                 Key: PDFBOX-413
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-413
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Ubuntu, OpenJDK 6
>            Reporter: alvin
>         Attachments: google.pdf
>
>
> This is my attempt to extract plain text from a PDF using PDFBox:
>       PDFTextStripper stripper = new PDFTextStripper();
>       stripper.setStartPage( 1 );
>       stripper.setEndPage( 5 );
>       LucenePDFDocument document = new LucenePDFDocument();
>       Document luceneDocument = document.convertDocument(file);
>       System.out.println("CONTENTS: "+luceneDocument.get("contents"));
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
> This is the result I get, and it never goes beyond page 1:
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Document<stored/uncompressed<path:/home/alvin/Desktop/google.pdf> 
> stored/uncompressed<url:/home/alvin/Desktop/google.pdf> 
> stored/uncompressed,indexed<modified:20090130112759> indexed<uid:
> Web Search Engine
> Sergey Brin and Lawrence Page 
> Computer Science Department,
> Stanford University, Stanford, CA 94305, USA
> [email protected] and [email protected] 
> Abstract 
> In this paper, we present Google, a prototype of a large-scale search engine 
> which makes heavy
> use of the structure present in hypertext. Google is designed to crawl and 
> index the Web efficiently
> and produce much more satisfying search results than existing systems. The 
> proto>>
> Is this a bug?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
