[jira] Resolved: (PDFBOX-413) Text Extraction Does Not Extract Content Beyond First Page

JIRA Fri, 30 Jan 2009 07:57:24 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler resolved PDFBOX-413.
---------------------------------------

    Resolution: Invalid

Thanks for your help, Adrian.

> Text Extraction Does Not Extract Content Beyond First Page
> ----------------------------------------------------------
>
>                 Key: PDFBOX-413
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-413
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Ubuntu, OpenJDK 6
>            Reporter: alvin
>         Attachments: google.pdf
>
>
> This is my attempt to extract plain text from a PDF using PDFBox:
>       PDFTextStripper stripper = new PDFTextStripper();
>       stripper.setStartPage(1);
>       stripper.setEndPage(5);
>       LucenePDFDocument document = new LucenePDFDocument();
>       Document luceneDocument = document.convertDocument(file);
>       System.out.println("CONTENTS: " + luceneDocument.get("contents"));
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
> This is the result I get, and it never goes beyond page 1:
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Document<stored/uncompressed<path:/home/alvin/Desktop/google.pdf> 
> stored/uncompressed<url:/home/alvin/Desktop/google.pdf> 
> stored/uncompressed,indexed<modified:20090130112759> indexed<uid:
> Web Search Engine
> Sergey Brin and Lawrence Page 
> Computer Science Department,
> Stanford University, Stanford, CA 94305, USA
> [email protected] and [email protected] 
> Abstract 
> In this paper, we present Google, a prototype of a large-scale search engine 
> which makes heavy
> use of the structure present in hypertext. Google is designed to crawl and 
> index the Web efficiently
> and produce much more satisfying search results than existing systems. The 
> proto>>
> Is this a bug?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
