Text Extraction Does Not Extract Content Beyond First Page
----------------------------------------------------------
Key: PDFBOX-413
URL: https://issues.apache.org/jira/browse/PDFBOX-413
Project: PDFBox
Issue Type: Bug
Environment: Ubuntu, OpenJDK 6
Reporter: alvin
This is my attempt to extract plain text from a PDF using PDFBox:
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(5);
LucenePDFDocument document = new LucenePDFDocument();
Document luceneDocument = document.convertDocument(file);
System.out.println("CONTENTS: "+luceneDocument.get("contents"));
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is the result I get, and it never goes beyond page 1:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Document<stored/uncompressed<path:/home/alvin/Desktop/google.pdf>
stored/uncompressed<url:/home/alvin/Desktop/google.pdf>
stored/uncompressed,indexed<modified:20090130112759> indexed<uid:
Web Search Engine
Sergey Brin and Lawrence Page
Computer Science Department,
Stanford University, Stanford, CA 94305, USA
[email protected] and [email protected]
Abstract
In this paper, we present Google, a prototype of a large-scale search engine
which makes heavy
use of the structure present in hypertext. Google is designed to crawl and
index the Web efficiently
and produce much more satisfying search results than existing systems. The
proto>>
Is this a bug?