When I try to extract text from an "encrypted" document (one that can still be 
opened in Acrobat Reader) with:

// is: InputStream of the PDF; parsedText: a String declared elsewhere
PDDocument pdfDocument = PDDocument.load( is );
PDFTextStripper pdfStripper = new PDFTextStripper();
parsedText = pdfStripper.getText( pdfDocument );

I get an empty string, and "o.apache.pdfbox.pdfparser.PDFParser - Document is 
encrypted" is logged.

When, on the other hand, I do:

ContentHandler handler = new BodyContentHandler( -1 ); // -1 disables the write limit
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
context.set( Parser.class, parser ); // lets embedded documents be parsed recursively
parser.parse( is, handler, metadata, context );
parsedText = handler.toString();

I do get the text content of that same PDF.

1) What is the preferred way to extract text from a PDF that can be read in 
Acrobat Reader?
2) Can the second approach return more than text, e.g. blobs or binary 
data?
