Re: extracting text from an "encrypted" pdf

Tilman Hausherr Fri, 08 May 2015 08:45:19 -0700

On 08.05.2015 at 17:36, Clemens Wyss DEV wrote:

When I try to extract text from an "encrypted" document (one that can still be 
read in AcrobatReader) with:


pdfDocument = PDDocument.load( is );


add

if( pdfDocument.isEncrypted() )
{
    StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
    pdfDocument.openProtection( sdm );
}

or use loadNonSeq()

PDFTextStripper pdfStripper = new PDFTextStripper();
parsedText = pdfStripper.getText( pdfDocument );

I get an empty string, and "o.apache.pdfbox.pdfparser.PDFParser - Document is 
encrypted" is logged.

When, on the other hand, I do:

ContentHandler handler = new BodyContentHandler( -1 );
ParseContext context = new ParseContext();
parser = new AutoDetectParser();
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context );
parsedText = handler.toString();

I get to see the text/content of the very pdf.

1) What is the preferred way to extract text from a 
pdf("-that-can-be-read-in-AcrobatReader")?

https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java?view=markup&sortby=date
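
For reference, the decrypt-then-extract steps suggested above can be put together roughly as follows. This is only a minimal sketch, assuming PDFBox 1.8.x on the classpath (decryption in 1.8 may also need the Bouncy Castle jars). The class name ExtractEncryptedText and the extract() helper are made up for illustration. The empty default password is an assumption: a file that opens in AcrobatReader without a prompt usually has an empty user password.

```java
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.util.PDFTextStripper;

// Hypothetical helper class, not part of PDFBox.
public class ExtractEncryptedText {

    // Loads the PDF, decrypts it if necessary, and returns its text.
    static String extract(File pdf, String password) throws Exception {
        PDDocument document = PDDocument.load(pdf);
        try {
            if (document.isEncrypted()) {
                // Without this step the stripper returns an empty string.
                document.openProtection(new StandardDecryptionMaterial(password));
            }
            return new PDFTextStripper().getText(document);
        } finally {
            // Always release the document's resources.
            document.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: an empty user password unless one is given as 2nd argument.
        String password = args.length > 1 ? args[1] : "";
        System.out.println(extract(new File(args[0]), password));
    }
}
```

Run it with the PDF path as the first argument and, optionally, the password as the second. The ExtractText.java source linked above remains the fuller reference.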

2) Does the second approach possibly return "more than text"? Blobs? Binary data?


That is TIKA, isn't it?
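
Whichever library produced the string, one way to check question 2 empirically is to count characters that plain text should not normally contain. The following is a rough sketch using only the JDK. The class ExtractedTextCheck and the method countSuspicious() are hypothetical names, and treating non-whitespace control characters and U+FFFD as "suspicious" is an assumption of this sketch, not a rule from TIKA or PDFBox.

```java
// Hypothetical helper, JDK only: flags characters that hint at binary residue.
public class ExtractedTextCheck {

    // Counts control characters (other than whitespace such as newline or tab)
    // and the Unicode replacement character U+FFFD.
    static long countSuspicious(String text) {
        return text.codePoints()
                .filter(cp -> cp == 0xFFFD
                        || (Character.isISOControl(cp) && !Character.isWhitespace(cp)))
                .count();
    }

    public static void main(String[] args) {
        // Stand-in for the parsedText obtained from handler.toString().
        String parsedText = "Hello\nWorld\t!";
        System.out.println(countSuspicious(parsedText)); // prints 0
    }
}
```

A count well above zero would suggest that something other than text ended up in the result.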

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org


