[ https://issues.apache.org/jira/browse/PDFBOX-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782359#comment-16782359 ]
Tilman Hausherr edited comment on PDFBOX-4477 at 3/2/19 12:40 PM:
------------------------------------------------------------------

Yes, the file is encrypted, and the permissions probably prohibit text extraction, which is very common. I didn't try text extraction myself, I just looked at the file with PDFDebugger. You can still extract text programmatically, or use the owner password with the command line utility.


was (Author: tilman):
Yes the file is encrypted, and the permissions probably prohibit that. I didn't try text extraction, I just looked at the file with PDFDebugger. You can still extract text programmatically, or use the owner password with the command line utility.

> Large encrypted file takes days to be parsed
> --------------------------------------------
>
>                 Key: PDFBOX-4477
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4477
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Crypto, Parsing
>    Affects Versions: 2.0.14
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: optimization
>             Fix For: 2.0.15, 3.0.0 PDFBox
>
>
> As reported by [~slavago] in TIKA-2832. File is confidential but I have it.
> Initial findings:
> - File is AES256 encrypted with empty user password
> - File has about 1000 objects
> - File is a tagged PDF
> - HashMap in SecurityHandler grows to 100000?!
> - Using an IdentityHashMap speeds up the process dramatically (parsed in a
> few seconds), and it may also be a better solution than what was done in
> PDFBOX-4453
> Todo:
> - Read description of IdentityHashMap again
> - Find out why the HashMap grows so much. Could it be that identical objects
> are stored twice? Or does the file have many direct objects?
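
For readers unfamiliar with the two map types mentioned in the findings, here is a minimal sketch of how they differ, using only the JDK. It is not PDFBox code: the `Node` class, its call counter, and `distinctKeys` are hypothetical stand-ins, and the sketch does not try to reproduce the cause of the slowdown, which the description leaves open. It only shows that `HashMap` looks keys up through `equals`/`hashCode`, while `IdentityHashMap` compares references and never calls the key's own `hashCode`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class IdentityMapSketch {

    /** Hypothetical stand-in for a parsed object; not a PDFBox class. */
    static final class Node {
        static long hashCalls = 0;          // counts calls to the overridden hashCode()
        final List<Object> items = new ArrayList<>();

        @Override
        public boolean equals(Object o) {
            return o instanceof Node && items.equals(((Node) o).items);
        }

        @Override
        public int hashCode() {
            hashCalls++;
            return items.hashCode();        // walks the content on every call
        }
    }

    /** Puts every node into the given map and returns how many keys it ends up with. */
    static int distinctKeys(Map<Node, Boolean> seen, List<Node> nodes) {
        for (Node n : nodes) {
            seen.put(n, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Node a = new Node();
        a.items.add("x");
        Node b = new Node();
        b.items.add("x");                   // same content as a, different instance
        List<Node> nodes = Arrays.asList(a, b, a);

        Node.hashCalls = 0;
        int byValue = distinctKeys(new HashMap<>(), nodes);
        long valueHashCalls = Node.hashCalls;

        Node.hashCalls = 0;
        int byIdentity = distinctKeys(new IdentityHashMap<>(), nodes);
        long identityHashCalls = Node.hashCalls;

        System.out.println("HashMap keys: " + byValue
                + ", hashCode() calls: " + valueHashCalls);
        System.out.println("IdentityHashMap keys: " + byIdentity
                + ", hashCode() calls: " + identityHashCalls);
    }
}
```

Note the trade-off: with `IdentityHashMap`, two equal-but-distinct instances count as separate keys, so it only fits when "same object" really means "same reference".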