[jira] [Commented] (PDFBOX-1821) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

Clemens Wyss (JIRA) Mon, 23 Dec 2013 20:49:12 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856135#comment-13856135
 ]


Clemens Wyss commented on PDFBOX-1821:
--------------------------------------

>it's not allowed to extract the content
Where/how (in Java code?) do I get this information? So is the content 
"encrypted" or "obfuscated" to complicate the parsing?
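
For context on the question above: in an encrypted PDF, the "not allowed to extract" restriction is a permission flag stored in the /P entry of the encryption dictionary, separate from any obfuscation of the text itself. PDFBox appears to expose it through PDDocument.isEncrypted() and PDDocument.getCurrentAccessPermission().canExtractContent(). The sketch below does not call PDFBox. It only illustrates, with plain Java, how such a flag can be read from the integer /P value, assuming the 1-based bit numbering of the PDF specification (bit 5 = copy/extract text and graphics). The class name and the sample /P values are hypothetical and are not taken from takes3mins.pdf.

```java
public class PdfPermissionBits {
    // Bit positions are 1-based, following the PDF specification's
    // table of user access permissions (assumption stated in the lead-in).
    public static final int EXTRACT_BIT = 5;                    // copy/extract text and graphics
    public static final int EXTRACT_FOR_ACCESSIBILITY_BIT = 10; // extraction for accessibility tools

    // Returns true if the given 1-based bit is set in the /P permission value.
    public static boolean isBitSet(int permissions, int bit) {
        return (permissions & (1 << (bit - 1))) != 0;
    }

    // Returns true if the /P value permits content extraction.
    public static boolean canExtractContent(int permissions) {
        return isBitSet(permissions, EXTRACT_BIT);
    }

    public static void main(String[] args) {
        // Hypothetical /P values, for illustration only.
        int allAllowed = -4;   // every permission bit set, reserved bits 1-2 clear
        int noExtract = -20;   // same as above, but with bit 5 cleared

        System.out.println("allAllowed can extract: " + canExtractContent(allAllowed));
        System.out.println("noExtract can extract:  " + canExtractContent(noExtract));
    }
}
```

Whether this flag has anything to do with the 3-minute runtime is not established here; it only answers where the permission information lives.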

> Parsing (extracting content) a single 5Mb pdf file takes 3minutes
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-1821
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1821
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Win7 (8G memory)
> Java 6
>            Reporter: Clemens Wyss
>         Attachments: takes3mins.pdf
>
>
> When I try to extract the attached pdf-file with the following code:
> ...
> PDFTextStripper stripper = new PDFTextStripper();
> OutputStream os = null;
> Writer writer = null;
> PDDocument document = null;
> File file = new File( "takes3mins.pdf" );
> ...
>             document = PDDocument.load(file );
>  
>             File outFile = new File("c:/tmp/gugus.txt");
>             os = new FileOutputStream(outFile);
>             writer = new OutputStreamWriter(os);
>  
>             stripper.writeText(document, writer);
> ...
> it takes approx. 3 minutes. Opening it in Acrobat Reader takes only a few seconds.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
