[jira] [Updated] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

Clemens Wyss (JIRA) Sun, 22 Dec 2013 06:23:14 -0800

     [ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Clemens Wyss updated TIKA-1213:
-------------------------------

    Attachment: takes3mins.pdf

5 MB PDF that takes 3 minutes to parse.

> Parsing (extracting content) a single 5Mb pdf file takes 3minutes
> -----------------------------------------------------------------
>
>                 Key: TIKA-1213
>                 URL: https://issues.apache.org/jira/browse/TIKA-1213
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: Probably not relevant (apart from the PDF file itself)
> + Windows 7 (8 GB memory)
> + Java 6
> + Tika 1.5 (and 1.5 snapshot)
>            Reporter: Clemens Wyss
>            Priority: Critical
>         Attachments: takes3mins.pdf
>
>
> When I parse the attached PDF (extracting all of its content for Lucene), the 
> extraction takes 3 minutes. The slowdown seems specific to this file, although I 
> have other PDFs that behave similarly.
> My (unit testing) code looks like this:
> ...
> Metadata metadata = new Metadata();
> Parser parser = new AutoDetectParser();
> // A write limit of -1 disables the default cap on extracted characters.
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> // Register the parser so that embedded documents are parsed recursively.
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> returnValue = handler.toString();
> ...
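
To put a number on the slowdown, the extraction call can be wrapped in a small timing helper. The following is only a sketch using the JDK alone: ExtractionTimer, Timed and time are hypothetical names that are not part of Tika, and the extraction itself is passed in as a Callable so that the helper does not depend on any particular parser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of Tika): measures how long one extraction takes.
class ExtractionTimer {

    /** Result of one timed run: the extracted text and the elapsed wall-clock time. */
    static final class Timed {
        final String text;
        final long elapsedMillis;

        Timed(String text, long elapsedMillis) {
            this.text = text;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /**
     * Runs the given extraction once and records the elapsed wall-clock time.
     * Checked exceptions from the extraction are rethrown as RuntimeException.
     */
    static Timed time(Callable<String> extraction) {
        long start = System.nanoTime();
        String text;
        try {
            text = extraction.call();
        } catch (Exception e) {
            throw new RuntimeException("extraction failed", e);
        }
        long elapsedNanos = System.nanoTime() - start;
        return new Timed(text, TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }
}
```

Placing the parser.parse( ... ) call and handler.toString() from the snippet above inside call() would give a per-file figure to compare against other PDFs. A single run includes JVM warm-up and parser initialisation, so repeating the measurement gives a fairer picture.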



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
