[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

Ken Krugler (JIRA) Sun, 22 Dec 2013 07:56:12 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855197#comment-13855197
 ]


Ken Krugler commented on TIKA-1213:
-----------------------------------

Hi Clemens - since Tika just "wraps" the PDFBox parser for processing PDFs, I 
think a good first step would be to post on the PDFBox mailing list, asking if 
certain types of PDFs are known to cause this type of performance problem. It 
could be that this is a known issue without a good solution, unfortunately.
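
Before posting there, it may help to confirm where the time actually goes. The 
sketch below is only an illustration, not code from Tika or PDFBox: 
ExtractionTimer, Timed and time() are hypothetical names, it assumes Java 8 or 
later for the lambda syntax, and the two callables are stand-ins for the real 
extraction through Tika and through PDFBox directly.

```java
import java.util.concurrent.Callable;

public class ExtractionTimer {

    /** Result of one timed run: the value produced and the elapsed wall-clock time. */
    public static final class Timed<T> {
        public final T value;
        public final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    /** Runs the given step once and reports how long it took, in milliseconds. */
    public static <T> Timed<T> time(Callable<T> step) throws Exception {
        long start = System.nanoTime();
        T value = step.call();
        long elapsedMillis = (System.nanoTime() - start) / 1000000L;
        return new Timed<T>(value, elapsedMillis);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in (assumption): replace the body with the Tika call from the
        // issue, i.e. parser.parse(is, handler, metadata, context).
        Timed<String> viaTika = time(() -> "text extracted through Tika");

        // Stand-in (assumption): replace the body with a direct PDFBox text
        // extraction of the same file.
        Timed<String> viaPdfBox = time(() -> "text extracted through PDFBox");

        System.out.println("Tika:   " + viaTika.millis + " ms");
        System.out.println("PDFBox: " + viaPdfBox.millis + " ms");
    }
}
```

If both timings come out similar on the attached file, that would support 
raising it on the PDFBox list rather than treating it as a Tika problem.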

> Parsing (extracting content) a single 5Mb pdf file takes 3minutes
> -----------------------------------------------------------------
>
>                 Key: TIKA-1213
>                 URL: https://issues.apache.org/jira/browse/TIKA-1213
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: I guess not relevant (except for the pdf file)
> + Win7 (8G memory)
> + java 6
> + Tika 1.5 (and 1.5 snapshot)
>            Reporter: Clemens Wyss
>            Priority: Critical
>         Attachments: takes3mins.pdf
>
>
> When I parse the attached PDF (extracting all of its content for Lucene), the 
> extraction takes 3 minutes. The problem is closely tied to this particular 
> file, though I have others that misbehave in the same way.
> My (unit testing) code looks like this:
> ...
> Metadata metadata = new Metadata();
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> returnValue = handler.toString();
> ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
