[ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Clemens Wyss updated TIKA-1213:
-------------------------------
    Attachment: takes3mins.pdf

5Mb pdf that takes 3 minutes to be parsed...

> Parsing (extracting content) a single 5Mb pdf file takes 3minutes
> -----------------------------------------------------------------
>
>                 Key: TIKA-1213
>                 URL: https://issues.apache.org/jira/browse/TIKA-1213
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: I guess not relevant (except for the pdf file)
> + Win7 (8G memory)
> + java 6
> + jira 1.5 (and 1.5 snapshot)
>            Reporter: Clemens Wyss
>            Priority: Critical
>         Attachments: takes3mins.pdf
>
>
> When I parse (extract all its content for Lucene) the attached pdf, the
> extraction takes 3minutes. This is very much related to this very file. I
> have others that misbehave alike, though
> My (unit testing) code looks alike:
> ...
> Metadata metadata = new Metadata();
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> returnValue = handler.toString();
> ...

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)