[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
    [ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856131#comment-13856131 ]

Clemens Wyss commented on TIKA-1213:
------------------------------------

I guess, yes

> Parsing (extracting content) a single 5Mb pdf file takes 3minutes
> -----------------------------------------------------------------
>
>                 Key: TIKA-1213
>                 URL: https://issues.apache.org/jira/browse/TIKA-1213
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: I guess not relevant (except for the pdf file)
> + Win7 (8G memory)
> + java 6
> + jira 1.5 (and 1.5 snapshot)
>            Reporter: Clemens Wyss
>            Priority: Critical
>         Attachments: takes3mins.pdf
>
>
> When I parse the attached PDF (extracting all of its content for Lucene), the
> extraction takes 3 minutes. The problem is specific to this file, though I
> have others that misbehave in the same way.
> My (unit testing) code looks like this:
> ...
>     Metadata metadata = new Metadata();
>     Parser parser = new AutoDetectParser();
>     // -1 disables the handler's write limit
>     ContentHandler handler = new BodyContentHandler( -1 );
>     ParseContext context = new ParseContext();
>     context.set( Parser.class, parser );
>     parser.parse( is, handler, metadata, context );
>     returnValue = handler.toString();
> ...

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
    [ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855679#comment-13855679 ]

Clemens Wyss commented on TIKA-1213:
------------------------------------

-> https://issues.apache.org/jira/browse/PDFBOX-1821
[jira] [Comment Edited] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
    [ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855460#comment-13855460 ]

Clemens Wyss edited comment on TIKA-1213 at 12/23/13 7:20 AM:
-------------------------------------------------------------

thanks for the hint/advice ;), just contacted the pdfbox-mailinglist. Should/can we move this issue over to the pdfbox-issues?

      was (Author: clemensdev):
    thanks for the hint/advice ;), just sent my contacted the pdfbox-mailinglist. Should/can we move this issue over to the pdfbox-issues?
[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
    [ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855460#comment-13855460 ]

Clemens Wyss commented on TIKA-1213:
------------------------------------

thanks for the hint/advice ;), just sent my contacted the pdfbox-mailinglist. Should/can we move this issue over to the pdfbox-issues?
[jira] [Updated] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
    [ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clemens Wyss updated TIKA-1213:
-------------------------------

    Attachment: takes3mins.pdf

5Mb pdf that takes 3 minutes to be parsed...
[jira] [Created] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
Clemens Wyss created TIKA-1213:
-------------------------------

             Summary: Parsing (extracting content) a single 5Mb pdf file takes 3minutes
                 Key: TIKA-1213
                 URL: https://issues.apache.org/jira/browse/TIKA-1213
             Project: Tika
          Issue Type: Bug
          Components: parser
         Environment: I guess not relevant (except for the pdf file)
+ Win7 (8G memory)
+ java 6
+ jira 1.5 (and 1.5 snapshot)
            Reporter: Clemens Wyss
            Priority: Critical


When I parse the attached PDF (extracting all of its content for Lucene), the
extraction takes 3 minutes. The problem is specific to this file, though I
have others that misbehave in the same way.
My (unit testing) code looks like this:
...
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    // -1 disables the handler's write limit
    ContentHandler handler = new BodyContentHandler( -1 );
    ParseContext context = new ParseContext();
    context.set( Parser.class, parser );
    parser.parse( is, handler, metadata, context );
    returnValue = handler.toString();
...
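
To put a number on the reported delay, the extraction call can be wrapped in a small timing helper. The sketch below is not part of the original report: `ExtractionTimer`, `Timed`, and `time` are hypothetical names, and a stand-in `Callable` replaces the Tika call so that the example runs with the JDK alone. In a real test, the body of `call()` would contain the `parser.parse( ... )` snippet above and return `handler.toString()`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class ExtractionTimer {

    /** Result of one timed run: the extracted text and the elapsed wall-clock time. */
    public static final class Timed {
        public final String text;
        public final long elapsedMillis;

        Timed(String text, long elapsedMillis) {
            this.text = text;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the given extraction once and measures how long it takes. */
    public static Timed time(Callable<String> extraction) throws Exception {
        long start = System.nanoTime();
        String text = extraction.call();
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Timed(text, elapsed);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the Tika parse; an assumption made so the sketch is self-contained.
        Timed result = time(new Callable<String>() {
            @Override
            public String call() {
                return "dummy text";
            }
        });
        System.out.println(result.text.length() + " chars in " + result.elapsedMillis + " ms");
    }
}
```

Timing the same helper once with `AutoDetectParser` and once with the PDF parser invoked directly would help show whether the time is spent in Tika itself or in the underlying PDF library.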