[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-23 Thread Clemens Wyss (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856131#comment-13856131
 ] 

Clemens Wyss commented on TIKA-1213:


I guess, yes

> Parsing (extracting content) a single 5Mb pdf file takes 3minutes
> -
>
> Key: TIKA-1213
> URL: https://issues.apache.org/jira/browse/TIKA-1213
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: I guess not relevant (except for the pdf file)
> + Win7 (8G memory)
> + java 6
> + jira 1.5 (and 1.5 snapshot)
>Reporter: Clemens Wyss
>Priority: Critical
> Attachments: takes3mins.pdf
>
>
> When I parse the attached pdf (extracting all of its content for Lucene), the 
> extraction takes 3 minutes. The slowness is tied to this particular file, 
> though I have others that misbehave in the same way.
> My (unit testing) code looks like this:
> ...
> Metadata metadata = new Metadata();
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> returnValue = handler.toString();
> ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-23 Thread Clemens Wyss (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855679#comment-13855679
 ] 

Clemens Wyss commented on TIKA-1213:


-> https://issues.apache.org/jira/browse/PDFBOX-1821 



[jira] [Comment Edited] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-22 Thread Clemens Wyss (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855460#comment-13855460
 ] 

Clemens Wyss edited comment on TIKA-1213 at 12/23/13 7:20 AM:
--

thanks for the hint/advice ;), just contacted the pdfbox-mailinglist.

Should/can we move this issue over to the pdfbox-issues?


was (Author: clemensdev):
thanks for the hint/advice ;), just sent my contacted the pdfbox-mailinglist.

Should/can we move this issue over to the pdfbox-issues?



[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-22 Thread Clemens Wyss (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855460#comment-13855460
 ] 

Clemens Wyss commented on TIKA-1213:


thanks for the hint/advice ;), just contacted the pdfbox-mailinglist.

Should/can we move this issue over to the pdfbox-issues?



[jira] [Updated] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-22 Thread Clemens Wyss (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clemens Wyss updated TIKA-1213:
---

Attachment: takes3mins.pdf

5Mb pdf that takes 3 minutes to be parsed...



[jira] [Created] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-22 Thread Clemens Wyss (JIRA)
Clemens Wyss created TIKA-1213:
--

 Summary: Parsing (extracting content) a single 5Mb pdf file takes 
3minutes
 Key: TIKA-1213
 URL: https://issues.apache.org/jira/browse/TIKA-1213
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: I guess not relevant (except for the pdf file)
+ Win7 (8G memory)
+ java 6
+ jira 1.5 (and 1.5 snapshot)
Reporter: Clemens Wyss
Priority: Critical


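When I parse the attached pdf (extracting all of its content for Lucene), the 
extraction takes 3 minutes. The slowness is tied to this particular file, 
though I have others that misbehave in the same way.

My (unit testing) code looks like this:
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
...
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
// -1 disables BodyContentHandler's write limit
ContentHandler handler = new BodyContentHandler( -1 );
ParseContext context = new ParseContext();
// register the parser so that embedded documents are parsed as well
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context ); // "is" is the pdf's InputStream
returnValue = handler.toString();
...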
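
A minimal sketch, assuming the goal is to keep one slow file from stalling a whole indexing run: the extraction can be run on a worker thread and bounded with a timeout. The class name TimedExtraction, the method extractWithTimeout and the stand-in Callable in main are hypothetical and not taken from the report; in real use the Callable would wrap the parser.parse(...) call shown above. Only java.util.concurrent is used, nothing from Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExtraction {

    /**
     * Runs the given extraction task on a worker thread and waits at most
     * timeoutMillis for its result. Hypothetical helper, not a Tika API.
     */
    public static String extractWithTimeout(Callable<String> extraction, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(extraction);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Cancelling only interrupts the worker thread; a task that
            // ignores interrupts will keep running in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        // Stand-in for the real work: the Tika parse call would go here.
        String text = extractWithTimeout(new Callable<String>() {
            @Override
            public String call() {
                return "extracted text";
            }
        }, 30000L);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(text.length() + " chars extracted in " + elapsedMs + " ms");
    }
}
```

The timeout only limits how long the caller waits. Whether the parse itself stops depends on the parser reacting to thread interruption, which this sketch does not guarantee.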


