[jira] [Created] (TIKA-1214) Infinity Loop in Mpeg Stream

2013-12-23 Thread Georg Hartmann (JIRA)
Georg Hartmann created TIKA-1214:


 Summary: Infinity Loop in Mpeg Stream
 Key: TIKA-1214
 URL: https://issues.apache.org/jira/browse/TIKA-1214
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: local system
Reporter: Georg Hartmann
 Fix For: 1.5


Scanning MP3 files runs into an infinite loop in the MpegStream method skipStream.

The call to in.skip returns zero, so the loop never ends.

A simple fix using a zero counter is below:

private static void skipStream(InputStream in, long count) throws IOException {
    long size = count;
    long skipped = 0;
    // Allow a few zero-length skips before giving up and breaking the loop
    int zeroCount = 5;
    while (size > 0 && skipped >= 0) {
        skipped = in.skip(size);
        if (skipped != -1) {
            size -= skipped;
        }

        // Count zero-length skips so the loop cannot run forever
        if (skipped == 0) {
            zeroCount--;
        }
        if (zeroCount < 0) {
            break;
        }
    }
}
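
For comparison, a common alternative to counting zero-length skips is to fall back to a single-byte read whenever skip makes no progress. A read either consumes a byte or reports end of stream, so the loop always advances or stops. The sketch below is only an illustration under that assumption; SkipSketch, skipFully and ZeroSkipStream are hypothetical names and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration only; not part of Tika. */
final class SkipSketch {

    /**
     * Skips up to count bytes and returns how many were actually skipped.
     * The result is smaller than count only when the stream ended early.
     */
    static long skipFully(InputStream in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() made no progress: read one byte to tell a lazy
            // stream apart from a stream that has really ended.
            if (in.read() == -1) {
                break; // end of stream
            }
            remaining--;
        }
        return count - remaining;
    }

    /** Stream whose skip() always returns 0, mimicking the reported behaviour. */
    static final class ZeroSkipStream extends ByteArrayInputStream {
        ZeroSkipStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized long skip(long n) {
            return 0;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ZeroSkipStream(new byte[] {1, 2, 3, 4, 5});
        System.out.println(skipFully(in, 3));  // prints 3
        System.out.println(in.read());         // prints 4
        System.out.println(skipFully(in, 10)); // prints 1, only one byte was left
    }
}
```

Compared with the counter approach, this variant does not give up on a stream that legitimately returns zero from skip several times in a row, at the cost of reading byte by byte in that case.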



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-12-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855528#comment-13855528
 ] 

Hong-Thai Nguyen commented on TIKA-1152:


[~gagravarr] or anyone else: could you please have a look at the patch and 
integrate it into trunk before the 1.5 release?
Thanks

 Process loops infinitely on parsing of a CHM file
 -

 Key: TIKA-1152
 URL: https://issues.apache.org/jira/browse/TIKA-1152
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5

 Attachments: ChmLzxBlock.java.patch, eventcombmt.chm


 When parsing [the attached CHM file|^eventcombmt.chm] (a Microsoft Help 
 file), the Java process gets stuck.
 {code}
 Thread[main,5,main]
   org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
   org.apache.tika.parser.chm.lzx.ChmLzxBlock.init(ChmLzxBlock.java:77)
   org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
   org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
   org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
   com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
 ...
 {code}
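
 Until the loop itself is fixed, a calling application could bound how long it waits for a parse by running it on a worker thread and giving up after a timeout. The following is only a sketch of that general idea using java.util.concurrent; TimeoutSketch and callWithTimeout are hypothetical names, and nothing here is taken from Tika or from the attached patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of Tika. */
final class TimeoutSketch {

    /**
     * Runs the task on a daemon worker thread and waits at most the given
     * time for its result. Throws TimeoutException if the task is too slow.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true); // a stuck worker must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } finally {
            // Only interrupts the worker; code that never checks the
            // interrupt flag (such as a tight loop) keeps running.
            future.cancel(true);
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithTimeout(() -> "done", 1, TimeUnit.SECONDS));
        try {
            callWithTimeout(() -> {
                Thread.sleep(60_000); // stands in for a parse that hangs
                return "late";
            }, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

 Note that the timeout only frees the caller. A worker stuck in a loop that ignores interruption continues to consume CPU, so this is a containment measure and not a fix for the underlying bug.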





[jira] [Commented] (TIKA-1214) Infinity Loop in Mpeg Stream

2013-12-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855530#comment-13855530
 ] 

Nick Burch commented on TIKA-1214:
--

Can you try with a recent nightly build? We made an MPEG-related fix fairly 
recently (the issue number escapes me right now), so it would be worth checking 
whether this is already fixed.






[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-23 Thread Clemens Wyss (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855679#comment-13855679
 ] 

Clemens Wyss commented on TIKA-1213:


- https://issues.apache.org/jira/browse/PDFBOX-1821 

 Parsing (extracting content) a single 5Mb pdf file takes 3minutes
 -

 Key: TIKA-1213
 URL: https://issues.apache.org/jira/browse/TIKA-1213
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: I guess not relevant (except for the pdf file)
 + Win7 (8G memory)
 + java 6
 + jira 1.5 (and 1.5 snapshot)
Reporter: Clemens Wyss
Priority: Critical
 Attachments: takes3mins.pdf


 When I parse the attached PDF (extracting all of its content for Lucene), the 
 extraction takes 3 minutes. This is specific to this particular file, though I 
 have others that misbehave similarly.
 My (unit testing) code looks like this:
 ...
 Metadata metadata = new Metadata();
 Parser parser = new AutoDetectParser();
 ContentHandler handler = new BodyContentHandler( -1 );
 ParseContext context = new ParseContext();
 context.set( Parser.class, parser );
 parser.parse( is, handler, metadata, context );
 returnValue = handler.toString();
 ...





[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-23 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855922#comment-13855922
 ] 

Ken Krugler commented on TIKA-1213:
---

Hi Clemens - thanks for creating the issue over in PDFBox-land. So do you think 
this issue can be closed?






[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

2013-12-23 Thread Clemens Wyss (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856131#comment-13856131
 ] 

Clemens Wyss commented on TIKA-1213:


I guess, yes






[jira] [Commented] (TIKA-93) OCR support

2013-12-23 Thread frank (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856214#comment-13856214
 ] 

frank commented on TIKA-93:
---

This feature is really useful and helpful.

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Priority: Minor

 I don't know of any decent open source pure Java OCR libraries, but there are 
 command line OCR tools like Tesseract 
 (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
 extract text content (where available) from image files.


