[jira] [Created] (TIKA-1214) Infinity Loop in Mpeg Stream
Georg Hartmann created TIKA-1214:

Summary: Infinity Loop in Mpeg Stream
Key: TIKA-1214
URL: https://issues.apache.org/jira/browse/TIKA-1214
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: local system
Reporter: Georg Hartmann
Fix For: 1.5

Scanning MP3 files runs into an infinite loop in the MpegStream method skipStream. The call to in.skip returns zero, so the loop never ends. A simple fix using a zero counter is below:

private static void skipStream(InputStream in, long count) throws IOException {
    long size = count;
    long skipped = 0;
    // Allow a few zero-length skips before treating it as an error and breaking the loop
    int zeroCount = 5;
    while (size > 0 && skipped >= 0) {
        skipped = in.skip(size);
        if (skipped != -1) {
            size -= skipped;
        }
        // Count zero-length skips so the loop cannot run forever
        if (skipped == 0) {
            zeroCount--;
        }
        if (zeroCount < 0) {
            break;
        }
    }
}

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
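The zero counter proposed in the report stops the loop, but it also gives up on streams that are still readable and merely decline to skip. As a hedged alternative sketch (not the fix that was applied in Tika), one common pattern is to fall back to read() whenever skip() makes no progress, so that end of stream can be told apart from a reluctant stream. The class name SkipFully and the method name skipFully are hypothetical and chosen only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SkipFully {

    /**
     * Skips up to count bytes and returns how many were actually skipped.
     * Hypothetical helper: when skip() reports no progress, a single byte is
     * read instead, which either advances the stream or reveals end of stream.
     */
    static long skipFully(InputStream in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() returned 0 (or a negative value): probe with read()
            if (in.read() == -1) {
                break; // end of stream, nothing left to skip
            }
            remaining--;
        }
        return count - remaining;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[16];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        // A stream whose skip() never makes progress, like the one in the report
        InputStream stubborn = new ByteArrayInputStream(data) {
            @Override
            public long skip(long n) {
                return 0;
            }
        };
        System.out.println("skipped: " + skipFully(stubborn, 10));
        System.out.println("next byte: " + stubborn.read());
    }
}
```

With this approach the loop always terminates: each iteration either consumes at least one byte or hits end of stream. The cost is a byte-by-byte crawl on streams that never skip, which may be acceptable for short skips but slow for large ones.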
[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file
[ https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855528#comment-13855528 ]

Hong-Thai Nguyen commented on TIKA-1152:

[~gagravarr] or anyone else, could you please have a look at the patch and integrate it into trunk before the 1.5 release? Thanks.

Process loops infinitely on parsing of a CHM file

Key: TIKA-1152
URL: https://issues.apache.org/jira/browse/TIKA-1152
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
Fix For: 1.5
Attachments: ChmLzxBlock.java.patch, eventcombmt.chm

When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help file), the Java process gets stuck.
{code}
Thread[main,5,main]
org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
...
{code}
[jira] [Commented] (TIKA-1214) Infinity Loop in Mpeg Stream
[ https://issues.apache.org/jira/browse/TIKA-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855530#comment-13855530 ]

Nick Burch commented on TIKA-1214:

Can you try with a recent nightly build? We made an MPEG-related fix fairly recently (the issue number escapes me right now), so it would be worth checking whether this is already fixed.

Infinity Loop in Mpeg Stream

Key: TIKA-1214
URL: https://issues.apache.org/jira/browse/TIKA-1214
[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
[ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855679#comment-13855679 ]

Clemens Wyss commented on TIKA-1213:

https://issues.apache.org/jira/browse/PDFBOX-1821

Parsing (extracting content) a single 5Mb pdf file takes 3minutes

Key: TIKA-1213
URL: https://issues.apache.org/jira/browse/TIKA-1213
Project: Tika
Issue Type: Bug
Components: parser
Environment: I guess not relevant (except for the pdf file) + Win7 (8G memory) + java 6 + jira 1.5 (and 1.5 snapshot)
Reporter: Clemens Wyss
Priority: Critical
Attachments: takes3mins.pdf

When I parse the attached PDF (extracting all of its content for Lucene), the extraction takes 3 minutes. This seems to be specific to this particular file, although I have others that misbehave in a similar way.

My (unit testing) code looks like this:

...
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
// -1 disables the write limit of the body content handler
ContentHandler handler = new BodyContentHandler( -1 );
ParseContext context = new ParseContext();
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context );
returnValue = handler.toString();
...
[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
[ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855922#comment-13855922 ]

Ken Krugler commented on TIKA-1213:

Hi Clemens - thanks for creating the issue over in PDFBox-land. So do you think this issue can be closed?

Parsing (extracting content) a single 5Mb pdf file takes 3minutes

Key: TIKA-1213
URL: https://issues.apache.org/jira/browse/TIKA-1213
[jira] [Commented] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes
[ https://issues.apache.org/jira/browse/TIKA-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856131#comment-13856131 ]

Clemens Wyss commented on TIKA-1213:

I guess, yes

Parsing (extracting content) a single 5Mb pdf file takes 3minutes

Key: TIKA-1213
URL: https://issues.apache.org/jira/browse/TIKA-1213
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856214#comment-13856214 ]

frank commented on TIKA-93:

This feature is really useful and helpful.

OCR support

Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Priority: Minor

I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
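To make the idea in the TIKA-93 description concrete, here is a minimal sketch of invoking a command line OCR tool from Java. It is an illustration only and not how Tika implements or plans to implement OCR. It assumes a tesseract executable is on the PATH and relies on the basic "tesseract <image> <outputbase>" form, which writes the recognized text to <outputbase>.txt. The class name TesseractOcr and its methods are hypothetical; the code needs Java 11 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class TesseractOcr {

    /** Builds the command line: tesseract <image> <outputBase> (text goes to <outputBase>.txt). */
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs the external tool (assumed to be installed) and returns the recognized text. */
    static String ocr(Path image) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path outputBase = dir.resolve("result");
        Path textFile = dir.resolve("result.txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .redirectErrorStream(true)
                    // Discard console output so the child process cannot block on a full pipe
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("tesseract exited with status " + exit);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary output file and directory
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractOcr.java <image-file>");
            return;
        }
        System.out.println(ocr(Path.of(args[0])));
    }
}
```

Shelling out like this keeps the Java side free of native dependencies, at the price of requiring the tool to be installed separately and paying process start-up cost per image.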