[ https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587284#comment-13587284 ]
Ken Krugler commented on TIKA-1089:
-----------------------------------

Hi Hong-Thai,

I took a quick look at crawler.log (thanks for attaching that file), and these are failures thrown by the underlying parsing libraries used by Tika. For example:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 140
	at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:405)
	at org.apache.poi.hslf.usermodel.SlideShow.<init>(SlideShow.java:109)
	at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:51)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:189)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)

This exception is thrown by what I assume is POI's PowerPoint parser.

This means you'd want to file issues against the various projects that Tika uses. I'll leave it to others on the list who are more familiar with POI, PDFBox, etc. to provide specific guidance.

> Tika conversion failed on following documents
> ---------------------------------------------
>
>                 Key: TIKA-1089
>                 URL: https://issues.apache.org/jira/browse/TIKA-1089
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>         Environment: windows, api
>            Reporter: Hong-Thai Nguyen
>              Labels: test
>         Attachments: crawler.log
>
>
> We are using Tika as our major converter of diverse file formats to text and HTML versions in a Search Engine.
> We've collected some documents (46) which Tika cannot convert:
> http://www.mediafire.com/?60clr812lerx3gy
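To sort the failures in a log like crawler.log by the project that should receive the bug report, one option is to walk each exception's cause chain and look at the package of the frames in the innermost cause. The following is only a rough sketch of that idea, not anything Tika provides: the class name FailureOrigin, the method names, and the prefix-to-project table are all hypothetical, and the table is deliberately incomplete.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Guesses which project a parse failure originated in (hypothetical helper, not part of Tika). */
final class FailureOrigin {

    // Package prefix -> project name. Illustrative only; extend it for other parser libraries.
    private static final Map<String, String> PROJECTS = new LinkedHashMap<>();
    static {
        PROJECTS.put("org.apache.poi.", "Apache POI");
        PROJECTS.put("org.apache.pdfbox.", "Apache PDFBox");
        PROJECTS.put("org.apache.tika.", "Apache Tika");
    }

    /** Follows getCause() down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Returns the project owning the first frame of the root cause that matches
     * a known package prefix, or "unknown" if no frame matches.
     */
    static String owningProject(Throwable t) {
        for (StackTraceElement frame : rootCause(t).getStackTrace()) {
            for (Map.Entry<String, String> entry : PROJECTS.entrySet()) {
                if (frame.getClassName().startsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Synthetic exception mirroring the top frames of the trace quoted above.
        Throwable root = new ArrayIndexOutOfBoundsException("140");
        root.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("org.apache.poi.hslf.usermodel.SlideShow",
                    "buildSlidesAndNotes", "SlideShow.java", 405),
            new StackTraceElement("org.apache.tika.parser.microsoft.HSLFExtractor",
                    "parse", "HSLFExtractor.java", 51)
        });
        Exception wrapper = new RuntimeException("parse failed", root);
        System.out.println(owningProject(wrapper)); // prints "Apache POI"
    }
}
```

Scanning frames from the top, rather than reading only the first one, keeps the guess useful when the exception is raised inside a JDK class called by the parser. It remains a heuristic: a Tika bug can still surface as an exception inside POI or PDFBox, so the result is a starting point for triage rather than a verdict.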