[ 
https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587284#comment-13587284
 ] 

Ken Krugler commented on TIKA-1089:
-----------------------------------

Hi Hong-Thai,

I took a quick look at crawler.log (thanks for attaching that file), and these 
are failures thrown by the underlying parsing libraries used by Tika. For 
example:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 140
        at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:405)
        at org.apache.poi.hslf.usermodel.SlideShow.<init>(SlideShow.java:109)
        at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:51)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:189)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)

This exception is thrown from POI's HSLF code, which I assume is its PowerPoint parser.

This means you'd want to file issues against the underlying projects that 
Tika uses (POI, in this example).
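
To sort a log full of failures by the project that should receive each report, 
one option is to look at the root cause of each exception and match the package 
names in its stack trace. The following is only a rough sketch, not part of 
Tika: the class FailureOrigin, its methods, and the prefix-to-project table are 
hypothetical names made up for illustration, and the demo rebuilds the top of 
the trace above by hand so that it runs without the POI jar on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: guesses which library a parse failure originated in. */
class FailureOrigin {

    // Assumed mapping from package prefix to project. The POI and Tika prefixes
    // come from the stack trace above; the PDFBox prefix is an added guess.
    private static final Map<String, String> LIBRARIES = new LinkedHashMap<>();
    static {
        LIBRARIES.put("org.apache.poi.", "Apache POI");
        LIBRARIES.put("org.apache.pdfbox.", "Apache PDFBox");
        LIBRARIES.put("org.apache.tika.", "Apache Tika");
    }

    /** Follows getCause() down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Returns the project owning the first frame of the root cause that
     * matches a known prefix, or "unknown" if no frame matches.
     */
    static String originOf(Throwable t) {
        for (StackTraceElement frame : rootCause(t).getStackTrace()) {
            for (Map.Entry<String, String> e : LIBRARIES.entrySet()) {
                if (frame.getClassName().startsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Recreate the top two frames of the "Caused by" section from crawler.log.
        Throwable cause = new ArrayIndexOutOfBoundsException("140");
        cause.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("org.apache.poi.hslf.usermodel.SlideShow",
                    "buildSlidesAndNotes", "SlideShow.java", 405),
            new StackTraceElement("org.apache.poi.hslf.usermodel.SlideShow",
                    "<init>", "SlideShow.java", 109),
        });
        // Stand-in for whatever outer exception wrapped the failure in the log.
        Throwable wrapped = new RuntimeException("wrapper", cause);
        System.out.println(originOf(wrapped)); // prints "Apache POI"
    }
}
```

Scanning every frame rather than only the first one is deliberate: an exception 
raised inside a JDK class would otherwise come back as "unknown" even when the 
caller one frame up belongs to POI or PDFBox.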

I'll leave it to others on the list who are more familiar with POI, PDFBox, 
etc. to provide specific guidance.
                
> Tika conversion failed on following documents
> ---------------------------------------------
>
>                 Key: TIKA-1089
>                 URL: https://issues.apache.org/jira/browse/TIKA-1089
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>         Environment: windows, api
>            Reporter: Hong-Thai Nguyen
>              Labels: test
>         Attachments: crawler.log
>
>
> We are using Tika as our main converter of diverse file formats to text and 
> HTML versions in a search engine.
> We've collected some documents (46) which Tika cannot convert: 
> http://www.mediafire.com/?60clr812lerx3gy
