[ https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504567#comment-13504567 ]
Nick Burch commented on TIKA-1033:
----------------------------------

Looks like the WindowOneRecord isn't the size that POI expects it to be. Do you know the origin of the file, was it produced by Office or something else? And can you try running the Microsoft Binary File Format Validator tool against it to see if it's actually a valid .xls file or not?

Assuming it's a valid file produced by Office, you'll then want to report a POI bug. If it's not a valid file and comes from elsewhere, you'll need to report a bug in the program used to generate the file...

> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>
>                 Key: TIKA-1033
>                 URL: https://issues.apache.org/jira/browse/TIKA-1033
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: emb.ppt
>
>
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
> we seem to detect it as Excel (application/vnd.ms-excel) but then the
> ExcelExtractor hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
>     at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
>     at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, when you
> run TikaCLI you won't see any exception nor text extracted, but if you
> run with -z, it will save 1.xls which if you then try to parse with
> TikaCLI hits the above exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
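
To make the mis-identification described in the issue concrete, here is a rough sketch of how an embedded object's progID could be used to tell a chart apart from a real spreadsheet before anything is handed to the Excel extractor. This is not Tika or POI code: the class `ProgIdSketch`, the `Kind` enum and the `classify` method are hypothetical names made up for illustration, and the only progID taken from the report is MSGraph.Chart.8 (Excel.Sheet.8 is assumed here as the usual progID for an embedded workbook).

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika or POI: classifies an OLE progID
// so that MSGraph charts are not treated as Excel workbooks.
class ProgIdSketch {

    enum Kind { EXCEL_SHEET, GRAPH_CHART, UNKNOWN }

    // Compare case-insensitively on the prefix, ignoring the trailing
    // version number (for example the ".8" in "MSGraph.Chart.8").
    static Kind classify(String progId) {
        if (progId == null) {
            return Kind.UNKNOWN;
        }
        String id = progId.toLowerCase(Locale.ROOT);
        if (id.startsWith("msgraph.chart")) {
            return Kind.GRAPH_CHART;
        }
        if (id.startsWith("excel.sheet")) {
            return Kind.EXCEL_SHEET;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        // The progID reported in the issue, followed by an assumed workbook progID.
        System.out.println(classify("MSGraph.Chart.8"));
        System.out.println(classify("Excel.Sheet.8"));
    }
}
```

Whether a check like this belongs in the extractor or in type detection is left open; the sketch only shows that the progID quoted in the report carries enough information to avoid routing the chart to the Excel code path.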