[
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504570#comment-13504570
]
Michael McCandless commented on TIKA-1033:
------------------------------------------
I think emb.ppt was explicitly created as a test case, but not by me ... I'll
see if I can get the details.
OK, I just ran the attached emb.ppt through the Microsoft Office Binary File
Format Validator and it passed. But when I run the validator on 1.xls (which
TikaCLI -z had saved from the embedded chart), it fails with this message:
{noformat}
BFFValidator: "x:\tmp\1.xls" NOT RECOGNIZED (The Microsoft Office Binary File
Format Validator encountered an error reading the file you specified, OR The
Microsoft Office Binary File Format Validator supports Word, Excel, and
PowerPoint binary file formats only. The file you specified is an unsupported
file type.) at 11/27/12 07:23:58
{noformat}
It sounds like the tool doesn't expect to get a "raw" chart object (Tika is
mis-identifying this embedded chart object as XLS and saving it as 1.xls).
Either that, or Tika somehow saved the wrong bytes when it extracted the
embedded chart object.
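One rough way to narrow down those two possibilities is to check whether the
saved 1.xls at least begins with the OLE2 compound file signature. If it does
not, the extraction itself probably wrote the wrong bytes; if it does, the
container is plausible and the problem is more likely what is inside it. This
is only a sketch: the class and method names are made up for illustration and
are not part of Tika or POI, it assumes Java 11 or later, and a matching
signature says nothing about whether the contents are a valid workbook.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Tika or POI.
public class Ole2SignatureCheck {
    // The 8-byte signature at the start of every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the given bytes start with the OLE2 signature.
    static boolean hasOle2Signature(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads only the first 8 bytes of the file and checks them.
    static boolean hasOle2Signature(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasOle2Signature(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2SignatureCheck <file>");
            return;
        }
        Path file = Path.of(args[0]);
        System.out.println(file + " has OLE2 signature: " + hasOle2Signature(file));
    }
}
```

Running it against the file saved by TikaCLI -z would tell us whether we are
looking at an OLE2 container at all before digging into its streams.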
> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>
> Key: TIKA-1033
> URL: https://issues.apache.org/jira/browse/TIKA-1033
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: emb.ppt
>
>
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
> we seem to detect it as Excel (application/vnd.ms-excel) but then the
> ExcelExtractor hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
>     at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
>     at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
>     at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
>     at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>     at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
>     at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, running
> TikaCLI shows neither an exception nor any extracted text. But if you
> run with -z, it saves 1.xls, and parsing that file with TikaCLI hits
> the above exception.
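
Since the description says the progID is already available where the embedded
resource is handled, one possible stopgap would be to look at it before
treating the object as an Excel workbook. The helper below is a hypothetical
sketch of that check only. It is not Tika's or POI's API, and matching on the
"MSGraph.Chart" prefix is an assumption based on the single progID reported
here (MSGraph.Chart.8).

```java
// Hypothetical helper for illustration; not part of Tika or POI.
public class ProgIdCheck {
    // Assumption: embedded MS Graph charts report a progID that starts
    // with "MSGraph.Chart", as seen in this issue (MSGraph.Chart.8).
    static boolean isGraphChart(String progId) {
        return progId != null && progId.startsWith("MSGraph.Chart");
    }

    public static void main(String[] args) {
        System.out.println(isGraphChart("MSGraph.Chart.8")); // prints true
        System.out.println(isGraphChart("Excel.Sheet.8"));   // prints false
    }
}
```

A caller could use such a check to skip the Excel parser for chart objects, or
to route them elsewhere, rather than letting the parse fail silently.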