[ 
https://issues.apache.org/jira/browse/TIKA-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504673#comment-13504673
 ] 

Michael McCandless commented on TIKA-1033:
------------------------------------------

bq. The "raw chart object" looks to actually be an excel file, 

Hmm, so now I'm very confused :)  Did something go wrong when Tika extracted 
the embedded bytes from emb.ppt to create 1.xls?  When I try to open 1.xls in 
Excel, it fails with "Cannot open Microsoft Graph chart gallery files.".

bq. Note that embedded objects in office files are actually stored as the raw 
object (used for editing), and a rendered version of the file (so that viewing 
the parent document is quick, normally an EMF)

Yeah, I do see the *.emf files being extracted separately by TikaCLI.
                
> Tika doesn't parse embedded OLE Chart/Graph objects
> ---------------------------------------------------
>
>                 Key: TIKA-1033
>                 URL: https://issues.apache.org/jira/browse/TIKA-1033
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: emb.ppt
>
>
> I have an example ppt that embeds a chart, but Tika mis-identifies it
> as an XLS document.
> The progID (oleShape.getProgID() in
> HSLFExtractor.handleSlideEmbeddedResources) is MSGraph.Chart.8 ... and
> we seem to detect it as Excel (application/vnd.ms-excel) but then the
> ExcelExtractor hits this exception:
> {noformat}
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record instance
>       at org.apache.poi.hssf.record.RecordFactory$ReflectionConstructorRecordCreator.create(RecordFactory.java:65)
>       at org.apache.poi.hssf.record.RecordFactory.createSingleRecord(RecordFactory.java:301)
>       at org.apache.poi.hssf.record.RecordFactoryInputStream.readNextRecord(RecordFactoryInputStream.java:285)
>       at org.apache.poi.hssf.record.RecordFactoryInputStream.nextRecord(RecordFactoryInputStream.java:251)
>       at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:143)
>       at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>       at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:302)
>       at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:147)
> {noformat}
> Since DelegatingParser silently suppresses all exceptions, running
> TikaCLI shows neither an exception nor any extracted text. If you
> run with -z, however, it saves 1.xls, and parsing that file with
> TikaCLI hits the above exception.
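
To illustrate the progID mismatch described in the issue, here is a minimal, self-contained sketch of telling an MS Graph chart apart from a real Excel sheet before handing the bytes to an Excel parser. It is not Tika's or POI's code: the class ProgIdHint, the enum EmbeddedKind, and the method classify are hypothetical names, and the "Excel.Sheet" prefix is an assumption about how genuine embedded worksheets identify themselves. Only "MSGraph.Chart.8" comes from the report above.

```java
import java.util.Locale;

public class ProgIdHint {

    /** Hypothetical classification of an embedded OLE object, for illustration only. */
    public enum EmbeddedKind { MS_GRAPH_CHART, EXCEL_WORKBOOK, UNKNOWN }

    /**
     * Classifies an embedded object by its OLE progID string.
     * The prefixes are assumptions; the version suffix (e.g. ".8") is ignored.
     */
    public static EmbeddedKind classify(String progId) {
        if (progId == null) {
            return EmbeddedKind.UNKNOWN;
        }
        String id = progId.trim().toLowerCase(Locale.ROOT);
        if (id.startsWith("msgraph.chart")) {
            // The case from this issue: it looks like Excel but is not a normal workbook.
            return EmbeddedKind.MS_GRAPH_CHART;
        }
        if (id.startsWith("excel.sheet")) {
            return EmbeddedKind.EXCEL_WORKBOOK;
        }
        return EmbeddedKind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("MSGraph.Chart.8"));
        System.out.println(classify("Excel.Sheet.8"));
    }
}
```

A caller could use such a hint to skip the Excel extractor, or to pick a more lenient one, when the progID says the object is a chart rather than a workbook.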
