[ https://issues.apache.org/jira/browse/TIKA-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian McColgan closed TIKA-2588.
--------------------------------

Issue resolved very quickly and effectively by the maestro Tika-developer T.A. Thank you once again, you rock!

> Tika detecting/parsing pptx with embedded Excel worksheet(s)...
> ---------------------------------------------------------------
>
> Key: TIKA-2588
> URL: https://issues.apache.org/jira/browse/TIKA-2588
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.17
> Environment:
> Reporter: Brian McColgan
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.18, 2.0.0
>
> Attachments: foo.out, pptEmbedExcelDoubleClickFromWorkbook.PNG,
> pptEmbedExcelInEmptyWorkbook.PNG, tikaSample.pptx
>
>
> Hello tika-developers,
> First, a big 'thank-you' for creating and maintaining Apache-Tika! A really
> useful capability/service that can be used in so many different ways. You
> folks are the true Debabelizer (h2g2.com).
> On to the issue encountered: using Tika 1.17 to extract an embedded Excel
> object out of a pptx is causing issues. A simple example is attached to this
> Jira issue ([^tikaSample.pptx]); running it against Tika 1.17 (with
> verbose/list-parsers/list-detectors) produces the output in [^foo.out].
> The deck contains a title slide and a single slide with an embedded Excel
> object on it.
> As noted to [~gagravarr] on S-Overflow, I grabbed the unit-test data which
> you use in your parser/office JUnit suite (test_ppt_embedded_two_slides.pptx)
> and tried opening it in Office/PPT 2016. I selected (with the mouse) the
> embedded sheet (it had the Alfresco logo in it) and pasted it into an empty
> Office/Excel 2016 workbook. When I tried to interact with it, I had to
> double-click to make it active. As a result, I ended up with two Excel
> instances on my Windows 10 desktop (the original object in one, the Excel
> worksheet in another). I have included a picture of the embedded Excel object
> pasted into the workbook:
> !pptEmbedExcelInEmptyWorkbook.PNG!
> followed by the worksheet opened inside the workbook (this required a
> double-click within the black-bordered area in the first picture above):
> !pptEmbedExcelDoubleClickFromWorkbook.PNG!
> I managed to extract the embedded object using Apache POI. The logic
> sequence was something like the following:
> # Create an XMLSlideShow object, and pull the list of underlying slide
> entities.
> # Walk the list of XSLFSlide(s), searching for a matching slide (by name) -
> e.g. 'MFL'.
> # Examine the PackagePart of the XSLFSlide (matching name) for its
> content-type.
> # If pPart.content-type is
> 'application/vnd.openxmlformats-officedocument.oleObject' then - 'candidate
> FOUND'.
> # Build a POIFS around the candidate FOUND, extract the root of the
> FileSystem.
> # Verify that the root has entries for \{ 'Package', '\u0001Ole', and
> '\u0001CompObj' }.
> # Extract entry '\u0001CompObj', verify the entry is a DocumentEntry and the
> underlying bytes for the DocumentNode match an 'Excel' signature.
> # If (step 7 is true) -> extract entry 'Package'.
> # The resulting entry represents the byte-stream of the embedded Excel
> entity.
> I was able to instantiate this into a new workbook (as an example) using POI,
> and when I opened it, the worksheet was correctly embedded in that
> 'example.xlsx'.
> I am not as familiar with Tika, so I was a little less comfortable trying to
> walk it through. I thought, however, that recreating this path would provide
> further insight for you.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
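
For readers following the quoted POI sequence, the checks in steps 6 to 8 can be sketched roughly as below. This is only an illustration under stated assumptions, not the reporter's code and not how Tika resolved the issue: the class and method names (`OleExcelCheck`, `hasRequiredEntries`, `compObjLooksLikeExcel`, `shouldExtractPackage`) are hypothetical, and because the report does not say what the 'Excel' signature is, the sketch assumes it means the ASCII text "Excel" appearing somewhere in the CompObj bytes. Opening the OLE2 container and reading the entry names and bytes (steps 1 to 5, done with POI in the report) is deliberately left out so that the example depends only on the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Hypothetical helper: the decision logic of steps 6-8, without any POI calls. */
final class OleExcelCheck {

    // Entry names the report expects in the root of the OLE2 container (step 6).
    static final List<String> REQUIRED_ENTRIES =
            Arrays.asList("Package", "\u0001Ole", "\u0001CompObj");

    // ASSUMPTION: the "Excel signature" is the ASCII text "Excel" in the CompObj stream.
    private static final byte[] EXCEL_MARKER = "Excel".getBytes(StandardCharsets.US_ASCII);

    /** Step 6: true only if all three required entries are present in the root. */
    static boolean hasRequiredEntries(Collection<String> rootEntryNames) {
        return rootEntryNames.containsAll(REQUIRED_ENTRIES);
    }

    /** Step 7: naive byte scan of the CompObj content for the assumed marker. */
    static boolean compObjLooksLikeExcel(byte[] compObj) {
        outer:
        for (int i = 0; i + EXCEL_MARKER.length <= compObj.length; i++) {
            for (int j = 0; j < EXCEL_MARKER.length; j++) {
                if (compObj[i + j] != EXCEL_MARKER[j]) {
                    continue outer; // mismatch, try the next start offset
                }
            }
            return true; // every marker byte matched at offset i
        }
        return false;
    }

    /** Step 8 precondition: extract 'Package' only when both checks pass. */
    static boolean shouldExtractPackage(Collection<String> rootEntryNames, byte[] compObj) {
        return hasRequiredEntries(rootEntryNames) && compObjLooksLikeExcel(compObj);
    }
}
```

In a real extraction the entry names and the CompObj bytes would come from the POIFS root described in step 5; here they are plain parameters so the checks can be exercised in isolation.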