[ https://issues.apache.org/jira/browse/TIKA-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234291#comment-13234291 ]
Maxim Valyanskiy commented on TIKA-877:
---------------------------------------

I think it is not a real problem, because "file5" is an invalid Ole10Native attachment. Tika 1.0 saves the internal data stream of that entry, prepended by some headers that it could not parse. The current (trunk) version saves the complete Ole10Native stream when the entry is not valid.

> Embedded document not extracted (regression)
> --------------------------------------------
>
>                 Key: TIKA-877
>                 URL: https://issues.apache.org/jira/browse/TIKA-877
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Maxim Valyanskiy
>            Priority: Blocker
>              Labels: regression
>             Fix For: 1.1
>
>         Attachments: coffee.xls
>
>
> Testing the 1.1 rc, I believe I found a regression, hence the priority.
> {noformat}
> dbonniot-t520 /tmp/1.0 java -jar ../tika-app-1.0.jar -z ../coffee.xls
> Extracting 'file0.wmf' (application/x-msmetafile)
> Extracting 'file1.wmf' (application/x-msmetafile)
> Extracting 'file2.wmf' (application/x-msmetafile)
> Extracting 'file3.wmf' (application/x-msmetafile)
> Extracting 'file4.png' (image/png)
> Extracting 'MBD002B040A.wps' (application/vnd.ms-works)
> Extracting 'file5.bin' (application/octet-stream)
> Extracting 'MBD00262FE3.unknown' (application/x-tika-msoffice)
> dbonniot-t520 /tmp/1.0 cd ../1.1
> dbonniot-t520 /tmp/1.1 java -jar ../tika-app-1.1.jar -z ../coffee.xls
> Extracting 'file0.emf' (application/x-emf)
> Extracting 'file1.emf' (application/x-emf)
> Extracting 'file2.emf' (application/x-emf)
> Extracting 'file3.emf' (application/x-emf)
> Extracting 'file4.png' (image/png)
> Extracting 'MBD002B040A.wps' (application/vnd.ms-works)
> Extracting 'file5' (application/x-tika-msoffice-embedded)
> Extracting 'MBD00262FE3.unknown' (application/x-tika-msoffice)
> dbonniot-t520 /tmp/1.1 ls -l ../1.0/file5.bin ../1.1/file5
> -rw-r--r-- 1 dbonniot dbonniot 2519 2012-03-18 21:51 ../1.0/file5.bin
> -rw-r--r-- 1 dbonniot dbonniot 0 2012-03-18 21:51 ../1.1/file5
> {noformat}
> Notice how 1.0 could extract the data for file5, but 1.1 creates an empty
> file instead.
> By the way, I do see improvements in 1.1 as well, congrats for that!
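
For anyone comparing the two extraction directories, a minimal sketch (not part of Tika) of how the symptom in the `ls -l` listing could be checked automatically: it scans a directory produced by `-z` and reports zero-length files. The class name `EmptyExtractCheck` and the method `findEmptyFiles` are hypothetical, and only the Java standard library is used; no Tika API is assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a Tika class: lists extracted files that came out empty.
public class EmptyExtractCheck {

    /** Returns the regular files directly under dir whose size is zero, sorted by path. */
    public static List<Path> findEmptyFiles(Path dir) throws IOException {
        List<Path> sorted;
        // Files.list holds a directory handle, so close it with try-with-resources.
        try (Stream<Path> entries = Files.list(dir)) {
            sorted = entries.sorted().collect(Collectors.toList());
        }
        List<Path> empty = new ArrayList<>();
        for (Path p : sorted) {
            if (Files.isRegularFile(p) && Files.size(p) == 0) {
                empty.add(p);
            }
        }
        return empty;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findEmptyFiles(dir)) {
            System.out.println("empty: " + p.getFileName());
        }
    }
}
```

Run against the two directories from the report (for example `java EmptyExtractCheck /tmp/1.1`), this would be expected to flag `file5` in the 1.1 output and nothing in the 1.0 output, assuming the files are as shown in the listing.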