[ 
https://issues.apache.org/jira/browse/TIKA-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234291#comment-13234291
 ] 

Maxim Valyanskiy commented on TIKA-877:
---------------------------------------

I don't think this is a real problem, because "file5" is an invalid 
Ole10Native attachment.

Tika 1.0 saved the internal data stream of that entry, prepended by some 
headers that it could not parse. The current (trunk) version saves the 
complete Ole10Native stream when the entry is not valid.
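
To illustrate the fallback described above, here is a minimal, self-contained 
sketch. It is not Tika's or POI's actual code. The class and method names are 
hypothetical, and the toy layout (a 4-byte little-endian length followed by the 
payload) is an assumption that stands in for the real, more complex Ole10Native 
structure. It only shows the general idea: try to parse the entry, and if the 
entry is not valid, save the complete stream rather than nothing.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical sketch: NOT the Tika/POI implementation.
class EmbeddedFallbackSketch {

    // Returns the parsed payload, or the complete raw stream if parsing fails.
    static byte[] extract(byte[] stream) {
        try {
            return parsePayload(stream);
        } catch (IllegalArgumentException e) {
            // Entry is not valid: keep everything so no data is lost.
            return stream.clone();
        }
    }

    // Toy layout (an assumption): 4-byte little-endian length, then payload.
    static byte[] parsePayload(byte[] stream) {
        if (stream.length < 4) {
            throw new IllegalArgumentException("stream too short for length prefix");
        }
        int declared = ByteBuffer.wrap(stream, 0, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        if (declared < 0 || declared > stream.length - 4) {
            throw new IllegalArgumentException("declared length does not fit stream");
        }
        return Arrays.copyOfRange(stream, 4, 4 + declared);
    }

    public static void main(String[] args) {
        byte[] valid = {3, 0, 0, 0, 'a', 'b', 'c'};   // length 3 matches payload
        byte[] invalid = {9, 0, 0, 0, 'a', 'b', 'c'}; // length 9 exceeds payload
        System.out.println(extract(valid).length);    // prints 3
        System.out.println(extract(invalid).length);  // prints 7
    }
}
```

With this approach an invalid entry produces a file containing the raw stream, 
including any headers, instead of an empty file.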
                
> Embedded document not extracted (regression)
> --------------------------------------------
>
>                 Key: TIKA-877
>                 URL: https://issues.apache.org/jira/browse/TIKA-877
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Maxim Valyanskiy
>            Priority: Blocker
>              Labels: regression
>             Fix For: 1.1
>
>         Attachments: coffee.xls
>
>
> Testing the 1.1 rc, I believe I found a regression, hence the priority.
> {noformat}
> dbonniot-t520 /tmp/1.0 java -jar ../tika-app-1.0.jar -z ../coffee.xls 
> Extracting 'file0.wmf' (application/x-msmetafile)
> Extracting 'file1.wmf' (application/x-msmetafile)
> Extracting 'file2.wmf' (application/x-msmetafile)
> Extracting 'file3.wmf' (application/x-msmetafile)
> Extracting 'file4.png' (image/png)
> Extracting 'MBD002B040A.wps' (application/vnd.ms-works)
> Extracting 'file5.bin' (application/octet-stream)
> Extracting 'MBD00262FE3.unknown' (application/x-tika-msoffice)
> dbonniot-t520 /tmp/1.0 cd ../1.1
> dbonniot-t520 /tmp/1.1 java -jar ../tika-app-1.1.jar -z ../coffee.xls 
> Extracting 'file0.emf' (application/x-emf)
> Extracting 'file1.emf' (application/x-emf)
> Extracting 'file2.emf' (application/x-emf)
> Extracting 'file3.emf' (application/x-emf)
> Extracting 'file4.png' (image/png)
> Extracting 'MBD002B040A.wps' (application/vnd.ms-works)
> Extracting 'file5' (application/x-tika-msoffice-embedded)
> Extracting 'MBD00262FE3.unknown' (application/x-tika-msoffice)
> dbonniot-t520 /tmp/1.1 ls -l ../1.0/file5.bin ../1.1/file5 
> -rw-r--r-- 1 dbonniot dbonniot 2519 2012-03-18 21:51 ../1.0/file5.bin
> -rw-r--r-- 1 dbonniot dbonniot    0 2012-03-18 21:51 ../1.1/file5
> {noformat}
> Notice how 1.0 could extract the data for file5, but 1.1 creates an empty 
> file instead.
> By the way, I do see improvements in 1.1 as well, congrats for that!


        
