[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1010:
------------------------------

    Attachment: xls_attachment_example.zip

Many thanks to [~cbamford] for pointing out this issue... When an xls file is attached to an RTF file and then extracted, the XLS workbook is made "hidden" and the size of the workbook can be much, much larger, especially if there is an embedded image. The same thing happens when the RTF file is saved as a DOCX, and then the xls file is extracted via the zip interface.

I'm attaching a zip of an original xls file and the resulting xls file after it was embedded and then extracted from an RTF.

For this issue, I propose ignoring this difference. If we want to add extra processing for "unhiding" workbooks, let's do that later in a separate issue. Any strong objections?

> Embedded documents in RTF are not extracted
> -------------------------------------------
>
>                 Key: TIKA-1010
>                 URL: https://issues.apache.org/jira/browse/TIKA-1010
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Tim Allison
>         Attachments: ExampleRTFs.zip, TIKA-1010.patch, TIKA-1010_patch.zip,
> outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip,
> xls_attachment_example.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata
> 0105000002000000080000005061636b61676500000000000000000066000000
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000000030022000000433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b00000048656c6c6f20576f726c64000001050000050000000d0000004d45544146494c455049435400
> 54040000bbfaffffee0000000800540445050000
> 0100090000037300000002001c0000000000050000000b0200000000050000000c02320029001c000000fb02f5ff000000000000900100000001000000005461686f6d61000055170a7000fc070058b1f37761b1f3772040f57749366683040000002d01000005000000090200000000050000000102ffffff0005000000
> 020101000000050000002e0106000000090000002105060048772e747874210015001c000000fb021000070000000000bc02000000000102022253797374656d00004936668300000a0026008a0100000000ffffffff8cfc0700040000002d010100030000000000}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 00000000 01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b |............Pack|
> 00000010 61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 |age.........f...|
> 00000020 02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 |..Hw.txt.C:\DOCU|
> 00000030 4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b |ME~1\igalsh\Desk|
> 00000040 74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 |top\HW.txt....."|
> 00000050 00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 |...C:\DOCUME~1\i|
> 00000060 67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 |galsh\Desktop\HW|
> 00000070 2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 |.txt.....Hello W|
> 00000080 6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00 |orld............|
> 00000090 00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54 |..METAFILEPICT.T|
> 000000a0 04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45 |.............T.E|
> 000000b0 05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c |.........s......|
> 000000c0 00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05 |................|
> 000000d0 00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5 |.....2.)........|
> 000000e0 ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00 |................|
> 000000f0 00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07 |.Tahoma..U..p...|
> 00000100 00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66 |.X..wa..w @.wI6f|
> 00000110 83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00 |.....-..........|
> 00000120 00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00 |................|
> 00000130 00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00 |................|
> 00000140 00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21 |.....!...Hw.txt!|
> 00000150 00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00 |................|
> 00000160 00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65 |.........."Syste|
> 00000170 6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00 |m..I6f.....&....|
> 00000180 00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d |...............-|
> 00000190 01 01 00 03 00 00 00 00 00 |.........|
> 00000199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!
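
One possible reading of the first 0x86 bytes of the hex dump quoted above, offered as a rough sketch rather than a definitive decoder: the leading fields look like an OLE 1.0 embedded-object header as described in Microsoft's MS-OLEDS document (version, format ID, three length-prefixed strings, then a native data size of 0x66). The layout assumed for the "Package" native data that follows (a 2-byte header, two NUL-terminated strings, 4 unexplained bytes, a length-prefixed path, then the length-prefixed file content) is inferred only from this one dump, not from any spec. The function names are hypothetical and are not part of Tika.

```python
import struct

# Leading 0x86 bytes of the \objdata payload quoted in the issue.
PATH_HEX = "433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e74787400"
OBJDATA_HEX = (
    "01050000" "02000000"            # OLEVersion, FormatID
    "08000000" "5061636b61676500"    # ClassName: length 8, "Package\0"
    "00000000" "00000000"            # TopicName and ItemName: both empty
    "66000000"                       # NativeDataSize: 0x66 = 102 bytes
    "0200" "48772e74787400"          # native data: 2-byte header, "Hw.txt\0"
    + PATH_HEX                       # NUL-terminated source path
    + "00000300"                     # 4 bytes, meaning unknown (assumption)
    + "22000000" + PATH_HEX          # length-prefixed path (34 bytes with NUL)
    + "0b000000" "48656c6c6f20576f726c64"  # content length 11, "Hello World"
    + "0000"                         # trailing bytes, meaning unknown
)


def read_u32(buf, pos):
    """Read a little-endian unsigned 32-bit integer; return (value, new_pos)."""
    return struct.unpack_from("<I", buf, pos)[0], pos + 4


def read_lp_string(buf, pos):
    """Read a length-prefixed string whose length includes the trailing NUL."""
    n, pos = read_u32(buf, pos)
    return buf[pos:pos + n].rstrip(b"\x00").decode("latin-1"), pos + n


def read_cstring(buf, pos):
    """Read a NUL-terminated string; return (text, position after the NUL)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("latin-1"), end + 1


def parse_ole1_header(buf):
    """Return (class_name, native_data) from an OLE 1.0 embedded object."""
    pos = 0
    _version, pos = read_u32(buf, pos)
    _format_id, pos = read_u32(buf, pos)
    class_name, pos = read_lp_string(buf, pos)
    _topic, pos = read_lp_string(buf, pos)
    _item, pos = read_lp_string(buf, pos)
    size, pos = read_u32(buf, pos)
    return class_name, buf[pos:pos + size]


def parse_package(native):
    """Return (label, source_path, content) under the assumed Package layout."""
    pos = 2                                   # skip the 2-byte header
    label, pos = read_cstring(native, pos)
    src_path, pos = read_cstring(native, pos)
    pos += 4                                  # skip the 4 unexplained bytes
    _path_again, pos = read_lp_string(native, pos)
    data_len, pos = read_u32(native, pos)
    return label, src_path, native[pos:pos + data_len]


if __name__ == "__main__":
    class_name, native = parse_ole1_header(bytes.fromhex(OBJDATA_HEX))
    print(class_name, *parse_package(native))
```

Under these assumptions the sketch recovers the class name Package, the label Hw.txt, the original path, and the 11-byte file content. The bytes from offset 0x86 onward begin with another 0x0501 version field followed by "METAFILEPICT", which suggests a presentation (icon) object that a text extractor could probably skip.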