[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954819#comment-13954819 ]
Chris Bamford commented on TIKA-1010:
-------------------------------------

Hi Tim

Are you saying you would like to test against RTFs with embedded xls, xlsx, doc, docx, ppt, pptx, pdf files? If so I would be happy to create them ...

Let me know

- Chris

> Embedded documents in RTF are not extracted
> -------------------------------------------
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf,
> testRTF_embbededFiles.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata
> 0105000002000000080000005061636b61676500000000000000000066000000
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000000030022000000433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b00000048656c6c6f20576f726c64000001050000050000000d0000004d45544146494c455049435400
> 54040000bbfaffffee0000000800540445050000
> 0100090000037300000002001c0000000000050000000b0200000000050000000c02320029001c000000fb02f5ff000000000000900100000001000000005461686f6d61000055170a7000fc070058b1f37761b1f3772040f57749366683040000002d01000005000000090200000000050000000102ffffff0005000000
> 020101000000050000002e0106000000090000002105060048772e747874210015001c000000fb021000070000000000bc02000000000102022253797374656d00004936668300000a0026008a0100000000ffffffff8cfc0700040000002d010100030000000000}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 00000000 01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b |............Pack|
> 00000010 61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 |age.........f...|
> 00000020 02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 |..Hw.txt.C:\DOCU|
> 00000030 4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b |ME~1\igalsh\Desk|
> 00000040 74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 |top\HW.txt....."|
> 00000050 00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 |...C:\DOCUME~1\i|
> 00000060 67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 |galsh\Desktop\HW|
> 00000070 2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 |.txt.....Hello W|
> 00000080 6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00 |orld............|
> 00000090 00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54 |..METAFILEPICT.T|
> 000000a0 04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45 |.............T.E|
> 000000b0 05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c |.........s......|
> 000000c0 00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05 |................|
> 000000d0 00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5 |.....2.)........|
> 000000e0 ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00 |................|
> 000000f0 00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07 |.Tahoma..U..p...|
> 00000100 00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66 |.X..wa..w @.wI6f|
> 00000110 83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00 |.....-..........|
> 00000120 00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00 |................|
> 00000130 00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00 |................|
> 00000140 00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21 |.....!...Hw.txt!|
> 00000150 00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00 |................|
> 00000160 00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65 |.........."Syste|
> 00000170 6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00 |m..I6f.....&....|
> 00000180 00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d |...............-|
> 00000190 01 01 00 03 00 00 00 00 00 |.........|
> 00000199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
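
Editorial sketch, not part of the original thread: reading the hex dump in the quoted issue, the first 0x86 bytes look like a run of little-endian 32-bit integers and length-prefixed or NUL-terminated ASCII strings. The Python below walks that layout for this one sample. The layout is inferred only from this dump, not confirmed against any specification, and every field name (`version`, `format_id`, `label`, `temp_path`, and so on) is a hypothetical label chosen for illustration. The meaning of the skipped bytes is unknown.

```python
import struct

# First 0x86 bytes of the \objdata hex from the issue. The trailing
# METAFILEPICT block is left out for brevity.
OBJDATA_HEX = (
    "0105000002000000080000005061636b"
    "61676500000000000000000066000000"
    "020048772e74787400433a5c444f4355"
    "4d457e315c6967616c73685c4465736b"
    "746f705c48572e747874000000030022"
    "000000433a5c444f43554d457e315c69"
    "67616c73685c4465736b746f705c4857"
    "2e747874000b00000048656c6c6f2057"
    "6f726c640000"
)


def read_u32(buf, pos):
    """Read a little-endian unsigned 32-bit integer; return (value, new_pos)."""
    return struct.unpack_from("<I", buf, pos)[0], pos + 4


def read_prefixed_string(buf, pos):
    """Read a string preceded by its 32-bit byte length (NUL included)."""
    length, pos = read_u32(buf, pos)
    raw = buf[pos:pos + length]
    return raw.rstrip(b"\x00").decode("ascii"), pos + length


def read_cstring(buf, pos):
    """Read a NUL-terminated ASCII string."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("ascii"), end + 1


def parse_package(data):
    """Parse the assumed layout; field names are guesses, not spec terms."""
    pos = 0
    version, pos = read_u32(data, pos)
    format_id, pos = read_u32(data, pos)
    class_name, pos = read_prefixed_string(data, pos)
    topic, pos = read_prefixed_string(data, pos)   # empty in this sample
    item, pos = read_prefixed_string(data, pos)    # empty in this sample
    native_size, pos = read_u32(data, pos)
    native = data[pos:pos + native_size]

    p = 2                                  # two leading bytes, meaning unknown
    label, p = read_cstring(native, p)
    source_path, p = read_cstring(native, p)
    p += 4                                 # four bytes, meaning unknown
    temp_path, p = read_prefixed_string(native, p)
    content_len, p = read_u32(native, p)
    content = native[p:p + content_len]
    return {
        "version": version,
        "format_id": format_id,
        "class_name": class_name,
        "native_size": native_size,
        "label": label,
        "source_path": source_path,
        "temp_path": temp_path,
        "content": content,
    }


result = parse_package(bytes.fromhex(OBJDATA_HEX))
for key, value in result.items():
    print(key, "=", value)
```

On this sample the sketch recovers the class name, the label, both paths and the embedded text that are visible in the ASCII column of the dump. Whether other Package objects, or other \objclass values, follow the same layout is untested.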