[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971622#comment-13971622 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

Great to hear. Thank you for your help in submitting test documents and offering feedback! I'll commit a slightly updated patch tonight or tomorrow.

I'd recommend asking on the tika-users list about plans for 1.6 or if there is a nightly build option through Maven. I know that the "nightly" jenkins build has not been working so well.

> Embedded documents in RTF are not extracted
> -------------------------------------------
>
>          Key: TIKA-1010
>          URL: https://issues.apache.org/jira/browse/TIKA-1010
>      Project: Tika
>   Issue Type: Bug
>   Components: parser
>     Reporter: Michael McCandless
>     Assignee: Tim Allison
>  Attachments: ExampleRTFs.zip, TIKA-1010.patch, TIKA-1010_patch.zip,
>               outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip,
>               xls_attachment_example.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata
> 0105020008005061636b616765006600
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
> 5404bbfaee00080054044505
> 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
> 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 0000  01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b  |............Pack|
> 0010  61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00  |age.........f...|
> 0020  02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
> 0030  4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
> 0040  74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22  |top\HW.txt....."|
> 0050  00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
> 0060  67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
> 0070  2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
> 0080  6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00  |orld............|
> 0090  00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
> 00a0  04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45  |.............T.E|
> 00b0  05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c  |.........s......|
> 00c0  00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05  |................|
> 00d0  00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5  |.....2.)........|
> 00e0  ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00  |................|
> 00f0  00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
> 0100  00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
> 0110  83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00  |.....-..........|
> 0120  00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00  |................|
> 0130  00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00  |................|
> 0140  00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
> 0150  00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00  |................|
> 0160  00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65  |.........."Syste|
> 0170  6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
> 0180  00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d  |...............-|
> 0190  01 01 00 03 00 00 00 00 00                       |.........|
> 0199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
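As a rough illustration of the very first step the issue description implies, turning the \objdata hex text back into raw bytes, here is a minimal Python sketch. The function name objdata_to_bytes and the sample string are assumptions made for this example; the sample is simply the first 16 bytes of the hex dump in the issue, re-wrapped the way an RTF writer might wrap them. It is not meant to represent how Tika's RTF parser does this.

```python
def objdata_to_bytes(hex_text: str) -> bytes:
    """Decode the hex text of an RTF objdata destination into raw bytes."""
    # RTF writers wrap the hex digits across lines, so drop all whitespace first.
    digits = "".join(hex_text.split())
    return bytes.fromhex(digits)


# Hypothetical sample: the first 16 bytes of the hex dump from the issue,
# split over two lines with a stray space to mimic RTF line wrapping.
sample = """01050000020000000800
0000 5061636b"""

data = objdata_to_bytes(sample)
print(len(data), data[12:16])
```

Anything after this step (headers, lengths, payload) works on the decoded bytes rather than on the hex text, which is why the whitespace has to be removed before decoding.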
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970597#comment-13970597 ]

Chris Bamford commented on TIKA-1010:
-------------------------------------

Tim

I have done a lot of testing now and am very happy with the new functionality. Assuming others have no objections, when could it be made available in a Maven release?

Cheers,

- Chris
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967649#comment-13967649 ]

Peter Hamelberg commented on TIKA-1010:
---------------------------------------

The RTF objdata destination contains the object data in OLE1 format. The format is described in the document "MS-OLEDS: Object Linking and Embedding (OLE) Data Structures: Structure Specification" [http://msdn.microsoft.com/en-us/library/dd942265.aspx].

The data starts with an ObjectHeader structure. See Chapter 2.2.4 [http://msdn.microsoft.com/en-us/library/dd942076.aspx] and the following, depending on the FormatID value.
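To make the pointer to MS-OLEDS a little more concrete, the sketch below reads the ObjectHeader fields as I understand them from that document (OLEVersion, FormatID, ClassName, TopicName, ItemName) plus the NativeDataSize that follows for an embedded object. The input is the first 0x86 bytes of the hex dump quoted in the issue description. All helper names are hypothetical, latin-1 decoding is an assumption, and the walk through the Package payload in parse_package is inferred from that single dump rather than taken from any specification.

```python
import struct

# First 0x86 bytes of the hex dump quoted in the issue description.
DUMP = bytes.fromhex(
    "01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b "
    "61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 "
    "02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 "
    "4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b "
    "74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 "
    "00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 "
    "67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 "
    "2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 "
    "6f 72 6c 64 00 00"
)


def read_u32(buf, pos):
    """Read a little-endian unsigned 32-bit value; return (value, next_pos)."""
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4


def read_lp_string(buf, pos):
    """Read a u32 length followed by that many bytes (trailing NUL stripped)."""
    length, pos = read_u32(buf, pos)
    raw = buf[pos:pos + length]
    # latin-1 is an assumption; the real code page is not stated in the thread.
    return raw.rstrip(b"\x00").decode("latin-1"), pos + length


def read_cstring(buf, pos):
    """Read a NUL-terminated string; return (text, next_pos)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("latin-1"), end + 1


def parse_object_header(buf):
    """Parse the ObjectHeader plus the NativeDataSize of an embedded object."""
    pos = 0
    header = {}
    header["ole_version"], pos = read_u32(buf, pos)
    header["format_id"], pos = read_u32(buf, pos)  # 2 in this dump (embedded)
    header["class_name"], pos = read_lp_string(buf, pos)
    header["topic_name"], pos = read_lp_string(buf, pos)
    header["item_name"], pos = read_lp_string(buf, pos)
    header["native_size"], pos = read_u32(buf, pos)
    return header, pos


def parse_package(native):
    """Walk a Package payload. Layout INFERRED from the dump, not from a spec."""
    pos = 2  # skip the two leading bytes (02 00 in this dump)
    label, pos = read_cstring(native, pos)
    path, pos = read_cstring(native, pos)
    pos += 4  # skip four bytes of unknown meaning (00 00 03 00 in this dump)
    _second_path, pos = read_lp_string(native, pos)
    size, pos = read_u32(native, pos)
    return {"label": label, "path": path, "content": native[pos:pos + size]}


header, offset = parse_object_header(DUMP)
package = parse_package(DUMP[offset:offset + header["native_size"]])
print(header["class_name"], header["native_size"], package["content"])
```

The bytes after the native data in the full dump begin with another 01 05 00 00 header and the string METAFILEPICT, which looks like the presentation (icon) part of the object; the sketch deliberately stops before it.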
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965289#comment-13965289 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

Interesting... Yes, my untested belief is that with the current code you will always extract the same (MD5 match) xls file from a given RTF. However, the extracted file may be quite different from the file that was originally embedded into the RTF. I think Chris found that there are also differences with other MSOffice formats between the file before being embedded and after being extracted from the RTF.
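The two properties discussed here, repeatable extraction versus identity with the originally embedded file, can be checked separately. The following Python sketch shows one way to do that with MD5 digests. The names is_repeatable, matches_original, and toy_extract are hypothetical, and toy_extract is only a stand-in for a real extractor; nothing here calls Tika.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def is_repeatable(extract, container: bytes, runs: int = 3) -> bool:
    """True if extract(container) yields byte-identical output on every run."""
    digests = {md5_hex(extract(container)) for _ in range(runs)}
    return len(digests) == 1


def matches_original(extracted: bytes, original: bytes) -> bool:
    """True only if the extracted bytes hash the same as the pre-embedding file."""
    return md5_hex(extracted) == md5_hex(original)


# Hypothetical stand-in for a real extractor: it returns a fixed slice of the
# container, so it is repeatable, but its output differs from the "original".
def toy_extract(container: bytes) -> bytes:
    return container[4:]


container = b"RTF:rewritten-xls-bytes"
print(is_repeatable(toy_extract, container))
print(matches_original(toy_extract(container), b"original-xls-bytes"))
```

An extractor can pass the first check and still fail the second, which is the situation described above: the same bytes come out of a given RTF every time, yet they need not equal the file as it existed before embedding.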
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965283#comment-13965283 ]

Luis Filipe Nassif commented on TIKA-1010:
------------------------------------------

If you extract the embedded xls file from the same RTF again, will the extracted xls files have the same MD5 hash? From a forensic application perspective, that is desirable, at least. I have seen apps that extract a different file each time the embedded OLE object is extracted.
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962059#comment-13962059 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

Hmmm... In the zip file from April 3, there should be three test files:
{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}

As a triple check, I applied the patch from 4/7 to a fresh checkout from trunk, and I put those three files in test-documents. I had a build success (with an embarrassing println left in RTFObjDataParser...the horror!...did I mention that clean up and a few optimizations remain?).
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961821#comment-13961821 ]

Chris Bamford commented on TIKA-1010:
-------------------------------------

Thanks Tim, all compiles now. I think at least one test file is missing though (for RTFParserTest):

{code}
/test-documents/testRTFEmbeddedFiles.rtf
/test-documents/testRTFEmbeddedLink.rtf
{code}

I tried the "testRTF_embbededFiles.rtf" found in "testRTF_embbededFiles.zip" attached to this ticket as the former and that seems to work fine. However, I cannot find anything suitable for the latter. Please can you provide when you get a mo'?

I'm digging into the code now to see how it works :-)
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959988#comment-13959988 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

trunk

svn co http://svn.apache.org/repos/asf/tika/trunk
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959978#comment-13959978 ]

Chris Bamford commented on TIKA-1010:
-

Hi Tim

I am about to try the patch. Which version of Tika does it apply to? Presumably I download that version and then apply the patch?

Thanks

- Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, TIKA-1010_patch.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956928#comment-13956928 ]

Tim Allison commented on TIKA-1010:
---

Absolutely, this is more of a question for the tika-users list. One option is to implement EmbeddedResourceHandler and then call it with something like this (take care to do better error handling!):
{noformat}
// ContainerExtractor, ParserContainerExtractor: org.apache.tika.extractor
// TikaInputStream: org.apache.tika.io
// TrackingHandler: your own EmbeddedResourceHandler implementation
TrackingHandler tracker = new TrackingHandler();
TikaInputStream tis = null;
try {
    ContainerExtractor ex = new ParserContainerExtractor();
    tis = TikaInputStream.get(inputstream);
    ex.extract(tis, ex, tracker);
} finally {
    // tis is still null if TikaInputStream.get() threw
    if (tis != null) {
        tis.close();
    }
}
{noformat}
For a simple TrackingHandler, see AbstractPOIContainerExtractionTest in org.apache.tika.parser.microsoft (test/.../parsers).

I've been delayed on other projects. I'm wrapping up today and will post a rough patch tomorrow.

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
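On the question in this thread of where extracted files get written: with the EmbeddedResourceHandler approach, the calling code decides. The sketch below is only the file-writing part that such a handler might delegate to from its handle(filename, mediaType, stream) callback; it uses the JDK alone, so it runs without Tika on the classpath. EmbeddedFileSaver, save, baseName, and demo are hypothetical names, and the fallback name embedded-N.bin is an arbitrary choice. Stripping the directory part matters because names recovered from RTF can be full Windows paths, as in the hex dump in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika: writes each embedded stream into a
// directory chosen by the caller.
class EmbeddedFileSaver {
    private final Path outputDir;
    private int counter = 0;

    EmbeddedFileSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Keeps only the last path segment, treating both '/' and '\' as separators.
    static String baseName(String name) {
        if (name == null) return "";
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        return name.substring(cut + 1).trim();
    }

    // Copies the stream to outputDir and returns the file that was written.
    // An existing file with the same name is overwritten.
    Path save(String filename, InputStream stream) throws IOException {
        counter++;
        String base = baseName(filename);
        if (base.isEmpty() || base.equals(".") || base.equals("..")) {
            base = "embedded-" + counter + ".bin";   // no usable name was supplied
        }
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(base);
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    // Small self-check: saves the sample payload under the sample path.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("rtf-embedded");
            EmbeddedFileSaver saver = new EmbeddedFileSaver(dir);
            byte[] data = "Hello World".getBytes(StandardCharsets.US_ASCII);
            Path p = saver.save("C:\\DOCUME~1\\igalsh\\Desktop\\HW.txt",
                                new ByteArrayInputStream(data));
            return p.getFileName() + ":"
                + new String(Files.readAllBytes(p), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A real handler would probably also want to avoid silently overwriting files that share a name, for example by prefixing the counter.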
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956274#comment-13956274 ]

Chris Bamford commented on TIKA-1010:
-

Tim

A quick question - where do the extracted files get written? Can it be specified?

- Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955104#comment-13955104 ]

Tim Allison commented on TIKA-1010:
---

Thank you, Chris. The test doc that I posted late last week (testRTF_embbededFiles.zip) has examples of embedded (doc|ppt|xls)x? files. The issue is that with those files (and pdfs), I don't think that the name of the file is stored in any metadata in the file (nor in the POIFS embobj itself).

I just wanted to send a heads up about this apparent limitation in RTF. There are two other places that I still need to check to confirm this: one is the {\result {\shppict}} that immediately follows the embobj (this is the thumbnail pict), and the other is the \nonshppict that can also follow an embobj. It is entirely possible that the info is stored in POIFS or elsewhere, and I'm just not seeing it.

I did add processing to pull the name of the file from:
1) the embobj header if it exists (which it seems to for files other than MS/PDF) and
2) the pict's metadata {sp {sn}{sv}}.

I think I've changed my mind on handling thumbnails separately. My current plan is to extract all embedded data. I'll add info to the Metadata obj about whether the file is an embobj, or a thumbnail {\result {\shppict }} or {\result {\nonshppict}}. The client code can then decide what to do with the embedded data.

I'll add binary processing (thank you for your pointer to TIKA-782!), and post a draft of the patch late this afternoon or tomorrow.

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
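To illustrate the plan of tagging each extracted item as an embobj or a thumbnail and letting client code decide, here is a minimal client-side sketch. Everything in it is hypothetical: the thread does not say what metadata key or values the patch will use, so KIND_KEY, the Kind enum, and the use of a plain Map in place of Tika's Metadata object are stand-ins chosen only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side filter; key and value names are invented for illustration.
class EmbeddedKindFilter {
    static final String KIND_KEY = "hypothetical:rtfEmbeddedKind";
    static final String NAME_KEY = "hypothetical:name";

    enum Kind { EMBOBJ, RESULT_SHPPICT, RESULT_NONSHPPICT }

    // True when the item is flagged as one of the two thumbnail forms.
    static boolean isThumbnail(Map<String, String> metadata) {
        String kind = metadata.get(KIND_KEY);
        return Kind.RESULT_SHPPICT.name().equals(kind)
            || Kind.RESULT_NONSHPPICT.name().equals(kind);
    }

    // Keeps embedded objects and unflagged items (e.g. regular images), drops thumbnails.
    static List<Map<String, String>> withoutThumbnails(List<Map<String, String>> items) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> item : items) {
            if (!isThumbnail(item)) kept.add(item);
        }
        return kept;
    }

    static Map<String, String> entry(String name, Kind kind) {
        Map<String, String> m = new HashMap<>();
        m.put(NAME_KEY, name);
        if (kind != null) m.put(KIND_KEY, kind.name());
        return m;
    }

    // Returns the names that survive the filter for a small made-up input.
    static List<String> demo() {
        List<Map<String, String>> items = new ArrayList<>();
        items.add(entry("report.xls", Kind.EMBOBJ));
        items.add(entry("report-thumb.emf", Kind.RESULT_SHPPICT));
        items.add(entry("photo.jpg", null));
        items.add(entry("report-thumb.wmf", Kind.RESULT_NONSHPPICT));
        List<String> names = new ArrayList<>();
        for (Map<String, String> item : withoutThumbnails(items)) {
            names.add(item.get(NAME_KEY));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the design is that the parser stays neutral and emits everything, while each client picks its own policy with a filter of this kind.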
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955048#comment-13955048 ]

Chris Bamford commented on TIKA-1010:
-

Hi Tim

I have created an RTF with 5 embedded office docs in it, but it is too large to attach to the ticket (>10 MB). If you're interested, we'll have to find some other way to get it to you ...

Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954819#comment-13954819 ]

Chris Bamford commented on TIKA-1010:
-

Hi Tim

Are you saying you would like to test against RTFs with embedded xls, xlsx, doc, docx, ppt, pptx, pdf files? If so I would be happy to create them ...

Let me know

- Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950835#comment-13950835 ]

Chris Bamford commented on TIKA-1010:
-

Hi Tim

I have found one - please see https://issues.apache.org/jira/browse/TIKA-782 (logo.rtf inside logo.zip).

Best,

- Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951116#comment-13951116 ]

Tim Allison commented on TIKA-1010:
---

As a side note, I can grab file names for:
1) images that are regular images (not embedded objdata)
2) non-POIFS embedded files (html, embedded objdata images, zip, msg)

I can't find file names for:
xls, xlsx, doc, docx, ppt, pptx, pdf

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951107#comment-13951107 ]

Tim Allison commented on TIKA-1010:
---

Yes, thanks, I got that. I can add an "extract all" mode vs "extract logical". I'll set the default to "extract logical" unless there are objections from the community.

I now have success against the two files I posted earlier today. When I add the "extract all vs logical" parameter, I can use Mike's testBinControlWord (from TIKA-782) as a test.

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951092#comment-13951092 ]

Chris Bamford commented on TIKA-1010:
---

Ideally I'd like to be able to extract any file, but let's get the main cases covered first!

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951097#comment-13951097 ]

Chris Bamford commented on TIKA-1010:
---

The binary actually looks like this:
{noformat}
{pict {objdata \bin270141 .
{noformat}
The 270141 after \bin is the number of bytes to read (the size of the blob). Note that the blob can (and in this case does!) contain '}' bytes, which are actual binary data rather than group end markers.

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
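A rough illustration of the \bin point above, not Tika's actual parser: the Python sketch below walks an RTF byte string, tracks group depth, and on a \binN control word copies the next N bytes verbatim, so a '}' inside the blob is never mistaken for a group end. The function name scan_rtf and the sample input are made up for this example, and the handling of other control words is deliberately simplistic.

```python
import re

# \bin followed by a decimal byte count and an optional single delimiter space.
BIN_RE = re.compile(rb"\\bin(\d+) ?")


def scan_rtf(data: bytes):
    """Return (final_group_depth, list_of_binary_payloads) for an RTF byte string."""
    depth = 0
    payloads = []
    i = 0
    while i < len(data):
        c = data[i:i + 1]
        if c == b"{":
            depth += 1
            i += 1
        elif c == b"}":
            depth -= 1
            i += 1
        elif c == b"\\":
            m = BIN_RE.match(data, i)
            if m:
                count = int(m.group(1))
                start = m.end()
                # Copy exactly `count` raw bytes; braces in here are data.
                payloads.append(data[start:start + count])
                i = start + count
            else:
                # Skip the backslash and the next byte (covers \{ \} \\ escapes).
                i += 2
        else:
            i += 1
    return depth, payloads
```

For the input b"{\\pict\\bin5 ab}c}}" this returns depth 0 and the single payload b"ab}c}": the two braces inside the five-byte blob do not change the group depth.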
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951009#comment-13951009 ]

Chris Bamford commented on TIKA-1010:
---

Hi again Tim,

I don't know if this helps, but there is a generic RTF parser kit on GitHub (https://github.com/joniles/rtfparserkit) that knows how to navigate RTF files. I'm playing with it now. It doesn't do much except parse, but it might provide some insight.

Cheers,
- Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004 ]

Tim Allison commented on TIKA-1010:
---

Chris,

Thanks for pointing that out. The objdata in logo.rtf is of type pbrush (bitmap), but it is encoded in the regular hex pairs. I just added handling for that. I'm not sure that this is what is meant by the "binary" pict type you pointed out in the spec.

{noformat}
An RTF file can include pictures created with other applications. These pictures can be in hexadecimal (the default) or binary format. Pictures are destinations, and begin with the \pict control word. The \pict keyword is preceded by \*\shppict destination control keyword as described in the following example. A picture destination has the following syntax:

<pict>          '{' \pict (<brdr>? & <shading>? & <picttype> & <pictsize> & <metafileinfo>?) <data> '}'
<picttype>      | \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile | \wmetafile | \dibitmap <bitmapinfo> | \wbitmap <bitmapinfo>
<bitmapinfo>    \wbmbitspixel & \wbmplanes & \wbmwidthbytes
<pictsize>      (\picw & \pich) \picwgoal? & \pichgoal? \picscalex? & \picscaley? & \picscaled? & \piccropt? & \piccropb? & \piccropr? & \piccropl?
<metafileinfo>  \picbmp & \picbpp
<data>          (\bin #BDATA) | #SDATA
{noformat}

My guess from that is that we'd see something like:
{noformat}
{pict {objdata \bin 0101000100010001
{noformat}

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
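To make the "regular hex pairs" encoding concrete, here is a minimal Python sketch (an illustration only, not the code in the patch): objdata written as hex pairs is plain text, two hex digits per byte, with line breaks that carry no meaning. The helper name decode_hex_pairs is hypothetical.

```python
def decode_hex_pairs(text: str) -> bytes:
    """Decode RTF hex-pair data, ignoring the whitespace RTF writers insert."""
    compact = "".join(text.split())  # drop newlines, spaces, tabs
    if len(compact) % 2:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex(compact)


# 50 61 63 6b 61 67 65 00 is the "Package" class name seen in the hex dump.
sample = decode_hex_pairs("5061636b\n 61676500")
```

Data following a \bin control word, by contrast, is raw bytes and must not be passed through a decoder like this.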
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950714#comment-13950714 ]

Chris Bamford commented on TIKA-1010:
---

Hi Tim,

Sorry about the confusion with the GIFs. They had nothing to do with the case! They were part of my email footer, which Jira automatically attached to the ticket when I replied by email, so I removed them.

Sounds like you're making great progress. I will provide a binary pict file as soon as I can locate one.

Cheers,
- Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950628#comment-13950628 ]

Tim Allison commented on TIKA-1010:
---

Chris,

Thank you for digging into the spec and sharing test files. For some reason, I can't find the GIFs that JIRA reported you attaching earlier today.

Yes, I'm in search of a binary test file. Please share one if you can find it.

I think I'm good on package files. I'll attach my two main test files shortly. I've tested on the files within the zip that you submitted earlier, and all is good.

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf
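As an illustration of the package objects discussed here, the first 32 bytes of the hex dump in the issue description can be read as an OLE 1.0 object header. The Python sketch below assumes the layout from my reading of Microsoft's [MS-OLEDS] document (little-endian version and format ID, then three length-prefixed ANSI strings, then the native data size). The field names and the function parse_ole1_header are hypothetical, and this is not necessarily how the patch does it.

```python
import struct

# First 32 bytes of the hex dump quoted in the issue description.
HEADER = bytes.fromhex(
    "01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b"
    "61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00"
)


def read_lp_string(buf: bytes, pos: int):
    """Read a 4-byte little-endian length, then that many bytes of ANSI text."""
    (length,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    raw = buf[pos:pos + length]
    return raw.rstrip(b"\x00").decode("ascii", "replace"), pos + length


def parse_ole1_header(buf: bytes) -> dict:
    """Parse the assumed OLE 1.0 embedded-object header fields."""
    version, format_id = struct.unpack_from("<II", buf, 0)
    pos = 8
    class_name, pos = read_lp_string(buf, pos)
    topic_name, pos = read_lp_string(buf, pos)
    item_name, pos = read_lp_string(buf, pos)
    (native_size,) = struct.unpack_from("<I", buf, pos)
    return {
        "version": version,
        "format_id": format_id,
        "class_name": class_name,
        "topic_name": topic_name,
        "item_name": item_name,
        "native_size": native_size,
        "data_offset": pos + 4,  # native data starts right after the size
    }


header = parse_ole1_header(HEADER)
```

On these bytes the class name comes out as "Package" and the native data size as 0x66 (102) bytes starting at offset 32, which is where the "Hw.txt" label begins in the dump.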
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943423#comment-13943423 ]

Tim Allison commented on TIKA-1010:
---

In {themedata, I'm seeing the magic bytes 50 4B (PK), the ZIP signature, which is promising. I won't have a chance to work on this for a bit, though.

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Attachments: outer.rtf
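A small Python sketch of why the PK bytes are promising: once the hex pairs are decoded, data that starts with 50 4B can be handed to an ordinary ZIP reader. This is only an illustration under the assumption that the archive begins at the first decoded byte. The function name list_zip_entries is hypothetical, and the real themedata layout may differ.

```python
import io
import zipfile


def list_zip_entries(data: bytes):
    """Return the entry names if data looks like a ZIP archive, else None."""
    if not data.startswith(b"PK"):  # 50 4B, the magic mentioned above
        return None
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return None
    with zipfile.ZipFile(buf) as zf:
        return zf.namelist()
```

Checking the two magic bytes first is cheap and avoids running the ZIP reader over arbitrary embedded data.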
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943404#comment-13943404 ]

Tim Allison commented on TIKA-1010:
---

This might be of use: http://palashray.com/2006/10/25/embedding-an-image-in-rtf-with-java

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless