[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962059#comment-13962059 ]

Tim Allison edited comment on TIKA-1010 at 4/7/14 5:54 PM:
---

Hmmm... Sorry, I should have been clearer... In the last patch update, I only included the code patch. To get the test files, go back to the zip file from April 3, and there should be three test files:
{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}
As a triple check, I applied the patch from 4/7 to a fresh checkout of trunk, and I put those three files in test-documents. I had a build success (with an embarrassing println left in RTFObjDataParser...the horror!...did I mention cleanup and a few optimizations remain?).
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962059#comment-13962059 ]

Tim Allison commented on TIKA-1010:
---

Hmmm... In the zip file from April 3, there should be three test files:
{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}
As a triple check, I applied the patch from 4/7 to a fresh check out from trunk, and I put those three files in test-documents. I had a build success (with embarrassing println left in RTFObjDataParser...the horror!...did I mention clean up and a few optimizations remain?).

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, TIKA-1010.patch, TIKA-1010_patch.zip,
> outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata
> 0105020008005061636b616765006600
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
> 5404bbfaee00080054044505
> 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
> 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 0000 01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b |............Pack|
> 0010 61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 |age.........f...|
> 0020 02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 |..Hw.txt.C:\DOCU|
> 0030 4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b |ME~1\igalsh\Desk|
> 0040 74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 |top\HW.txt....."|
> 0050 00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 |...C:\DOCUME~1\i|
> 0060 67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 |galsh\Desktop\HW|
> 0070 2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 |.txt.....Hello W|
> 0080 6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00 |orld............|
> 0090 00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54 |..METAFILEPICT.T|
> 00a0 04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45 |.............T.E|
> 00b0 05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c |.........s......|
> 00c0 00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05 |................|
> 00d0 00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5 |.....2.)........|
> 00e0 ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00 |................|
> 00f0 00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07 |.Tahoma..U..p...|
> 0100 00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66 |.X..wa..w @.wI6f|
> 0110 83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00 |.....-..........|
> 0120 00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00 |................|
> 0130 00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00 |................|
> 0140 00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21 |.....!...Hw.txt!|
> 0150 00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00 |................|
> 0160 00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65 |.........."Syste|
> 0170 6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00 |m..I6f.....&....|
> 0180 00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d |...............-|
> 0190 01 01 00 03 00 00 00 00 00 |.........|
> 0199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
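For anyone following the question of what the dumped bytes mean: the leading fields appear to line up with the OLE 1.0 embedded-object header described in Microsoft's [MS-OLEDS] specification (a 4-byte version, a 4-byte format ID, three length-prefixed ANSI strings, then a 4-byte native-data size). The following is a minimal Java sketch under that assumption; it is not the code in TIKA-1010.patch. The layout of the Package payload after the header (a 2-byte marker, a NUL-terminated label and path, two 2-byte fields, a length-prefixed copy of the path, then length-prefixed content) is inferred from this single dump only, and the names `ObjDataSketch`, `fromHex`, and `parse` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class ObjDataSketch {
    // Bytes 0x0000-0x0085 of the hex dump quoted in the issue; whitespace is ignored.
    static final String DUMP =
          "01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b "
        + "61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 "
        + "02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 "
        + "4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b "
        + "74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 "
        + "00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 "
        + "67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 "
        + "2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 "
        + "6f 72 6c 64 00 00";

    // Turns a hex string (as found in \objdata or a dump) into raw bytes.
    static byte[] fromHex(String hex) {
        String s = hex.replaceAll("\\s+", "");
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // 4-byte little-endian length, then that many ANSI bytes; a trailing NUL is dropped.
    static String lengthPrefixed(ByteBuffer b) {
        byte[] raw = new byte[b.getInt()];
        b.get(raw);
        int n = (raw.length > 0 && raw[raw.length - 1] == 0) ? raw.length - 1 : raw.length;
        return new String(raw, 0, n, StandardCharsets.ISO_8859_1);
    }

    // Reads ANSI bytes up to (and consuming) the next NUL.
    static String nulTerminated(ByteBuffer b) {
        StringBuilder sb = new StringBuilder();
        for (byte c = b.get(); c != 0; c = b.get()) {
            sb.append((char) (c & 0xff));
        }
        return sb.toString();
    }

    /** Returns {className, label, path, content}. */
    static String[] parse(byte[] data) {
        ByteBuffer b = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        b.getInt();                            // OLE version: 0x00000501 in the dump
        b.getInt();                            // format ID: 2 (embedded object, per MS-OLEDS)
        String className = lengthPrefixed(b);  // class name
        lengthPrefixed(b);                     // topic name (empty in the dump)
        lengthPrefixed(b);                     // item name (empty in the dump)
        b.getInt();                            // native data size: 0x66 in the dump
        // Everything below is the Package payload; layout inferred from this dump only.
        b.getShort();                          // 02 00
        String label = nulTerminated(b);
        String path = nulTerminated(b);
        b.getShort();                          // 00 00
        b.getShort();                          // 03 00
        lengthPrefixed(b);                     // the path again, this time length-prefixed
        byte[] content = new byte[b.getInt()];
        b.get(content);
        return new String[] {
            className, label, path, new String(content, StandardCharsets.ISO_8859_1)
        };
    }

    public static void main(String[] args) {
        for (String field : parse(fromHex(DUMP))) {
            System.out.println(field);
        }
    }
}
```

Run against the first 0x86 bytes of the dump, this prints the class name, the label, the original path, and the 11-byte file content. The bytes from offset 0x86 onward start a second header of the same shape (class name METAFILEPICT), which the sketch does not attempt to read.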
[jira] [Updated] (TIKA-1255) WordExtractor - bold hyperlink not closed properly
[ https://issues.apache.org/jira/browse/TIKA-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Hunter updated TIKA-1255:
---
Attachment: WordParserTest.java
            testWORD_strikethrough_hyperlink.doc
            testWORD_italic_hyperlink.doc
            testWORD_bold_hyperlink.doc
            WordExtractor.java

Suggested patch, test and test documents

> WordExtractor - bold hyperlink not closed properly
> ---
>
> Key: TIKA-1255
> URL: https://issues.apache.org/jira/browse/TIKA-1255
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2, 1.3, 1.4, 1.5
> Environment: Any
> Reporter: Alan Hunter
> Priority: Minor
> Attachments: WordExtractor.java, WordParserTest.java, example.doc,
> testWORD_bold_hyperlink.doc, testWORD_italic_hyperlink.doc,
> testWORD_strikethrough_hyperlink.doc
>
>
> If a Word document contains a bold hyperlink, the resulting xhtml is:
> href="http://www.testdomain.com/support/workcentre-7232-7242/file-download/enus.html?operatingSystem=macosx108&fileLanguage=en&contentId=126220&from=downloads&viewArchived=false";>Test
> link
> The closing bold and anchor tags are transposed, which isn't valid XHTML.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
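The transposition reported here is what one would expect when end tags are written in the order they are requested, not in the reverse of the order the elements were opened. One way to guard against that, sketched below in Java, is to keep a stack of open elements and close inner ones before the requested outer one. This is only an illustration of the nesting rule, not the attached WordExtractor.java patch (whose contents are not shown in this thread); `NestedTagSketch` and its methods are hypothetical names, and the shortened URL is a stand-in for the one in the report.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class NestedTagSketch {
    private final StringBuilder out = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();

    // Writes a start tag and remembers that the element is open.
    void start(String name, String attrs) {
        out.append('<').append(name);
        if (!attrs.isEmpty()) {
            out.append(' ').append(attrs);
        }
        out.append('>');
        open.push(name);
    }

    void text(String s) {
        out.append(s);
    }

    // Closes elements innermost-first until the named element has been closed.
    void end(String name) {
        while (!open.isEmpty()) {
            String top = open.pop();
            out.append("</").append(top).append('>');
            if (top.equals(name)) {
                return;
            }
        }
    }

    String result() {
        return out.toString();
    }

    // A bold hyperlink where the caller asks to close the anchor while bold is still open.
    static String demo() {
        NestedTagSketch w = new NestedTagSketch();
        w.start("a", "href=\"http://www.testdomain.com/\"");
        w.start("b", "");
        w.text("Test link");
        w.end("a"); // bold is closed first, then the anchor
        return w.result();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this approach the bold element is closed before the anchor even though only the anchor's end was requested, so the output stays well-formed. A real parser would also need to reopen bold after the link if the bold run continues, which this sketch leaves out.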
[jira] [Commented] (TIKA-1255) WordExtractor - bold hyperlink not closed properly
[ https://issues.apache.org/jira/browse/TIKA-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961914#comment-13961914 ]

Alan Hunter commented on TIKA-1255:
---

I have attached a suggested fix, test and test documents to improve the resilience of the Word parser when handling styled hyperlinks
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961821#comment-13961821 ]

Chris Bamford commented on TIKA-1010:
---

Thanks Tim, all compiles now. I think at least one test file is missing though (for RTFParserTest):
{code}
/test-documents/testRTFEmbeddedFiles.rtf
/test-documents/testRTFEmbeddedLink.rtf
{code}
I tried the "testRTF_embbededFiles.rtf" found in "testRTF_embbededFiles.zip" attached to this ticket as the former and that seems to work fine. However, I cannot find anything suitable for the latter. Please can you provide when you get a mo'? I'm digging into the code now to see how it works :-)
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1010:
---
Attachment: TIKA-1010.patch

Doh! With metadata class added...