[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
    [ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956928#comment-13956928 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

Absolutely, this is more of a question for the tika-users list.

One option is to implement EmbeddedResourceHandler and then call it with something like this (take care to do better error handling!):

{noformat}
// TrackingHandler is your own EmbeddedResourceHandler implementation.
TrackingHandler tracker = new TrackingHandler();
TikaInputStream tis = null;
try {
    ContainerExtractor ex = new ParserContainerExtractor();
    tis = TikaInputStream.get(inputstream);
    // The extractor is also passed as the recursing extractor.
    ex.extract(tis, ex, tracker);
} finally {
    // tis is still null if TikaInputStream.get() threw.
    if (tis != null) {
        tis.close();
    }
}
{noformat}

For a simple TrackingHandler, see AbstractPOIContainerExtractionTest in org.apache.tika.parser.microsoft (test/.../parsers).

I've been delayed on other projects. I'm wrapping up today and will post a rough patch tomorrow.
Re: Add Outlook/PST files to supported formats on the web site?
OK thanks!

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 1, 2014 at 7:46 AM, Hong-Thai Nguyen wrote:
> Yes, but only from 1.6: https://issues.apache.org/jira/browse/TIKA-623
> I'm finishing the work to return mails as extracted documents, as requested, but we'll have this format in 1.6.
>
> Hong-Thai
>
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Tuesday, April 1, 2014 13:42
> To: dev@tika.apache.org
> Subject: Add Outlook/PST files to supported formats on the web site?
>
> We only seem to list mbox (Unix) email format:
>
> https://tika.apache.org/1.5/formats.html
>
> But Tika can also extract messages from Outlook's PST files?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
RE: Add Outlook/PST files to supported formats on the web site?
Yes, but only from 1.6: https://issues.apache.org/jira/browse/TIKA-623
I'm finishing the work to return mails as extracted documents, as requested, but we'll have this format in 1.6.

Hong-Thai


-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, April 1, 2014 13:42
To: dev@tika.apache.org
Subject: Add Outlook/PST files to supported formats on the web site?

We only seem to list mbox (Unix) email format:

https://tika.apache.org/1.5/formats.html

But Tika can also extract messages from Outlook's PST files?

Mike McCandless

http://blog.mikemccandless.com
Add Outlook/PST files to supported formats on the web site?
We only seem to list mbox (Unix) email format:

https://tika.apache.org/1.5/formats.html

But Tika can also extract messages from Outlook's PST files?

Mike McCandless

http://blog.mikemccandless.com
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
    [ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956274#comment-13956274 ]

Chris Bamford commented on TIKA-1010:
-------------------------------------

Tim

A quick question - where do the extracted files get written? Can it be specified?

- Chris

> Embedded documents in RTF are not extracted
> -------------------------------------------
>
>                 Key: TIKA-1010
>                 URL: https://issues.apache.org/jira/browse/TIKA-1010
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Tim Allison
>         Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata
> 0105020008005061636b616765006600
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
> 5404bbfaee00080054044505
> 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
> 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 0000  01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b  |............Pack|
> 0010  61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00  |age.........f...|
> 0020  02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
> 0030  4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
> 0040  74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22  |top\HW.txt....."|
> 0050  00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
> 0060  67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
> 0070  2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
> 0080  6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00  |orld............|
> 0090  00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
> 00a0  04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45  |.............T.E|
> 00b0  05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c  |.........s......|
> 00c0  00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05  |................|
> 00d0  00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5  |.....2.)........|
> 00e0  ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00  |................|
> 00f0  00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
> 0100  00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
> 0110  83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00  |.....-..........|
> 0120  00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00  |................|
> 0130  00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00  |................|
> 0140  00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
> 0150  00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00  |................|
> 0160  00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65  |.........."Syste|
> 0170  6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
> 0180  00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d  |...............-|
> 0190  01 01 00 03 00 00 00 00 00                       |.........|
> 0199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!



--
This message was sent by Atlassian JIRA
(v6.2#6252)
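Reading the hex dump in the issue description by eye, the first 0x86 bytes look like a small little-endian header followed by the wrapped file. The Java sketch below only illustrates that reading and is not Tika's parser. The field layout is an assumption inferred from this single sample: two 32-bit values, three length-prefixed strings, a 32-bit size, a 2-byte marker, two NUL-terminated strings, four bytes of unknown meaning, a length-prefixed path, then a length-prefixed payload. The names PackageObjectSketch, Embedded, SAMPLE_HEX, fromHex and parse are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class PackageObjectSketch {

    /** Bytes 0x0000-0x0085 of the hex dump in the issue description. */
    static final String SAMPLE_HEX =
            "0105000002000000080000005061636b"
          + "61676500000000000000000066000000"
          + "020048772e74787400433a5c444f4355"
          + "4d457e315c6967616c73685c4465736b"
          + "746f705c48572e747874000000030022"
          + "000000433a5c444f43554d457e315c69"
          + "67616c73685c4465736b746f705c4857"
          + "2e747874000b00000048656c6c6f2057"
          + "6f726c640000";

    /** The two fields this sketch extracts: display label and file content. */
    static final class Embedded {
        final String label;
        final byte[] data;

        Embedded(String label, byte[] data) {
            this.label = label;
            this.data = data;
        }
    }

    /** Converts the hex text found in \objdata into raw bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads a 32-bit length, then that many bytes; drops one trailing NUL. */
    static String readLengthPrefixed(ByteBuffer buf) {
        int len = buf.getInt();
        if (len < 0 || len > buf.remaining()) {
            throw new IllegalArgumentException("bad string length: " + len);
        }
        byte[] raw = new byte[len];
        buf.get(raw);
        int n = (len > 0 && raw[len - 1] == 0) ? len - 1 : len;
        return new String(raw, 0, n, StandardCharsets.US_ASCII);
    }

    /** Reads bytes up to, and consuming, the next NUL. */
    static String readNulTerminated(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        for (byte b = buf.get(); b != 0; b = buf.get()) {
            sb.append((char) (b & 0xff));
        }
        return sb.toString();
    }

    /** Walks the assumed layout and returns the wrapped file. */
    static Embedded parse(byte[] objdata) {
        ByteBuffer buf = ByteBuffer.wrap(objdata).order(ByteOrder.LITTLE_ENDIAN);
        buf.getInt();                               // 01 05 00 00 in the sample
        buf.getInt();                               // 02 00 00 00 in the sample
        String className = readLengthPrefixed(buf); // "Package"
        readLengthPrefixed(buf);                    // empty in the sample
        readLengthPrefixed(buf);                    // empty in the sample
        if (!"Package".equals(className)) {
            throw new IllegalArgumentException("not a Package object: " + className);
        }
        int nativeSize = buf.getInt();              // 0x66 in the sample
        if (nativeSize < 0 || nativeSize > buf.remaining()) {
            throw new IllegalArgumentException("bad native data size: " + nativeSize);
        }
        buf.getShort();                             // 02 00, meaning unknown
        String label = readNulTerminated(buf);      // "Hw.txt"
        readNulTerminated(buf);                     // original file path
        buf.getInt();                               // 00 00 03 00, meaning unknown
        readLengthPrefixed(buf);                    // the path again
        int dataLen = buf.getInt();                 // 0x0b in the sample
        if (dataLen < 0 || dataLen > buf.remaining()) {
            throw new IllegalArgumentException("bad payload length: " + dataLen);
        }
        byte[] data = new byte[dataLen];
        buf.get(data);
        return new Embedded(label, data);
    }

    public static void main(String[] args) {
        Embedded e = parse(fromHex(SAMPLE_HEX));
        System.out.println(e.label + " -> "
                + new String(e.data, StandardCharsets.US_ASCII));
    }
}
```

Run against the sample, this prints "Hw.txt -> Hello World". The second block that starts at offset 0x86 (class name METAFILEPICT) is not parsed here, and other Package objects may not follow the same layout.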