[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956928#comment-13956928
 ] 

Tim Allison commented on TIKA-1010:
---

Absolutely, this is more of a question for the tika-users list.  One option is 
to implement EmbeddedResourceHandler and then call it with something like this 
(take care to do better error handling!):

{noformat}
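import org.apache.tika.extractor.ContainerExtractor;
import org.apache.tika.extractor.ParserContainerExtractor;
import org.apache.tika.io.TikaInputStream;

TrackingHandler tracker = new TrackingHandler();
TikaInputStream tis = null;
try {
    ContainerExtractor ex = new ParserContainerExtractor();
    tis = TikaInputStream.get(inputstream);
    // The extractor is also passed as the recursion extractor for nested containers.
    ex.extract(tis, ex, tracker);
} finally {
    // tis is still null if TikaInputStream.get(...) threw.
    if (tis != null) {
        tis.close();
    }
}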
{noformat}

For a simple TrackingHandler, see AbstractPOIContainerExtractionTest in 
org.apache.tika.parser.microsoft (test/.../parsers).
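
Since the handler receives each embedded stream, it is the handler that decides where (or whether) anything gets written. A minimal sketch of that idea follows. It is not code from Tika: `ResourceHandler` is a local stand-in for Tika's EmbeddedResourceHandler (which, as far as I know, takes a MediaType rather than a String) so that the example compiles without Tika on the classpath, and `SavingHandler` and its fallback naming scheme are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Stand-in for Tika's EmbeddedResourceHandler (assumption: same three-argument shape).
interface ResourceHandler {
    void handle(String filename, String mediaType, InputStream stream);
}

// Hypothetical handler that saves every embedded resource under a caller-chosen directory.
final class SavingHandler implements ResourceHandler {
    private final Path outputDir;
    final List<Path> written = new ArrayList<>();
    private int counter = 0;

    SavingHandler(Path outputDir) {
        this.outputDir = outputDir;
    }

    @Override
    public void handle(String filename, String mediaType, InputStream stream) {
        // Embedded names may be null or carry a full path (C:\...\HW.txt); keep the last segment only.
        String base = filename == null ? "" : filename.replaceAll("^.*[\\\\/]", "");
        boolean unusable = base.isEmpty() || base.equals(".") || base.equals("..");
        String name = unusable ? "embedded-" + counter : base;
        counter++;
        Path target = outputDir.resolve(name);
        try {
            Files.createDirectories(outputDir);
            Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
            written.add(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the real interface, an instance of such a handler would take the place of `tracker` in the call above; the output directory is whatever the caller passes to the constructor.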

I've been delayed on other projects.  I'm wrapping up today and will post a 
rough patch tomorrow.


Re: Add Outlook/PST files to supported formats on the web site?

2014-04-01 Thread Michael McCandless
OK thanks!

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 1, 2014 at 7:46 AM, Hong-Thai Nguyen wrote:
> Yes, but only from 1.6: https://issues.apache.org/jira/browse/TIKA-623
> I'm still finishing the work to return mails as extracted documents, as 
> requested, but we'll have this format in 1.6.
>
> Hong-Thai
>
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Tuesday, April 1, 2014 13:42
> To: dev@tika.apache.org
> Subject: Add Outlook/PST files to supported formats on the web site?
>
> We only seem to list the mbox (Unix) email format:
>
> https://tika.apache.org/1.5/formats.html
>
> But Tika can also extract messages from Outlook's PST files?
>
> Mike McCandless
>
> http://blog.mikemccandless.com


RE: Add Outlook/PST files to supported formats on the web site?

2014-04-01 Thread Hong-Thai Nguyen
Yes, but only from 1.6: https://issues.apache.org/jira/browse/TIKA-623
I'm still finishing the work to return mails as extracted documents, as 
requested, but we'll have this format in 1.6.

Hong-Thai


-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Tuesday, April 1, 2014 13:42
To: dev@tika.apache.org
Subject: Add Outlook/PST files to supported formats on the web site?

We only seem to list the mbox (Unix) email format:

https://tika.apache.org/1.5/formats.html

But Tika can also extract messages from Outlook's PST files?

Mike McCandless

http://blog.mikemccandless.com


Add Outlook/PST files to supported formats on the web site?

2014-04-01 Thread Michael McCandless
We only seem to list the mbox (Unix) email format:

https://tika.apache.org/1.5/formats.html

But Tika can also extract messages from Outlook's PST files?

Mike McCandless

http://blog.mikemccandless.com


[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-01 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956274#comment-13956274
 ] 

Chris Bamford commented on TIKA-1010:
-

Tim
A quick question - where do the extracted files get written?  Can it be 
specified?
- Chris

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, 
> testRTF_embbededFiles.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata 
> 0105000002000000080000005061636b61676500000000000000000066000000
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000000030022000000433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b00000048656c6c6f20576f726c64000001050000050000000d0000004d45544146494c455049435400
> 54040000bbfaffffee0000000800540445050000
> 0100090000037300000002001c0000000000050000000b0200000000050000000c02320029001c000000fb02f5ff000000000000900100000001000000005461686f6d61000055170a7000fc070058b1f37761b1f3772040f57749366683040000002d01000005000000090200000000050000000102ffffff0005000000
> 020101000000050000002e0106000000090000002105060048772e747874210015001c000000fb021000070000000000bc02000000000102022253797374656d00004936668300000a0026008a0100000000ffffffff8cfc0700040000002d010100030000000000}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 0000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
> 0010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
> 0020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
> 0030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
> 0040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
> 0050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
> 0060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
> 0070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
> 0080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
> 0090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
> 00a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
> 00b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
> 00c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
> 00d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
> 00e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
> 00f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
> 0100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
> 0110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
> 0120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
> 0130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
> 0140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
> 0150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
> 0160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
> 0170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
> 0180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
> 0190  01 01 00 03 00 00 00 00  00                       |.........|
> 0199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!
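
One possible reading of the first 0x86 bytes of that dump, sketched in Java. The header part (version, format id, three length-prefixed strings, native data size) follows the OLE 1.0 embedded-object layout as I understand it from MS-OLEDS. The layout of the Package payload itself (a 2-byte marker, two NUL-terminated strings, a 4-byte field, a length-prefixed path, then length-prefixed content) is only inferred from this one sample, so treat the field meanings as assumptions. The class name `Ole1PackageSketch` and its helpers are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class Ole1PackageSketch {
    // Bytes 0x0000-0x0085 of the dump above: OLE 1.0 header plus the Package payload.
    static final String HEX =
          "0105000002000000080000005061636b" + "61676500000000000000000066000000"
        + "020048772e74787400433a5c444f4355" + "4d457e315c6967616c73685c4465736b"
        + "746f705c48572e747874000000030022" + "000000433a5c444f43554d457e315c69"
        + "67616c73685c4465736b746f705c4857" + "2e747874000b00000048656c6c6f2057"
        + "6f726c640000";

    static byte[] unhex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // 4-byte little-endian length (counting the trailing NUL), then the ANSI bytes.
    static String lengthPrefixed(ByteBuffer b) {
        int len = b.getInt();
        if (len == 0) {
            return "";
        }
        byte[] raw = new byte[len];
        b.get(raw);
        return new String(raw, 0, len - 1, StandardCharsets.ISO_8859_1);
    }

    static String nulTerminated(ByteBuffer b) {
        StringBuilder sb = new StringBuilder();
        for (byte c = b.get(); c != 0; c = b.get()) {
            sb.append((char) (c & 0xff));
        }
        return sb.toString();
    }

    /** Returns {class name, label, original path, embedded content}. */
    static String[] parse(byte[] data) {
        ByteBuffer b = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        b.getInt();                            // OLE version (01 05 00 00 here)
        b.getInt();                            // format id (2 = embedded object)
        String className = lengthPrefixed(b);  // "Package"
        lengthPrefixed(b);                     // topic name (empty here)
        lengthPrefixed(b);                     // item name (empty here)
        b.getInt();                            // native data size (0x66 here)
        b.getShort();                          // 02 00: meaning assumed, skipped
        String label = nulTerminated(b);
        String originalPath = nulTerminated(b);
        b.getInt();                            // 00 00 03 00: meaning assumed, skipped
        lengthPrefixed(b);                     // second copy of the path
        byte[] content = new byte[b.getInt()];
        b.get(content);
        return new String[] {className, label, originalPath,
                new String(content, StandardCharsets.ISO_8859_1)};
    }

    public static void main(String[] args) {
        for (String field : parse(unhex(HEX))) {
            System.out.println(field);
        }
    }
}
```

Under these assumptions the embedded file's bytes sit at offset 0x79 and are exactly the 11 bytes of "Hello World"; the METAFILEPICT block that follows would be the icon presentation rather than file content.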



--
This message was sent by Atlassian JIRA
(v6.2#6252)