[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962059#comment-13962059
 ] 

Tim Allison edited comment on TIKA-1010 at 4/7/14 5:54 PM:
---

Hmmm... Sorry, I should have been clearer... In the last patch update, I only 
included the code patch.  To get the test files go back to the zip file from 
April 3, and there should be three test files:

{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}

As a triple check, I applied the patch from 4/7 to a fresh check out from 
trunk, and I put those three files in test-documents.  I had a build success 
(with embarrassing println left in RTFObjDataParser...the horror!...did I 
mention clean up and a few optimizations remain?).




was (Author: talli...@mitre.org):
Hmmm... Sorry, I should have been clearer... In the last patch update, I only 
included the code patch.  To get the test files go back to the zip file from 
April 3, and there should be three test files:

{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}

As a triple check, I applied the patch from 4/7 to a fresh check out from 
trunk, and I put those three files in test-documents.  I had a build success 
(with embarrassing println left in RTFObjDataParser...the horror!...did I 
mention clean up and a few optimizations remain?).



> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
>Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, TIKA-1010.patch, TIKA-1010_patch.zip, 
> outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata 
> 0105020008005061636b616765006600
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
> 5404bbfaee00080054044505
> 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
> 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 0000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
> 0010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
> 0020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
> 0030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
> 0040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
> 0050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
> 0060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
> 0070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
> 0080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
> 0090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
> 00a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
> 00b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
> 00c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
> 00d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
> 00e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
> 00f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
> 0100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
> 0110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
> 0120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
> 0130  00 02 01 01 00 00 00 05  00 00 

[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962059#comment-13962059
 ] 

Tim Allison edited comment on TIKA-1010 at 4/7/14 5:51 PM:
---

Hmmm... Sorry, I should have been clearer... In the last patch update, I only 
included the code patch.  To get the test files go back to the zip file from 
April 3, and there should be three test files:

{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}

As a triple check, I applied the patch from 4/7 to a fresh check out from 
trunk, and I put those three files in test-documents.  I had a build success 
(with embarrassing println left in RTFObjDataParser...the horror!...did I 
mention clean up and a few optimizations remain?).




was (Author: talli...@mitre.org):
Hmmm... In the zip file from April 3, there should be three test files:

{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}

As a triple check, I applied the patch from 4/7 to a fresh check out from 
trunk, and I put those three files in test-documents.  I had a build success 
(with embarrassing println left in RTFObjDataParser...the horror!...did I 
mention clean up and a few optimizations remain?).




[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962059#comment-13962059
 ] 

Tim Allison commented on TIKA-1010:
---

Hmmm... In the zip file from April 3, there should be three test files:

{noformat}
testRTFEmbeddedFiles.rtf
testRTFEmbeddedLink.rtf
testRTFRegularImages.rtf
{noformat}

As a triple check, I applied the patch from 4/7 to a fresh check out from 
trunk, and I put those three files in test-documents.  I had a build success 
(with embarrassing println left in RTFObjDataParser...the horror!...did I 
mention clean up and a few optimizations remain?).



> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
>Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, TIKA-1010.patch, TIKA-1010_patch.zip, 
> outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata 
> 0105020008005061636b616765006600
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
> 5404bbfaee00080054044505
> 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
> 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 0000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
> 0010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
> 0020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
> 0030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
> 0040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
> 0050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
> 0060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
> 0070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
> 0080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
> 0090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
> 00a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
> 00b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
> 00c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
> 00d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
> 00e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
> 00f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
> 0100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
> 0110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
> 0120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
> 0130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
> 0140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
> 0150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
> 0160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
> 0170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
> 0180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
> 0190  01 01 00 03 00 00 00 00  00                       |.........|
> 0199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!
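
For anyone following along, the start of the hex dump quoted above appears to line up with the OLE 1.0 embedded-object layout: a version, a format id, three length-prefixed strings (class, topic, item), then a size followed by the native data. The sketch below is only an illustration under that assumption. It is not the code in the TIKA-1010 patch, and the function and field names (read_u32, read_lp_string, parse_ole1_header, "native_data" and so on) are made up for this example. The input is the first 0x86 bytes of the dump, written as hex text the way \objdata stores it.

```python
import struct

# First 0x86 bytes of the hex dump, one dump row per string.
OBJDATA_HEX = "".join([
    "0105000002000000080000005061636b",  # version, format id, name length, "Pack"
    "61676500000000000000000066000000",  # "age\0", two empty strings, data size
    "020048772e74787400433a5c444f4355",  # native data starts here (offset 0x20)
    "4d457e315c6967616c73685c4465736b",
    "746f705c48572e747874000000030022",
    "000000433a5c444f43554d457e315c69",
    "67616c73685c4465736b746f705c4857",
    "2e747874000b00000048656c6c6f2057",
    "6f726c640000",
])


def read_u32(buf, pos):
    """Read a little-endian unsigned 32-bit integer; return (value, new_pos)."""
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4


def read_lp_string(buf, pos):
    """Read a 4-byte length, then that many bytes; strip the trailing NUL."""
    length, pos = read_u32(buf, pos)
    raw = buf[pos:pos + length]
    return raw.rstrip(b"\x00").decode("ascii"), pos + length


def parse_ole1_header(buf):
    """Split the assumed header fields from the native data (hypothetical names)."""
    pos = 0
    version, pos = read_u32(buf, pos)
    format_id, pos = read_u32(buf, pos)
    class_name, pos = read_lp_string(buf, pos)
    topic_name, pos = read_lp_string(buf, pos)
    item_name, pos = read_lp_string(buf, pos)
    native_size, pos = read_u32(buf, pos)
    return {
        "version": version,
        "format_id": format_id,
        "class_name": class_name,
        "topic_name": topic_name,
        "item_name": item_name,
        "native_size": native_size,
        "native_data": buf[pos:pos + native_size],
    }


header = parse_ole1_header(bytes.fromhex(OBJDATA_HEX))
print(header["class_name"], hex(header["native_size"]))
```

Under this reading the native data is 0x66 bytes long and ends right before the second 01 05 00 00 marker at offset 0x86, where the METAFILEPICT part begins. The file name, the two paths and the "Hello World" payload all sit inside that native data block.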



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1255) WordExtractor - bold hyperlink not closed properly

2014-04-07 Thread Alan Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Hunter updated TIKA-1255:
--

Attachment: WordParserTest.java
testWORD_strikethrough_hyperlink.doc
testWORD_italic_hyperlink.doc
testWORD_bold_hyperlink.doc
WordExtractor.java

Suggested patch, test and test documents

> WordExtractor - bold hyperlink not closed properly
> --
>
> Key: TIKA-1255
> URL: https://issues.apache.org/jira/browse/TIKA-1255
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2, 1.3, 1.4, 1.5
> Environment: Any
>Reporter: Alan Hunter
>Priority: Minor
> Attachments: WordExtractor.java, WordParserTest.java, example.doc, 
> testWORD_bold_hyperlink.doc, testWORD_italic_hyperlink.doc, 
> testWORD_strikethrough_hyperlink.doc
>
>
> If a Word document contains a bold hyperlink, the resulting xhtml is:
>  href="http://www.testdomain.com/support/workcentre-7232-7242/file-download/enus.html?operatingSystem=macosx108&fileLanguage=en&contentId=126220&from=downloads&viewArchived=false";>Test
>  link
> The closing bold and anchor tags are transposed, which isn't valid XHTML.
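
To illustrate what "transposed" means here, below is a small sketch of closing tags in the reverse of the order they were opened, with a well-formedness check on the result. It is not the code from the attached WordExtractor.java. The helper names (wrap, is_well_formed), the example.com URL, and the assumption that the bold element is opened before the anchor are all made up for the example.

```python
import xml.etree.ElementTree as ET


def wrap(text, tags):
    """Wrap text in tags listed outermost first, closing them in reverse order."""
    opening = "".join(f"<{name}{attrs}>" for name, attrs in tags)
    closing = "".join(f"</{name}>" for name, _ in reversed(tags))
    return opening + text + closing


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hypothetical link target; the closing tags mirror the opening order.
good = wrap("Test link", [("b", ""), ("a", ' href="http://example.com/"')])

# Same elements with the two closing tags swapped, as described in the issue.
bad = '<b><a href="http://example.com/">Test link</b></a>'

print(good, is_well_formed(good), is_well_formed(bad))
```

Keeping the open elements on a stack and popping them on close is one simple way to guarantee the nesting stays valid no matter how many styles are applied to a hyperlink.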





[jira] [Commented] (TIKA-1255) WordExtractor - bold hyperlink not closed properly

2014-04-07 Thread Alan Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961914#comment-13961914
 ] 

Alan Hunter commented on TIKA-1255:
---

I have attached a suggested fix, test and test documents to improve the 
resilience of the Word parser when handling styled hyperlinks



[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-07 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961821#comment-13961821
 ] 

Chris Bamford commented on TIKA-1010:
-

Thanks Tim, all compiles now.  
I think at least one test file is missing though (for RTFParserTest):
{code}
/test-documents/testRTFEmbeddedFiles.rtf
/test-documents/testRTFEmbeddedLink.rtf
{code}
I tried the "testRTF_embbededFiles.rtf" found in "testRTF_embbededFiles.zip" 
attached to this ticket as the former and that seems to work fine.  However, I 
cannot find anything suitable for the latter.
Please can you provide when you get a mo'?

I'm digging into the code now to see how it works  :-)



[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-07 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1010:
--

Attachment: TIKA-1010.patch

Doh!  With metadata class added...
