[jira] [Updated] (TIKA-623) Add support for Outlook PST

2014-04-04 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-623:
----------------------------------

Assignee: (was: Hong-Thai Nguyen)

 Add support for Outlook PST
 ---------------------------

         Key: TIKA-623
         URL: https://issues.apache.org/jira/browse/TIKA-623
     Project: Tika
  Issue Type: New Feature
  Components: parser
    Reporter: Tran Nam Quang
     Fix For: 1.6
 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-04 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959978#comment-13959978
 ] 

Chris Bamford commented on TIKA-1010:
-------------------------------------

Hi Tim

Am about to play with the patch - which version of Tika does it apply to?  
Presumably I download that and then apply the patch?

Thanks

- Chris

 Embedded documents in RTF are not extracted
 -------------------------------------------

         Key: TIKA-1010
         URL: https://issues.apache.org/jira/browse/TIKA-1010
     Project: Tika
  Issue Type: Bug
  Components: parser
    Reporter: Michael McCandless
    Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, TIKA-1010_patch.zip, outer.rtf,
              testRTFRegularImages.rtf, testRTF_embbededFiles.zip


 When an RTF doc embeds a doc it looks like this:
 {noformat}
 {\object\objemb
 \objw628\objh765{\*\objclass Package}{\*\objdata 
 0105000002000000080000005061636b61676500000000000000000066000000
 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000000030022000000433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b00000048656c6c6f20576f726c64000001050000050000000d0000004d45544146494c455049435400
 54040000bbfaffffee0000000800540445050000
 0100090000037300000002001c0000000000050000000b0200000000050000000c02320029001c000000fb02f5ff000000000000900100000001000000005461686f6d61000055170a7000fc070058b1f37761b1f3772040f57749366683040000002d01000005000000090200000000050000000102ffffff0005000000
 020101000000050000002e0106000000090000002105060048772e747874210015001c000000fb021000070000000000bc02000000000102022253797374656d00004936668300000a0026008a0100000000ffffffff8cfc0700040000002d010100030000000000}
 {noformat}
 But, unfortunately, the format of those hex bytes is not spelled out
 in the RTF spec ... the spec merely says the bytes are saved by the
 OLESaveToStream function ... and I haven't been able to find a
 description of what the bytes mean.
 In this case they are a Package object (\objclass Package), which I
 think is an [old?] way to wrap any non-OLE file (this is just a .txt
 file).
 Here's the hex dump:
 {noformat}
 00000000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
 00000010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
 00000020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
 00000030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
 00000040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
 00000050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
 00000060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
 00000070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
 00000080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
 00000090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
 000000a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
 000000b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
 000000c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
 000000d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
 000000e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
 000000f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
 00000100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
 00000110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
 00000120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
 00000130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
 00000140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
 00000150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
 00000160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
 00000170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
 00000180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
 00000190  01 01 00 03 00 00 00 00  00                       |.........|
 00000199
 {noformat}
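
 A dump like the one above can be regenerated from the \objdata hex digits.
 The following is a minimal Python sketch, not part of the original report:
 the function names are made up for illustration, and SAMPLE holds only the
 first 32 bytes of the object as they appear in the dump. The output layout
 imitates the "hexdump -C" style.

```python
# Sketch only: helper names are hypothetical, SAMPLE is the first 32 bytes
# of the embedded object as listed in the hex dump.
SAMPLE = """
0105000002000000080000005061636b
61676500000000000000000066000000
"""


def objdata_to_bytes(hex_text: str) -> bytes:
    """Decode objdata hex digits, ignoring whitespace and line breaks."""
    return bytes.fromhex("".join(hex_text.split()))


def hexdump(data: bytes) -> str:
    """Render bytes as offset, two groups of eight hex bytes, and ASCII."""
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        left = " ".join(f"{b:02x}" for b in chunk[:8])
        right = " ".join(f"{b:02x}" for b in chunk[8:])
        # Non-printable bytes are shown as '.', printable ASCII as-is.
        text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:08x}  {left:<23}  {right:<23}  |{text}|")
    # Final line carries the total length, as hexdump -C does.
    lines.append(f"{len(data):08x}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(hexdump(objdata_to_bytes(SAMPLE)))
```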
 Anyway I have no idea how to decode the bytes at this point ... just
 opening the issue in case anyone else does!
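
 One possible reading of the leading bytes, offered as an assumption rather
 than a confirmed answer: they resemble the OLE 1.0 embedded-object header
 described in Microsoft's [MS-OLEDS] document, that is, little-endian 32-bit
 integers and length-prefixed ANSI strings. The Python sketch below reads the
 first 32 bytes of the dump under that assumption. The field names are
 paraphrased, the helper names are hypothetical, and nothing here claims to
 reflect how the attached patch works.

```python
import struct

# First 32 bytes of the object, copied from the hex dump.
HEADER = bytes.fromhex(
    "0105000002000000080000005061636b"
    "61676500000000000000000066000000"
)


def read_u32(buf: bytes, pos: int):
    """Read a little-endian unsigned 32-bit integer; return (value, next_pos)."""
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4


def read_lp_string(buf: bytes, pos: int):
    """Read a uint32 length followed by that many bytes of NUL-terminated text."""
    length, pos = read_u32(buf, pos)
    raw = buf[pos:pos + length]
    return raw.rstrip(b"\x00").decode("latin-1"), pos + length


def parse_ole1_header(buf: bytes) -> dict:
    """Assumed layout: version, format id, three strings, native data size."""
    ole_version, pos = read_u32(buf, 0)
    format_id, pos = read_u32(buf, pos)
    class_name, pos = read_lp_string(buf, pos)
    topic_name, pos = read_lp_string(buf, pos)
    item_name, pos = read_lp_string(buf, pos)
    native_data_size, pos = read_u32(buf, pos)
    return {
        "ole_version": ole_version,
        "format_id": format_id,
        "class_name": class_name,
        "topic_name": topic_name,
        "item_name": item_name,
        "native_data_size": native_data_size,
        "data_offset": pos,  # where the native data would begin
    }


if __name__ == "__main__":
    for key, value in parse_ole1_header(HEADER).items():
        print(f"{key}: {value!r}")
```

 For this sample the class name comes out as "Package" and the data size as
 0x66. Since 0x20 + 0x66 = 0x86, the native data would end exactly where the
 second "01 05 00 00" marker begins in the dump, which is at least consistent
 with this reading.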





[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959988#comment-13959988
 ] 

Tim Allison commented on TIKA-1010:
-----------------------------------

The patch applies to trunk:

svn co http://svn.apache.org/repos/asf/tika/trunk