[jira] [Updated] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-623:
--
Assignee: (was: Hong-Thai Nguyen)

Add support for Outlook PST
---
Key: TIKA-623
URL: https://issues.apache.org/jira/browse/TIKA-623
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Tran Nam Quang
Fix For: 1.6
Attachments: OutlookPSTParser.java

Hello everyone,

As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here:
http://code.google.com/p/java-libpst/

I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika.

Best regards
Tran Nam Quang

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959978#comment-13959978 ]

Chris Bamford commented on TIKA-1010:
-
Hi Tim

Am about to play with the patch - which version of Tika does it apply to? Presumably I download that and then apply the patch?

Thanks

- Chris

Embedded documents in RTF are not extracted
---
Key: TIKA-1010
URL: https://issues.apache.org/jira/browse/TIKA-1010
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
Attachments: ExampleRTFs.zip, TIKA-1010_patch.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip

When an RTF doc embeds a doc it looks like this:

{noformat}
{\object\objemb \objw628\objh765{\*\objclass Package}{\*\objdata 0105020008005061636b616765006600
020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
5404bbfaee00080054044505
01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
{noformat}

But, unfortunately, the format of those hex bytes is not spelled out in the RTF spec ... the spec merely says the bytes are saved by the OLESaveToStream function ... and I haven't been able to find a description of what the bytes mean.

In this case they are a Package object (\objclass Package), which I think is an [old?] way to wrap any non-OLE file (this is just a .txt file).
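Whatever the bytes turn out to mean, the \objdata payload is stored as hexadecimal text, so it has to be turned back into raw bytes before anything else can be done with it. The following is only a minimal sketch of that step, not code from the attached patch; the class name ObjDataHex and the method decode are made up for illustration, and it assumes whitespace inside the payload can simply be skipped.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ObjDataHex {

    /** Decodes hex text into bytes, skipping whitespace between digits. */
    static byte[] decode(String hex) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 when there is none
        for (int i = 0; i < hex.length(); i++) {
            char c = hex.charAt(i);
            if (Character.isWhitespace(c)) {
                continue; // assumption: line breaks and spaces carry no data
            }
            int nibble = Character.digit(c, 16);
            if (nibble < 0) {
                throw new IllegalArgumentException("not a hex digit: " + c);
            }
            if (high < 0) {
                high = nibble;
            } else {
                out.write((high << 4) | nibble);
                high = -1;
            }
        }
        if (high >= 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "5061636b616765" is the class-name fragment visible in the payload above.
        byte[] bytes = decode("5061636b 616765");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints "Package"
    }
}
```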
Here's the hex dump:

{noformat}
0000 01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b |............Pack|
0010 61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 |age.........f...|
0020 02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 |..Hw.txt.C:\DOCU|
0030 4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b |ME~1\igalsh\Desk|
0040 74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 |top\HW.txt....."|
0050 00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 |...C:\DOCUME~1\i|
0060 67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 |galsh\Desktop\HW|
0070 2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 |.txt.....Hello W|
0080 6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00 |orld............|
0090 00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54 |..METAFILEPICT.T|
00a0 04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45 |.............T.E|
00b0 05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c |.........s......|
00c0 00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05 |................|
00d0 00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5 |.....2.)........|
00e0 ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00 |................|
00f0 00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07 |.Tahoma..U..p...|
0100 00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66 |.X..wa..w @.wI6f|
0110 83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00 |.....-..........|
0120 00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00 |................|
0130 00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00 |................|
0140 00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21 |.....!...Hw.txt!|
0150 00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00 |................|
0160 00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65 |.........."Syste|
0170 6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00 |m..I6f.....&....|
0180 00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d |...............-|
0190 01 01 00 03 00 00 00 00 00                      |.........|
0199
{noformat}

Anyway I have no idea how to decode the bytes at this point ... just opening the issue in case anyone else does!
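One possible reading of the first 0x20 bytes of the dump, offered as an assumption rather than anything confirmed in this thread: they look like the OLE 1.0 object header described in Microsoft's MS-OLEDS document, i.e. a little-endian version, a format ID, three length-prefixed ANSI strings (class, topic, item), and then a 4-byte native data size. That reading is at least self-consistent with the dump, since 0x20 + 0x66 = 0x86, which is exactly where a second "01 05 00 00" header (the METAFILEPICT one) begins. The sketch below parses just that header; the class Ole1HeaderSketch and its field names are hypothetical and are not part of Tika or of the attached patch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class Ole1HeaderSketch {

    // Bytes 0x0000-0x001f copied from the hex dump above.
    static final byte[] HEADER = {
        0x01, 0x05, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00,
        0x08, 0x00, 0x00, 0x00, 0x50, 0x61, 0x63, 0x6b,
        0x61, 0x67, 0x65, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x66, 0x00, 0x00, 0x00
    };

    /** Hypothetical holder for the assumed header fields. */
    static final class Header {
        int version;
        int formatId;
        String className;
        String topicName;
        String itemName;
        int nativeDataSize;
    }

    /** Reads a 4-byte length followed by that many bytes, dropping a trailing NUL. */
    static String readString(ByteBuffer buf) {
        int len = buf.getInt();
        if (len < 0 || len > buf.remaining()) {
            throw new IllegalArgumentException("bad string length: " + len);
        }
        byte[] raw = new byte[len];
        buf.get(raw);
        int n = (len > 0 && raw[len - 1] == 0) ? len - 1 : len;
        return new String(raw, 0, n, StandardCharsets.US_ASCII);
    }

    /** Parses the assumed layout: version, format ID, three strings, data size. */
    static Header parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        Header h = new Header();
        h.version = buf.getInt();
        h.formatId = buf.getInt();
        h.className = readString(buf);
        h.topicName = readString(buf);
        h.itemName = readString(buf);
        h.nativeDataSize = buf.getInt();
        return h;
    }

    public static void main(String[] args) {
        Header h = parse(HEADER);
        // prints: version=0x00000501 formatId=2 class=Package nativeDataSize=102
        System.out.printf("version=0x%08x formatId=%d class=%s nativeDataSize=%d%n",
                h.version, h.formatId, h.className, h.nativeDataSize);
    }
}
```

If the assumption holds, the 0x66 bytes starting at offset 0x20 would be the Package payload itself (file name, original path and the "Hello World" content are all visible there), but the layout of that inner payload is not covered by this sketch.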
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959988#comment-13959988 ]

Tim Allison commented on TIKA-1010:
---
trunk

svn co http://svn.apache.org/repos/asf/tika/trunk

Embedded documents in RTF are not extracted
---
Key: TIKA-1010
URL: https://issues.apache.org/jira/browse/TIKA-1010
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
Attachments: ExampleRTFs.zip, TIKA-1010_patch.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip