[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272497#comment-14272497 ]

Tim Allison commented on TIKA-623:

Gah! Of course. Sorry and thank you. Should we modify the PSTParser so that it can take an EmbeddedParserDecorator? Inner class parser that would grab the mail object from ParseContext instead of handling the InputStream?

Add support for Outlook PST

Key: TIKA-623
URL: https://issues.apache.org/jira/browse/TIKA-623
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Tran Nam Quang
Fix For: 1.6
Attachments: OutlookPSTParser.java

Hello everyone,

As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/

I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika.

Best regards
Tran Nam Quang

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272297#comment-14272297 ]

Luis Filipe Nassif commented on TIKA-623:

I think OutlookPSTParser currently does not extract .msg files, because they do not exist inside a PST: mails are stored broken into several pieces. Looking at the source, it seems to extract and process raw-text mail bodies and attachments, even if you set up the parsing to recurse down only one level. To recover the relationship between a mail and its attachments, I think you currently need to monitor the handler output. The parser could be improved to set a parent mail ID in the metadata of each attachment, and vice versa, to make the relationships easier to recover.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271534#comment-14271534 ]

Luis Filipe Nassif commented on TIKA-623:

Maybe OutlookPSTParserTest can help: http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mbox/OutlookPSTParserTest.java

ParsingEmbeddedDocumentExtractor simply appends the contents of all mails together, so I think the hits will point to the PST file. You could override the parseEmbedded(...) method to extract individual mails and process (index) them separately, but I do not know how to do this with Solr.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271773#comment-14271773 ]

Tim Allison commented on TIKA-623:

[~lfcnassif]'s suggestion is the cleanest way to go down only one level, i.e. to process each .msg file individually.

You could use Tika app's -z | --extract feature to extract all attachments before ingesting into Solr. That would be a preprocessing step before running Solr's DIH. One problem with that approach is that embedded docs within an .msg file will be extracted into separate files.

Another option, if you wanted to work on this programmatically, would be to send a custom EmbeddedDocumentExtractor or a ParserDecorator via ParseContext. You'd have to be careful to ensure that it only goes down one level. The default behavior would be to run that extractor/decorator against all embedded documents individually, including attachments to .msg files, which you may or may not want.

Take a look at FileEmbeddedDocumentExtractor (http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup) or MyEmbeddedDocumentExtractor (http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup).
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271079#comment-14271079 ]

Rangarajan commented on TIKA-623:

Can someone please give a complete working example of how to parse Outlook mails? I want to integrate Solr with Tika and search Outlook PST files.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923703#comment-13923703 ]

Hong-Thai Nguyen commented on TIKA-623:

[~lfcnassif], binary attachments are handled with the embeddedExtractor. BTW, I agree that we can split each mail into a separate unit.

[~talli...@apache.org], we couldn't fix .pst and .msg (.msg is already handled as part of OfficeParser); feel free to finish this issue properly as you can :)

Key: TIKA-623
Assignee: Hong-Thai Nguyen
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922890#comment-13922890 ]

Luis Filipe Nassif commented on TIKA-623:

Good job. I think a possible improvement would be to generate an HTML document for each email, containing its metadata and content, and call the embeddedExtractor to process the generated HTML, instead of printing all emails directly to the xhtmlContentHandler. That way, in addition to attachments, emails could also be extracted from PST files if that is the goal of the application. What do you think?
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922921#comment-13922921 ]

Tim Allison commented on TIKA-623:

Agreed. Is there any way to reuse OutlookParser, or to refactor so that we're using the same lib for an email, whether .pst or .msg? There are lots of lessons learned embedded in the OutlookParser. I'll be happy to chip in as I can. [~thaichat04], thank you for getting this rolling!
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688262#comment-13688262 ]

Gary Gregory commented on TIKA-623:

Did anyone ever push java-libpst to Maven Central? Searching for 'java-libpst' yields 0 results.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160919#comment-13160919 ]

Jukka Zitting commented on TIKA-623:

bq. Is there some way to proceed here without requiring libpst be mavenized?

Certainly. The only thing we'd need is to have the library available as a dependency on the central repository (otherwise we can't push out a Tika release with such a dependency). This requires no changes to the upstream library, just some extra metadata and appropriate -sources and -javadoc jars to accompany the upload. See https://docs.sonatype.org/display/Repository/Uploading+3rd-party+Artifacts+to+The+Central+Repository for details.

Anyone can volunteer to take care of this. See for example https://groups.google.com/d/topic/tagsoup-friends/vIUe_jSR5YQ/discussion for a thread where I volunteered and did this for a recent release of the TagSoup library.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120648#comment-13120648 ]

Mark Kerzner commented on TIKA-623:

Hi, everybody,

I have forked Richard Johnson's java-libpst project here on GitHub: https://github.com/markkerzner/JavaLibpst. My reasons for doing this are as follows:

1. I need java-libpst parsing capabilities for my FreeEed project, https://github.com/markkerzner/FreeEed
2. I want it in Maven, for FreeEed's purposes, and later on I would be happy to see it included in Tika, which also needs it in Maven;
3. I want it in active development, and Richard told me that he has less time for it than before.
4. By no means do I want to take the glory or the project away from Richard, but it is one of the keys for FreeEed's adoption in Windows.

I am in touch with Richard on all that, but I want the community feedback. Should I continue? Should I bring it into some Maven repository? I have been working with Carl Byington and know his libpst somewhat, so that additional qualification should help.

Therefore, please, how am I to proceed?

Thank you.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015152#comment-13015152 ]

Tran Nam Quang commented on TIKA-623:

I started work on the Tika parser, but got stuck with the following problem: In order to access the Outlook PST file, I need to create a PSTFile instance. Now, the PSTFile constructor requires either a File or a String argument that points at the PST file. The constructor then takes either of these to create a RandomAccessFile internally. However, Tika's Parser interface gives me an InputStream. What do I do?
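On the InputStream question in the comment above: one common workaround, sketched here with only the JDK, is to spool the stream to a temporary file and hand that file to whatever needs random access. The names PstSpool and spoolToTempFile are hypothetical and are not part of Tika or java-libpst. Tika's own TikaInputStream offers a getFile() method that serves a similar purpose and is probably the better fit inside a real parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration only; not part of Tika or java-libpst.
final class PstSpool {

    /**
     * Copies the whole stream into a new temporary file and returns its path.
     * The caller is responsible for deleting the file when done.
     */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pst-spool-", ".pst");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            // Do not leave a half-written temp file behind.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; a real caller would pass the stream handed over by the parser API.
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        Path tmp = spoolToTempFile(in);
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            // Random access now works, which a plain InputStream cannot offer.
            raf.seek(2);
            System.out.println("byte at offset 2: " + raf.read());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The returned path can be converted with toFile() and passed to any constructor that expects a File. The cost is one full copy of the PST to disk, which may matter for multi-gigabyte files.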
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015160#comment-13015160 ]

Richard Johnson commented on TIKA-623:

getDescriptorNodeId() is most likely the one you want for a unique identifier. These IDs are for internal use; however, they are guaranteed unique per PST file and are unchanging (incrementally allocated and not reused).

Internet Message IDs are the ones from RFC 2822, and therefore not all PST objects (such as unsent emails) have them.

I'll get this updated in the javadocs.
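To make the distinction between the two identifiers concrete, here is a minimal JDK-only sketch of how an indexing application might build document keys from them. It assumes the descriptor node ID is numeric and unique only within one PST file, as described above, so the key is qualified with an identifier for the file. PstMessageKey and its methods are hypothetical and not part of java-libpst or Tika.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of java-libpst or Tika.
final class PstMessageKey {

    /**
     * Builds a key that is stable for one message inside one PST file.
     * The descriptor node ID is only unique per file, so the file
     * identifier (for example its path) is made part of the key.
     */
    static String stableKey(String pstFileId, long descriptorNodeId) {
        return pstFileId + "#" + descriptorNodeId;
    }

    /**
     * Returns the Internet Message ID if the message has one.
     * Unsent mails and other PST objects may have none, so the
     * result is optional and should not be used as the primary key.
     */
    static Optional<String> internetMessageKey(String internetMessageId) {
        if (internetMessageId == null || internetMessageId.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(internetMessageId.trim());
    }

    public static void main(String[] args) {
        System.out.println(stableKey("archive.pst", 2097188L));
        System.out.println(internetMessageKey("").isPresent());
    }
}
```

The descriptor-based key always exists, while the Internet Message ID is treated as optional extra metadata, for example for matching the same mail across different PST files.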
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015164#comment-13015164 ]

Richard Johnson commented on TIKA-623:

I'll start working on getting the library into Maven Central, thanks for those links Nick.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015171#comment-13015171 ]

Tran Nam Quang commented on TIKA-623:

The PST file is basically a folder tree with emails and other stuff in it. Is there some sort of specification out there that tells me how to map this tree to specific XHTML elements? More specifically, what XML tags should I use to separate the emails from one another? And should the output be just a linear stream of emails, or should the tree structure be included in the output as well?
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015043#comment-13015043 ]

Richard Johnson commented on TIKA-623:

Hey Guys,

I've just uploaded a new version with some cleanups, bug fixes and, most importantly, a new License.

Kind Regards,
Richard
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015045#comment-13015045 ]

Nick Burch commented on TIKA-623:

Great news, Richard. Are you happy to start the process of getting the new release into Maven Central? The process should be largely the same as Ken did with TIKA-462, and Sonatype seem to have a very handy walkthrough of the process at https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015046#comment-13015046 ]

Tran Nam Quang commented on TIKA-623:

Cool! I'll start writing the Tika parser as soon as I can. Could take a couple of days though.

Richard, I have one question regarding the API: PSTMessage has two methods, getDescriptorNodeId() and getInternetMessageId(). Both return identifiers, apparently. My question is: Which one is a unique identifier that will never, ever change? I wouldn't want the Tika parser to extract identifiers that are internal-only and not unique.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013645#comment-13013645 ]

Tran Nam Quang commented on TIKA-623:

I contacted the library author, and he agreed to dual-license the library as LGPL/Apache. I hope this clears up the licensing issues.

As for the Tika parser, I won't be able to implement that before Saturday or Sunday (assuming I'm still supposed to).
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013657#comment-13013657 ]

Nick Burch commented on TIKA-623:

The re-license is great news!

There are two steps needed then:
* Get a version of libpst into Maven Central (so we can include it as a dependency)
* Write a Parser which uses libpst, likely one that does all the metadata bits and delegates to other parsers for the message body + attachments

For the former, see something like TIKA-407 for a guide. For the latter, I'd suggest cribbing off something like PackageParser and the Outlook Parser.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013673#comment-13013673 ]

Tran Nam Quang commented on TIKA-623:

I have zero experience with Maven, so I don't think I'm the right person to take care of the Maven upload. I might be able to handle the Parser, although it'll probably have to wait until the library author makes a new relicensed release available.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012730#comment-13012730 ]

Nick Burch commented on TIKA-623:

If it's LGPL then we can't include it in Tika as standard.

However, it is possible to have the parser dynamically loaded if a user chooses to download the parser + dependent files (if the license works for them).

If you're interested in PST support, then I'd suggest you try to knock up a basic parser using libpst. If you do get it working, please list it on the wiki: http://wiki.apache.org/tika/3rd%20party%20parser%20plugins

If you need help with developing the plugin, please ask on the dev list. You might also be interested in looking at the relatively small patch that was all that was required to enable JTNEF (GPL) to be used as a Tika plugin: https://github.com/jukka/jtnef/commit/a9a51982165101c0bdda4cb5266d7f8958c271ef
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012744#comment-13012744 ]

Tran Nam Quang commented on TIKA-623:

What license is required for inclusion in Tika, other than the Apache License 2.0? I could ask the author to change the license or switch to dual-licensing...

The basic parser is already listed as an example on the front page of the java-libpst website, by the way.
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012746#comment-13012746 ]

Uwe Schindler commented on TIKA-623:

From looking at the code of this library, it looks like it needs some improvements/fixes:
- It catches all exceptions and, instead of simply doing wrap'n'rethrow or declaring the checked exceptions in the methods, it prints the stack trace to System.out. Messages are printed to System.out as well.
- The RTF compression decoder uses new String(byte[]) without a charset, which is locale dependent! Other places do this, too. This is broken, as the file format should define the charset.
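The two review points above can be illustrated with a short JDK-only sketch. It is not taken from java-libpst; CharsetDecoding and its methods are hypothetical names. The first method passes the charset explicitly instead of relying on the platform default, and the second reports undecodable input by wrapping and rethrowing the exception rather than printing a stack trace.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not code from java-libpst.
final class CharsetDecoding {

    /** Decodes with an explicit charset, so the result does not depend on the JVM locale. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    /**
     * Strict variant: malformed input is reported to the caller by
     * wrapping and rethrowing, instead of being printed and swallowed.
     */
    static String decodeStrict(byte[] raw, Charset charset) throws IOException {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IOException(
                    "Could not decode " + raw.length + " bytes as " + charset, e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE9}; // a single byte whose meaning depends on the charset

        // The same byte decodes to different text under different charsets.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        System.out.println(decode(raw, StandardCharsets.UTF_8));

        try {
            decodeStrict(raw, StandardCharsets.UTF_8);
        } catch (IOException e) {
            System.out.println("strict decode failed: " + e.getMessage());
        }
    }
}
```

Which charset to pass is a property of the file format or of the message itself, so the sketch leaves that choice to the caller.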