[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013673#comment-13013673 ]

Tran Nam Quang commented on TIKA-623:
-------------------------------------

I have zero experience with Maven, so I don't think I'm the right person to take care of the Maven upload. I might be able to handle the Parser, although it'll probably have to wait until the library author makes a new relicensed release available.

> Add support for Outlook PST
> ---------------------------
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Tran Nam Quang
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST
> file. There's a relatively new Java library called java-libpst for reading
> Outlook PST files. It is licensed under the LGPL and available over here:
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013657#comment-13013657 ]

Nick Burch commented on TIKA-623:
---------------------------------

The re-license is great news! There are two steps needed then:
* Get a version of libpst into Maven Central (so we can include it as a dependency)
* Write a Parser which uses libpst, likely one that does all the metadata bits and delegates to other parsers for the message body + attachments

For the former, see something like TIKA-407 for a guide. For the latter, I'd suggest cribbing off something like PackageParser and the Outlook Parser.
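The second step above (a parser that handles the metadata itself and hands the message body to other parsers) could be sketched roughly as follows. This is only an illustration of the delegation idea: `SketchMessage`, `SketchDelegate` and `PstParserSketch` are hypothetical stand-ins invented for this example. They are neither java-libpst's API nor Tika's Parser API, and the messages are passed in as a plain list instead of being read from a PST file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one message read from a PST file (not java-libpst's API).
final class SketchMessage {
    final String subject;
    final String sender;
    final String bodyType; // media type of the body, e.g. "text/plain"
    final byte[] body;

    SketchMessage(String subject, String sender, String bodyType, byte[] body) {
        this.subject = subject;
        this.sender = sender;
        this.bodyType = bodyType;
        this.body = body;
    }
}

// Hypothetical stand-in for a parser that handles one kind of body or attachment.
interface SketchDelegate {
    String extractText(byte[] content);
}

// Container parser: records per-message metadata itself and delegates the body.
final class PstParserSketch {
    private final Map<String, SketchDelegate> delegates = new HashMap<>();

    void register(String mediaType, SketchDelegate delegate) {
        delegates.put(mediaType, delegate);
    }

    List<Map<String, String>> parse(List<SketchMessage> messages) {
        List<Map<String, String>> results = new ArrayList<>();
        for (SketchMessage message : messages) {
            Map<String, String> entry = new LinkedHashMap<>();
            // The "metadata bits" are filled in by the container parser.
            entry.put("subject", message.subject);
            entry.put("from", message.sender);
            // The body goes to whichever delegate is registered for its type;
            // unknown types yield empty text in this sketch.
            SketchDelegate delegate = delegates.get(message.bodyType);
            entry.put("text", delegate == null ? "" : delegate.extractText(message.body));
            results.add(entry);
        }
        return results;
    }
}

class PstParserSketchDemo {
    public static void main(String[] args) {
        PstParserSketch parser = new PstParserSketch();
        parser.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));

        List<SketchMessage> inbox = new ArrayList<>();
        inbox.add(new SketchMessage("Hello", "someone@example.org", "text/plain",
                "Hi there".getBytes(StandardCharsets.UTF_8)));
        System.out.println(parser.parse(inbox));
    }
}
```

A real implementation would presumably walk the PST folder tree through the library and hand bodies and attachments to Tika's existing parsers, in the way PackageParser does for archive entries.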
[jira] [Commented] (TIKA-626) Add a BaseParser class
[ https://issues.apache.org/jira/browse/TIKA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013653#comment-13013653 ]

Nick Burch commented on TIKA-626:
---------------------------------

Or AbstractParser? The idea seems good to me though!

> Add a BaseParser class
> ----------------------
>
> Key: TIKA-626
> URL: https://issues.apache.org/jira/browse/TIKA-626
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
>
> The deprecated parse() method in the Parser interface causes quite a few
> repetitive method declarations in all parser classes, so to simplify things
> I'd like to introduce a BaseParser class that provides reasonable default
> implementations of all Parser methods and can be used as the base class of
> any Parser implementation. It would be analogous to the DefaultHandler class
> in SAX, and would also make it easier for us to introduce new Parser methods
> if needed later on without necessarily breaking too many existing Parser
> classes.
[jira] [Issue Comment Edited] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013645#comment-13013645 ]

Tran Nam Quang edited comment on TIKA-623 at 3/30/11 9:03 PM:
--------------------------------------------------------------

I contacted the library author, and he agreed to dual-license the library as LGPL/Apache. This means java-libpst can be included by default in Tika, right?
As for the Tika parser, I won't be able to implement that before Saturday or Sunday (assuming I'm still supposed to).

was (Author: qforce):
I contacted the library author, and he agreed to dual-license the library as LGPL/Apache. I hope this clears up the licensing issues.
As for the Tika parser, I won't be able to implement that before Saturday or Sunday (assuming I'm still supposed to).
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013645#comment-13013645 ]

Tran Nam Quang commented on TIKA-623:
-------------------------------------

I contacted the library author, and he agreed to dual-license the library as LGPL/Apache. I hope this clears up the licensing issues.
As for the Tika parser, I won't be able to implement that before Saturday or Sunday (assuming I'm still supposed to).
[jira] [Updated] (TIKA-626) Add a BaseParser class
[ https://issues.apache.org/jira/browse/TIKA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-626:
------------------------------

Description: The deprecated parse() method in the Parser interface causes quite a few repetitive method declarations in all parser classes, so to simplify things I'd like to introduce a BaseParser class that provides reasonable default implementations of all Parser methods and can be used as the base class of any Parser implementation. It would be analogous to the DefaultHandler class in SAX, and would also make it easier for us to introduce new Parser methods if needed later on without necessarily breaking too many existing Parser classes.

(was: The deprecated parse() method in the Parser interface causes quite a few repetitive method declarations in all parser classes, so to simplify things I'd like to introduce a DefaultParser class that provides reasonable default implementations of all Parser methods and can be used as the base class of any Parser implementation. It would be analogous to the DefaultHandler class in SAX, and would also make it easier for us to introduce new Parser methods if needed later on without necessarily breaking too many existing Parser classes.)

Summary: Add a BaseParser class (was: Add a DefaultParser class)

Hmm, we already have a DefaultParser class. Renamed this issue to BaseParser.
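The base-class idea described in this issue might look roughly like the sketch below. It is not Tika's actual code. `SketchParser`, `SketchBaseParser` and `UpperCaseParser` are hypothetical names, and the method signatures are simplified stand-ins (a `StringBuilder` for the output and plain maps for metadata and context) so that the example runs on the JDK alone. The point is only that the deprecated overload is implemented once, in the base class, by forwarding to the newer method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified stand-in for a Parser interface (not Tika's real API).
interface SketchParser {
    // Newer, context-aware method that every implementation must provide.
    void parse(InputStream stream, StringBuilder out, Map<String, String> metadata,
               Map<String, Object> context) throws IOException;

    // Older overload without a context argument, kept for compatibility.
    @Deprecated
    void parse(InputStream stream, StringBuilder out, Map<String, String> metadata)
            throws IOException;
}

// Base class that supplies the repetitive default exactly once.
abstract class SketchBaseParser implements SketchParser {
    @Override
    @Deprecated
    public void parse(InputStream stream, StringBuilder out, Map<String, String> metadata)
            throws IOException {
        // Forward to the context-aware method with an empty context.
        parse(stream, out, metadata, new HashMap<>());
    }
}

// A concrete parser now only implements the context-aware method.
class UpperCaseParser extends SketchBaseParser {
    @Override
    public void parse(InputStream stream, StringBuilder out, Map<String, String> metadata,
                      Map<String, Object> context) throws IOException {
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        metadata.put("length", Integer.toString(text.length()));
        out.append(text.toUpperCase(Locale.ROOT));
    }
}

class BaseParserSketchDemo {
    public static void main(String[] args) throws IOException {
        StringBuilder out = new StringBuilder();
        Map<String, String> metadata = new HashMap<>();
        InputStream in = new ByteArrayInputStream("tika".getBytes(StandardCharsets.UTF_8));
        // Calls the inherited deprecated overload, which forwards to the subclass.
        new UpperCaseParser().parse(in, out, metadata);
        System.out.println(out + " " + metadata);
    }
}
```

As with DefaultHandler in SAX, a new method added to the interface later could get its default in the base class, so subclasses such as `UpperCaseParser` would not need to change.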
[jira] [Created] (TIKA-626) Add a DefaultParser class
Add a DefaultParser class
-------------------------

Key: TIKA-626
URL: https://issues.apache.org/jira/browse/TIKA-626
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor

The deprecated parse() method in the Parser interface causes quite a few repetitive method declarations in all parser classes, so to simplify things I'd like to introduce a DefaultParser class that provides reasonable default implementations of all Parser methods and can be used as the base class of any Parser implementation. It would be analogous to the DefaultHandler class in SAX, and would also make it easier for us to introduce new Parser methods if needed later on without necessarily breaking too many existing Parser classes.
[jira] [Created] (TIKA-625) Easier XML parser extensibility
Easier XML parser extensibility
-------------------------------

Key: TIKA-625
URL: https://issues.apache.org/jira/browse/TIKA-625
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor

The DcXMLParser class uses our streaming XPath mechanism to locate Dublin Core elements from a stream of SAX events. While powerful, that mechanism is a bit cumbersome to use for simple use cases where you'd just want to map the contents of a specific XML element or attribute into a metadata field. To make this simpler (and to remove the XPath processing overhead), I'd like to add new Attribute- and ElementMetadataHandler utility classes that focus on this specific use case.
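To illustrate the simple use case described here, the following is a minimal sketch of a SAX handler that copies the text of one named element into a metadata field by comparing names directly, with no XPath matching. It is an assumption about what such a utility could look like, not the proposed Tika class. `ElementMetadataSketch` and its `extract` helper are hypothetical, and a plain `Map` stands in for Tika's metadata object.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: maps the text content of one element to a metadata field.
class ElementMetadataSketch extends DefaultHandler {
    private final String targetUri;
    private final String targetName;
    private final String field;
    private final Map<String, String> metadata;
    private final StringBuilder buffer = new StringBuilder();
    private int depth = 0; // greater than zero while inside the target element

    ElementMetadataSketch(String targetUri, String targetName, String field,
                          Map<String, String> metadata) {
        this.targetUri = targetUri;
        this.targetName = targetName;
        this.field = field;
        this.metadata = metadata;
    }

    private boolean isTarget(String uri, String localName) {
        return targetUri.equals(uri) && targetName.equals(localName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (isTarget(uri, localName)) {
            if (depth == 0) {
                buffer.setLength(0); // start collecting text for a new occurrence
            }
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isTarget(uri, localName)) {
            depth--;
            if (depth == 0) {
                // If the element occurs several times, the last value wins in this sketch.
                metadata.put(field, buffer.toString().trim());
            }
        }
    }

    // Convenience helper for the example: parse a string with a namespace-aware SAX parser.
    static Map<String, String> extract(String xml, String uri, String name, String field)
            throws Exception {
        Map<String, String> metadata = new HashMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new ElementMetadataSketch(uri, name, field, metadata));
        return metadata;
    }
}

class ElementMetadataSketchDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<doc xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>Hello Tika</dc:title></doc>";
        System.out.println(ElementMetadataSketch.extract(
                xml, "http://purl.org/dc/elements/1.1/", "title", "title"));
    }
}
```

An attribute-based counterpart would do the same comparison in `startElement` and read the value from the `Attributes` argument instead of buffering character data.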
[jira] [Resolved] (TIKA-624) Fix parsers list in the 0.7 - 0.9 websites
[ https://issues.apache.org/jira/browse/TIKA-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-624.
-----------------------------

Resolution: Fixed

0.8 and 0.9 pages updated

> Fix parsers list in the 0.7 - 0.9 websites
> ------------------------------------------
>
> Key: TIKA-624
> URL: https://issues.apache.org/jira/browse/TIKA-624
> Project: Tika
> Issue Type: Task
> Components: documentation
> Reporter: Nick Burch
> Assignee: Nick Burch
>
> As noticed on the mailing list a few days back, the parsers and supported
> formats list for 0.6 onwards all say "This page lists all the document
> formats supported by Apache Tika 0.6.". Need to review the pages for 0.7
> onwards and tweak as required.
[jira] [Created] (TIKA-624) Fix parsers list in the 0.7 - 0.9 websites
Fix parsers list in the 0.7 - 0.9 websites
------------------------------------------

Key: TIKA-624
URL: https://issues.apache.org/jira/browse/TIKA-624
Project: Tika
Issue Type: Task
Components: documentation
Reporter: Nick Burch
Assignee: Nick Burch

As noticed on the mailing list a few days back, the parsers and supported formats list for 0.6 onwards all say "This page lists all the document formats supported by Apache Tika 0.6.". Need to review the pages for 0.7 onwards and tweak as required.