[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-30 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013673#comment-13013673
 ] 

Tran Nam Quang commented on TIKA-623:
-

I have zero experience with Maven, so I don't think I'm the right person to 
take care of the Maven upload.

I might be able to handle the Parser, although it'll probably have to wait 
until the library author makes a new relicensed release available.

> Add support for Outlook PST
> ---
>
> Key: TIKA-623
> URL: https://issues.apache.org/jira/browse/TIKA-623
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Tran Nam Quang
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST 
> file. There's a relatively new Java library called java-libpst for reading 
> Outlook PST files. It is licensed under the LGPL and available over here: 
> http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good 
> results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013657#comment-13013657
 ] 

Nick Burch commented on TIKA-623:
-

The re-license is great news! There are two steps needed then:
* Get a version of libpst into Maven Central (so we can include it as a 
dependency)
* Write a Parser which uses libpst, likely one that does all the metadata bits 
and delegates to other parsers for the message body + attachments

For the former, see something like TIKA-407 for a guide. For the latter, I'd 
suggest cribbing off something like PackageParser and the Outlook Parser.
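
A rough sketch of the second step, under loud assumptions: none of the names 
below come from Tika or java-libpst. SketchMessage stands in for whatever the 
library returns for one mail, and the Function stands in for "other parsers". 
It only illustrates the split between handling metadata in the container 
parser and delegating the message body and attachments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for one message read from a PST file. */
class SketchMessage {
    final String subject;
    final String body;
    final List<String> attachments;

    SketchMessage(String subject, String body, List<String> attachments) {
        this.subject = subject;
        this.body = body;
        this.attachments = attachments;
    }
}

/** Hypothetical container parser: records metadata itself, delegates content. */
class SketchContainerParser {
    // Stands in for the parsers that handle bodies and attachments.
    private final Function<String, String> delegate;

    SketchContainerParser(Function<String, String> delegate) {
        this.delegate = delegate;
    }

    List<String> parse(List<SketchMessage> messages) {
        List<String> out = new ArrayList<>();
        for (SketchMessage message : messages) {
            // The "metadata bits" are handled here, in the container parser.
            out.add("subject=" + message.subject);
            // Body and attachments are handed to the delegate unchanged.
            out.add(delegate.apply(message.body));
            for (String attachment : message.attachments) {
                out.add(delegate.apply(attachment));
            }
        }
        return out;
    }
}
```

In a real Parser the delegate would presumably be chosen per content type 
rather than fixed up front; that lookup is left out here.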



[jira] [Commented] (TIKA-626) Add a BaseParser class

2011-03-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013653#comment-13013653
 ] 

Nick Burch commented on TIKA-626:
-

Or AbstractParser? The idea seems good to me though!

> Add a BaseParser class
> --
>
> Key: TIKA-626
> URL: https://issues.apache.org/jira/browse/TIKA-626
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Jukka Zitting
>Priority: Minor
>
> The deprecated parse() method in the Parser interface causes quite a few 
> repetitive method declarations in all parser classes, so to simplify things 
> I'd like to introduce a BaseParser class that provides reasonable default 
> implementations of all Parser methods and can be used as the base class of 
> any Parser implementation. It would be analogous to the DefaultHandler class 
> in SAX, and would also make it easier for us to introduce new Parser methods 
> if needed later on without necessarily breaking too many existing Parser 
> classes.



[jira] [Issue Comment Edited] (TIKA-623) Add support for Outlook PST

2011-03-30 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013645#comment-13013645
 ] 

Tran Nam Quang edited comment on TIKA-623 at 3/30/11 9:03 PM:
--

I contacted the library author, and he agreed to dual-license the library as 
LGPL/Apache. This means java-libpst can be included by default in Tika, right?

As for the Tika parser, I won't be able to implement that before Saturday or 
Sunday (assuming I'm still supposed to).

  was (Author: qforce):
I contacted the library author, he agreed to dual-licensing the library as 
LGPL/Apache. I hope this clears up the licensing issues.

As for the Tika parser, I won't be able to implement that before Saturday or 
Sunday (assuming I'm still supposed to).
  


[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-30 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013645#comment-13013645
 ] 

Tran Nam Quang commented on TIKA-623:
-

I contacted the library author, and he agreed to dual-license the library as 
LGPL/Apache. I hope this clears up the licensing issues.

As for the Tika parser, I won't be able to implement that before Saturday or 
Sunday (assuming I'm still supposed to).



[jira] [Updated] (TIKA-626) Add a BaseParser class

2011-03-30 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated TIKA-626:
---

Description: The deprecated parse() method in the Parser interface causes 
quite a few repetitive method declarations in all parser classes, so to 
simplify things I'd like to introduce a BaseParser class that provides 
reasonable default implementations of all Parser methods and can be used as the 
base class of any Parser implementation. It would be analogous to the 
DefaultHandler class in SAX, and would also make it easier for us to introduce 
new Parser methods if needed later on without necessarily breaking too many 
existing Parser classes.  (was: The deprecated parse() method in the Parser 
interface causes quite a few repetitive method declarations in all parser 
classes, so to simplify things I'd like to introduce a DefaultParser class that 
provides reasonable default implementations of all Parser methods and can be 
used as the base class of any Parser implementation. It would be analogous to 
the DefaultHandler class in SAX, and would also make it easier for us to 
introduce new Parser methods if needed later on without necessarily breaking 
too many existing Parser classes.)
Summary: Add a BaseParser class  (was: Add a DefaultParser class)

Hmm, we already have a DefaultParser class. Renamed this issue to BaseParser.
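
To make the proposal concrete, here is a minimal sketch of the base-class 
idea. It assumes a simplified two-method interface: SketchParser and its 
signatures are stand-ins invented for illustration, not Tika's real Parser 
API, and SketchBaseParser is not the proposed class itself. The point is only 
that the deprecated overload is implemented once and concrete parsers inherit 
it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Simplified stand-in for a parser interface; NOT Tika's real signatures. */
interface SketchParser {
    void parse(String input, StringBuilder output, Map<String, String> context);

    /** Older entry point kept only for backwards compatibility. */
    @Deprecated
    void parse(String input, StringBuilder output);
}

/** Hypothetical base class that supplies the repetitive declaration once. */
abstract class SketchBaseParser implements SketchParser {
    @Override
    @Deprecated
    public void parse(String input, StringBuilder output) {
        // Default behaviour: forward to the main method with an empty context.
        parse(input, output, new HashMap<String, String>());
    }
}

/** A concrete parser now declares only the method that does real work. */
class UpperCaseParser extends SketchBaseParser {
    @Override
    public void parse(String input, StringBuilder output, Map<String, String> context) {
        output.append(input.toUpperCase(Locale.ROOT));
    }
}
```

If another method were later added to the interface, a default could go into 
the base class without touching subclasses such as UpperCaseParser, which is 
the same convenience DefaultHandler gives SAX handlers.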



[jira] [Created] (TIKA-626) Add a DefaultParser class

2011-03-30 Thread Jukka Zitting (JIRA)
Add a DefaultParser class
-

 Key: TIKA-626
 URL: https://issues.apache.org/jira/browse/TIKA-626
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor


The deprecated parse() method in the Parser interface causes quite a few 
repetitive method declarations in all parser classes, so to simplify things I'd 
like to introduce a DefaultParser class that provides reasonable default 
implementations of all Parser methods and can be used as the base class of any 
Parser implementation. It would be analogous to the DefaultHandler class in 
SAX, and would also make it easier for us to introduce new Parser methods if 
needed later on without necessarily breaking too many existing Parser classes.



[jira] [Created] (TIKA-625) Easier XML parser extensibility

2011-03-30 Thread Jukka Zitting (JIRA)
Easier XML parser extensibility
---

 Key: TIKA-625
 URL: https://issues.apache.org/jira/browse/TIKA-625
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
Priority: Minor


The DcXMLParser class uses our streaming XPath mechanism to locate Dublin Core 
elements in a stream of SAX events. While powerful, that mechanism is a bit 
cumbersome for simple use cases where you just want to map the contents of a 
specific XML element or attribute into a metadata field. To make this simpler 
(and to remove the XPath processing overhead), I'd like to add new 
AttributeMetadataHandler and ElementMetadataHandler utility classes that focus 
on this specific use case.
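
As a sketch of what the element case could look like, the handler below copies 
the text of one named element into a metadata field using plain SAX, with no 
XPath matching. The class name ElementMetadataSketch, the extract() helper, 
and the use of a plain Map in place of Tika's Metadata class are all 
assumptions made so the example runs on the JDK alone; the attribute variant 
is not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: maps the text of one XML element to one metadata field. */
class ElementMetadataSketch extends DefaultHandler {
    private final String uri;
    private final String localName;
    private final String field;
    private final Map<String, String> metadata; // stand-in for a metadata object
    private final StringBuilder buffer = new StringBuilder();
    private int depth = 0; // greater than zero while inside the target element

    ElementMetadataSketch(String uri, String localName, String field,
                          Map<String, String> metadata) {
        this.uri = uri;
        this.localName = localName;
        this.field = field;
        this.metadata = metadata;
    }

    private boolean matches(String u, String l) {
        return uri.equals(u) && localName.equals(l);
    }

    @Override
    public void startElement(String u, String l, String qName, Attributes atts) {
        if (matches(u, l)) {
            if (depth == 0) {
                buffer.setLength(0); // start collecting fresh text
            }
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String u, String l, String qName) {
        if (matches(u, l) && --depth == 0) {
            metadata.put(field, buffer.toString().trim());
        }
    }

    /** Convenience helper for the sketch: parse a string and return the map. */
    static Map<String, String> extract(String xml, String uri, String localName,
                                       String field) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so uri/localName are set
            Map<String, String> metadata = new LinkedHashMap<>();
            factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new ElementMetadataSketch(uri, localName, field, metadata));
            return metadata;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, calling extract() with the Dublin Core namespace 
http://purl.org/dc/elements/1.1/ and the local name "title" would fill the 
chosen field from a dc:title element and ignore everything else in the stream.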



[jira] [Resolved] (TIKA-624) Fix parsers list in the 0.7 - 0.9 websites

2011-03-30 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-624.
-

Resolution: Fixed

0.8 and 0.9 pages updated

> Fix parsers list in the 0.7 - 0.9 websites
> --
>
> Key: TIKA-624
> URL: https://issues.apache.org/jira/browse/TIKA-624
> Project: Tika
>  Issue Type: Task
>  Components: documentation
>Reporter: Nick Burch
>Assignee: Nick Burch
>
> As noticed on the mailing list a few days back, the parsers and supported 
> formats lists for 0.6 onwards all say "This page lists all the document 
> formats supported by Apache Tika 0.6." We need to review the pages for 0.7 
> onwards and tweak them as required.



[jira] [Created] (TIKA-624) Fix parsers list in the 0.7 - 0.9 websites

2011-03-30 Thread Nick Burch (JIRA)
Fix parsers list in the 0.7 - 0.9 websites
--

 Key: TIKA-624
 URL: https://issues.apache.org/jira/browse/TIKA-624
 Project: Tika
  Issue Type: Task
  Components: documentation
Reporter: Nick Burch
Assignee: Nick Burch


As noticed on the mailing list a few days back, the parsers and supported 
formats lists for 0.6 onwards all say "This page lists all the document formats 
supported by Apache Tika 0.6." We need to review the pages for 0.7 onwards and 
tweak them as required.
