[jira] [Commented] (TIKA-623) Add support for Outlook PST

2015-01-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272497#comment-14272497
 ] 

Tim Allison commented on TIKA-623:
--

Gah! Of course. Sorry, and thank you. Should we modify the PSTParser so that it 
can take an EmbeddedParserDecorator, i.e. an inner-class parser that would grab 
the mail object from the ParseContext instead of handling the InputStream?

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-623) Add support for Outlook PST

2015-01-09 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272297#comment-14272297
 ] 

Luis Filipe Nassif commented on TIKA-623:
-

I think OutlookPSTParser currently does not extract .msg files, because they do 
not exist inside a PST; mails are broken into several pieces. Looking at the 
source, it seems to extract/process raw text mail bodies and attachments, even 
if you set up the parsing to recurse down only one level.

And to get the relationship between a mail and its attachments, I think you 
currently need to monitor the handler output. I think the parser could be 
improved to set a parent mail id in the metadata of its attachments, and vice 
versa, to make it easier to recover the relationships.
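
A minimal sketch of the parent-id idea, assuming plain maps as a stand-in for 
the parser's metadata objects. The key names `pst:id` and `pst:parentId` and 
the class `MailLinks` are hypothetical; the thread does not define any of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only: plain maps stand in for parser metadata.
final class MailLinks {
    // Hypothetical key names; the thread does not define any.
    static final String ID = "pst:id";
    static final String PARENT_ID = "pst:parentId";

    private MailLinks() {}

    /** Metadata for a mail, carrying its own identifier. */
    static Map<String, String> forMail(long mailId) {
        Map<String, String> md = new LinkedHashMap<>();
        md.put(ID, Long.toString(mailId));
        return md;
    }

    /** Metadata for the n-th attachment of a mail, pointing back at its parent. */
    static Map<String, String> forAttachment(long mailId, int index) {
        Map<String, String> md = new LinkedHashMap<>();
        md.put(ID, mailId + "/" + index);
        md.put(PARENT_ID, Long.toString(mailId));
        return md;
    }
}
```

With such a back-reference, a consumer could regroup attachments under their 
mail by matching the parent id, without watching the handler output. The 
reverse link (mail to attachments) could be recorded in the same way.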



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2015-01-09 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271534#comment-14271534
 ] 

Luis Filipe Nassif commented on TIKA-623:
-

Maybe OutlookPSTParserTest can help: 
http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mbox/OutlookPSTParserTest.java

ParsingEmbeddedDocumentExtractor simply appends the contents of all mails 
together, so I think the hits will point to the PST file. You could override 
the parseEmbedded(...) method to extract individual mails and process (index) 
them separately, but I do not know how to do this with Solr.



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2015-01-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271773#comment-14271773
 ] 

Tim Allison commented on TIKA-623:
--

[~lfcnassif]'s suggestion is the cleanest way to handle going down only one 
level, i.e. to process each .msg file individually.

You could use Tika app's -z | --extract feature to extract all attachments 
before ingesting into Solr...that would be a preprocessing step before running 
Solr's DIH.  One problem with that approach is that embedded docs within an 
.msg file will be extracted into separate files...

Another option if you wanted to work on this programmatically would be to send 
via ParseContext a custom EmbeddedDocumentExtractor or a ParserDecorator.  
You'd have to be careful to ensure that it only goes down one level.  The 
default behavior would be to run that extractor/decorator against all embedded 
documents individually including attachments to .msg files, which you may or 
may not want.
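
The "only one level" guard can be modelled as a depth check. The sketch below 
is not Tika code: `Doc` and `OneLevelWalker` are hypothetical stand-ins, and in 
a real setup the same check would presumably live inside the custom 
EmbeddedDocumentExtractor passed via ParseContext.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a container document; not a Tika class.
final class Doc {
    final String name;
    final List<Doc> embedded = new ArrayList<>();

    Doc(String name) { this.name = name; }
}

final class OneLevelWalker {
    private final int maxDepth;
    private final List<String> handled = new ArrayList<>();

    OneLevelWalker(int maxDepth) { this.maxDepth = maxDepth; }

    /** Returns the names of the embedded documents handled, in visit order. */
    List<String> walk(Doc root) {
        handled.clear();
        visit(root, 0);
        return new ArrayList<>(handled);
    }

    private void visit(Doc doc, int depth) {
        if (depth >= maxDepth) {
            return; // depth limit reached: do not descend any further
        }
        for (Doc child : doc.embedded) {
            handled.add(child.name);
            visit(child, depth + 1);
        }
    }
}
```

With `maxDepth` set to 1, the mails directly inside the container are handled 
but their attachments are left alone; a larger value lets the walker descend 
into attachments as well.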

Take a look at FileEmbeddedDocumentExtractor 
[http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup|here]
 or MyEmbeddedDocumentExtractor 
[http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup|here]



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2015-01-09 Thread Rangarajan (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271079#comment-14271079
 ] 

Rangarajan commented on TIKA-623:
-

Can someone please give a complete working example of how to parse Outlook 
mails? I want to integrate Solr with Tika and search Outlook PST files.



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-07 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923703#comment-13923703
 ] 

Hong-Thai Nguyen commented on TIKA-623:
---

[~lfcnassif], binary attachments are handled with the embeddedExtractor. BTW, I 
agree that we can split each mail into a separate unit.
[~talli...@apache.org], we couldn't fix .pst and .msg (msg is already handled 
as part of OfficeParser), and feel free to finish this issue properly as you 
can :)

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
Assignee: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: OutlookPSTParser.java




[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-06 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922890#comment-13922890
 ] 

Luis Filipe Nassif commented on TIKA-623:
-

Good job. I think a possible improvement would be to generate an HTML document 
for each email, containing its metadata and content, and call the 
embeddedExtractor to process the generated HTML, instead of printing all emails 
directly to xhtmlContentHandler.  So, in addition to attachments, emails could 
also be extracted from PST files if that is the goal of the application. What 
do you think?



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922921#comment-13922921
 ] 

Tim Allison commented on TIKA-623:
--

Agreed.  Is there any way to reuse OutlookParser, or to refactor so that we're 
using the same lib for an email, whether .pst or .msg?  There are lots of 
lessons learned embedded in the OutlookParser.  I'll be happy to chip in as I 
can.  [~thaichat04], thank you for getting this rolling!



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2013-06-19 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688262#comment-13688262
 ] 

Gary Gregory commented on TIKA-623:
---

Did anyone ever push java-libpst to Maven Central? Searching for 'java-libpst' 
yields 0 results.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-12-01 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160919#comment-13160919
 ] 

Jukka Zitting commented on TIKA-623:


bq. Is there some way to proceed here without requiring libpst be mavenized?

Certainly. The only thing we'd need is to have the library available as a 
dependency on the central repository (otherwise we can't push out a Tika 
release with such a dependency). This requires no changes to the upstream 
library, just some extra metadata and appropriate -sources and -javadoc jars to 
accompany the upload. See 
https://docs.sonatype.org/display/Repository/Uploading+3rd-party+Artifacts+to+The+Central+Repository
 for details.

Anyone can volunteer to take care of this. See for example 
https://groups.google.com/d/topic/tagsoup-friends/vIUe_jSR5YQ/discussion for a 
thread where I volunteered and did this for a recent release of the TagSoup 
library.





[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-10-04 Thread Mark Kerzner (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120648#comment-13120648
 ] 

Mark Kerzner commented on TIKA-623:
---

Hi, everybody,

I have forked Richard Johnson's java-libpst project here on GitHub 
https://github.com/markkerzner/JavaLibpst. My reasons for doing this are as 
follows:

1. I need java-libpst parsing capabilities for my FreeEed project 
https://github.com/markkerzner/FreeEed
2. I want it in Maven, for FreeEed's purposes, and later on I would be happy to 
see it included in Tika, which also needs it in Maven;
3. I want it in active development, and Richard told me that he has less time 
for it than before.
4. By no means do I want to take the glory or the project away from Richard, 
but it is one of the keys for FreeEed's adoption in Windows.

I am in touch with Richard on all that, but I want the community feedback. 
Should I continue? Should I bring it into some Maven repository? I have been 
working with Carl Byington and know his libpst somewhat, so that additional 
qualification should help. Therefore, please, how am I to proceed?

Thank you.







[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-04-03 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015152#comment-13015152
 ] 

Tran Nam Quang commented on TIKA-623:
-

I started work on the Tika parser, but got stuck with the following problem: In 
order to access the Outlook PST file, I need to create a PSTFile instance. Now, 
the PSTFile constructor requires either a File or a String argument that points 
at the PST file. The constructor then takes either of these to create a 
RandomAccessFile internally. However, Tika's Parser interface gives me an 
InputStream. What do I do?
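
One common way around this mismatch is to spool the stream to a temporary file 
and hand that file to the constructor. The sketch below uses only the JDK; the 
class `PstSpool` and its method name are hypothetical and not part of Tika or 
java-libpst.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; not part of Tika or java-libpst.
final class PstSpool {
    private PstSpool() {}

    /** Copies the whole stream into a new temporary file and returns its path. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("tika-pst-", ".pst");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
    }
}
```

The resulting path can be converted with `toFile()` for a constructor that 
expects a File, and the caller should delete the file once parsing is done. 
Inside a Tika parser, TikaInputStream's `getFile()` may be a better fit, since 
it can avoid the copy when the stream is already backed by a file.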



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-04-03 Thread Richard Johnson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015160#comment-13015160
 ] 

Richard Johnson commented on TIKA-623:
--

getDescriptorNodeId() is most likely the one you want for a unique identifier.  
These IDs are for internal use; however, they are guaranteed to be unique per 
PST file and are unchanging (incrementally allocated and not reused).

Internet Message IDs are the ones from RFC 2822, and therefore not all PST 
objects (such as unsent emails) have them.

I'll get this updated in the javadocs.



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-04-03 Thread Richard Johnson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015164#comment-13015164
 ] 

Richard Johnson commented on TIKA-623:
--

I'll start working on getting the library into Maven Central, thanks for those 
links Nick.



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-04-03 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015171#comment-13015171
 ] 

Tran Nam Quang commented on TIKA-623:
-

The PST file is basically a folder tree with emails and other stuff in it. Is 
there some sort of specification out there that tells me how to map this tree 
to specific XHTML elements?

More specifically, what XML tags should I use to separate the emails from one 
another? And should the output be just a linear stream of emails, or should the 
tree structure be included in the output as well?
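
The thread does not settle this question. Purely as an illustration of the 
"keep the tree" option, the sketch below renders a folder tree as nested div 
elements, one per folder and one per email. `MailFolder`, `FolderXhtml` and 
the `folder`/`email` class names are hypothetical and are neither a Tika 
convention nor the java-libpst API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a PST folder; not the java-libpst API.
final class MailFolder {
    final String name;
    final List<String> subjects = new ArrayList<>();
    final List<MailFolder> children = new ArrayList<>();

    MailFolder(String name) { this.name = name; }
}

final class FolderXhtml {
    private FolderXhtml() {}

    /** Renders the folder, its emails and its subfolders as nested divs. */
    static String render(MailFolder folder) {
        StringBuilder out = new StringBuilder();
        render(folder, out);
        return out.toString();
    }

    private static void render(MailFolder folder, StringBuilder out) {
        out.append("<div class=\"folder\">");
        out.append("<h1>").append(escape(folder.name)).append("</h1>");
        for (String subject : folder.subjects) {
            out.append("<div class=\"email\">").append(escape(subject)).append("</div>");
        }
        for (MailFolder child : folder.children) {
            render(child, out); // subfolders nest inside their parent's div
        }
        out.append("</div>");
    }

    // Minimal escaping for text content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A linear stream of emails would be the same loop without the nested folder 
divs; which of the two is preferable depends on whether consumers need the 
folder path of each mail.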



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-04-02 Thread Richard Johnson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015043#comment-13015043
 ] 

Richard Johnson commented on TIKA-623:
--

Hey Guys,

I've just uploaded a new version with some cleanups, bug fixes and most 
importantly a new License.

Kind Regards,

Richard



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-04-02 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015045#comment-13015045
 ] 

Nick Burch commented on TIKA-623:
-

Great news Richard. 

Are you happy to start the process of getting the new release into Maven 
Central? The process should be largely the same as Ken did with TIKA-462, and 
Sonatype seem to have a very handy walkthrough of the process at 
https://docs.sonatype.org/display/Repository/Sonatype+OSS+Maven+Repository+Usage+Guide



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-04-02 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015046#comment-13015046
 ] 

Tran Nam Quang commented on TIKA-623:
-

Cool! I'll start writing the Tika parser as soon as I can. Could take a couple 
of days though.

Richard, I have one question regarding the API: PSTMessage has two methods, 
getDescriptorNodeId() and getInternetMessageId(). Both return identifiers, 
apparently. My question is: Which one is a unique identifier that will never, 
ever change? Because I wouldn't want the Tika parser to extract identifiers that 
are internal-only and not unique.



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-30 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013645#comment-13013645
 ] 

Tran Nam Quang commented on TIKA-623:
-

I contacted the library author, and he agreed to dual-license the library as 
LGPL/Apache. I hope this clears up the licensing issues.

As for the Tika parser, I won't be able to implement that before Saturday or 
Sunday (assuming I'm still supposed to).



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-30 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013657#comment-13013657
 ] 

Nick Burch commented on TIKA-623:
-

The re-license is great news! There are two steps needed then:
* Get a version of libpst into Maven Central (so we can include it as a 
dependency)
* Write a Parser which uses libpst, likely one that does all the metadata bits 
and delegates to other parsers for the message body + attachments

For the former, see something like TIKA-407 for a guide. For the latter, I'd 
suggest cribbing off something like PackageParser and the Outlook Parser



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-30 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013673#comment-13013673
 ] 

Tran Nam Quang commented on TIKA-623:
-

I have zero experience with Maven, so I don't think I'm the right person to 
take care of the Maven upload.

I might be able to handle the Parser, although it'll probably have to wait 
until the library author makes a new relicensed release available.



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012730#comment-13012730
 ] 

Nick Burch commented on TIKA-623:
-

If it's LGPL, then we can't include it in Tika as standard.

However, it is possible to have the parser dynamically loaded if a user chooses 
to download the parser and its dependent files (if the license works for them).

If you're interested in PST support, then I'd suggest you try to put together a 
basic parser using java-libpst. If you do get it working, please list it on the wiki:
   http://wiki.apache.org/tika/3rd%20party%20parser%20plugins

If you need help with developing the plugin, please ask on the dev list. You 
might also be interested in looking at the relatively small patch that was all 
that was required to enable JTNEF (GPL) to be used as a Tika plugin:
   
https://github.com/jukka/jtnef/commit/a9a51982165101c0bdda4cb5266d7f8958c271ef
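The dynamic loading described in this comment can be illustrated with the standard Java service-provider mechanism. The following is only a minimal sketch under stated assumptions: `DocumentParser`, `OptionalParsers`, and the MIME type string are hypothetical stand-ins chosen for illustration, not Tika's actual Parser API, and Tika's own discovery code may differ in detail. The point is that a separately downloaded jar can register an implementation in a `META-INF/services` file and be picked up at runtime without the core project shipping it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for a parser interface; NOT Tika's real Parser API.
interface DocumentParser {
    boolean supports(String mimeType);
}

class OptionalParsers {

    /**
     * Collects every DocumentParser implementation that a jar on the classpath
     * registers in META-INF/services/DocumentParser.
     * If no such jar has been added, the returned list is simply empty.
     */
    static List<DocumentParser> discover() {
        List<DocumentParser> found = new ArrayList<>();
        for (DocumentParser parser : ServiceLoader.load(DocumentParser.class)) {
            found.add(parser);
        }
        return found;
    }

    /** Returns the first parser that claims the given type, or null if none does. */
    static DocumentParser findFor(String mimeType, List<DocumentParser> parsers) {
        for (DocumentParser parser : parsers) {
            if (parser.supports(mimeType)) {
                return parser;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<DocumentParser> parsers = discover();
        // The MIME type below is an illustrative assumption.
        DocumentParser pst = findFor("application/vnd.ms-outlook-pst", parsers);
        System.out.println(pst == null ? "no PST parser installed" : "PST parser available");
    }
}
```

With this arrangement the user decides whether the LGPL-licensed jar is present; the core code only needs to handle the case where nothing is found.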



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-29 Thread Tran Nam Quang (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012744#comment-13012744
 ] 

Tran Nam Quang commented on TIKA-623:
-

What license is required for inclusion in Tika, other than the Apache License 
2.0? I could ask the author to change the license or switch to dual-licensing...

The basic parser is already listed as an example on the front page of the 
java-libpst website, by the way.



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2011-03-29 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012746#comment-13012746
 ] 

Uwe Schindler commented on TIKA-623:
-

From looking at the code of this library, it looks like it needs some 
improvements/fixes:
- It catches all exceptions and, instead of simply wrapping and rethrowing them 
or declaring the checked exceptions on the methods, prints the stack trace to 
System.out. Messages are also printed to System.out.
- The RTF compression decoder uses new String(byte[]) without a charset, so the 
result depends on the platform default charset! Other places do this, too. This 
is broken, as the file format should define the charset.
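Both points above can be sketched in a few lines of plain Java. This is not java-libpst code: `PstDecodingSketch`, `PstReadException`, `decode`, and `readAll` are hypothetical names, and ISO-8859-1 is used only as an example charset. The real charset would have to come from whatever the file format specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical names throughout; this is an illustration, not java-libpst code.
class PstDecodingSketch {

    /** Hypothetical checked exception that keeps the original cause attached. */
    static class PstReadException extends Exception {
        PstReadException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Decodes bytes with an explicitly supplied charset, never the platform default. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    /** Reads the whole stream; a failure is wrapped and rethrown rather than printed. */
    static byte[] readAll(InputStream in) throws PstReadException {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // The caller decides how to report this; nothing goes to System.out here.
            throw new PstReadException("Could not read PST data", e);
        }
    }

    public static void main(String[] args) throws PstReadException {
        byte[] raw = readAll(new ByteArrayInputStream(new byte[] {'c', 'a', 'f', (byte) 0xE9}));
        // Example charset only; a real decoder would take it from the file format.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset explicitly makes the decoded text identical on every machine, and keeping the cause on the wrapped exception preserves the original stack trace for whoever handles it.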

