[jira] [Updated] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-623:
--

Fix Version/s: 1.6

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920692#comment-13920692
 ] 

Hong-Thai Nguyen edited comment on TIKA-623 at 3/5/14 9:30 AM:
---

java-libpst-0.7 has been uploaded to the OSS Sonatype Nexus repository: 
https://issues.sonatype.org/browse/OSSRH-8965
If there's no objection, I'll refactor the attached parser to produce output like this:
{code}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="271360" />
<meta name="isValid" content="true" />
<meta name="Content-Type" content="application/vnd.ms-outlook" />
<title></title>
</head>
<body>
<div class="email-folder">
<h1>Début du fichier de données Outlook</h1>
<div class="email-entry">
<h1>&lt;530d9cac.5080...@gmail.com&gt;</h1>
<meta subject="Re: Feature Generators" />
<meta internetMessageId="&lt;530d9cac.5080...@gmail.com&gt;" />
<meta descriptorNodeId="2097188" />
<meta lastModificationTime="1393418263291" />
<meta senderName="Jörn Kottmann" />
<meta senderEmailAddress="kottm...@gmail.com" />
<meta recipients="No recipients table!" />
<p>mail content</p>
</div>
<div class="email-folder">
<h1>Éléments supprimés</h1>
</div>
</div>
<div class="email-folder">
<h1>Racine (pour la recherche)</h1>
</div>
<div class="email-folder">
<h1>SPAM Search Folder 2</h1>
</div>
</body>
</html>
{code}
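
The nested layout proposed above (one "email-folder" div per folder, with "email-entry" divs and sub-folders inside it) could be produced by a simple recursive walk. The following is only a sketch of that idea: the Folder class, its fields, and the render method are hypothetical stand-ins invented for illustration, not java-libpst or Tika types, and the attached parser may well do this differently (for example through SAX events rather than string building).

```java
import java.util.ArrayList;
import java.util.List;

public class FolderXhtmlSketch {

    /** Hypothetical stand-in for a PST folder; not a java-libpst type. */
    public static class Folder {
        public final String name;
        public final List<String> messageIds = new ArrayList<>();
        public final List<Folder> subFolders = new ArrayList<>();

        public Folder(String name) {
            this.name = name;
        }
    }

    /** Escapes the three characters that matter for XHTML text content. */
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one folder div, its message entries, then its sub-folders. */
    public static void render(Folder folder, StringBuilder out) {
        out.append("<div class=\"email-folder\">");
        out.append("<h1>").append(escape(folder.name)).append("</h1>");
        for (String id : folder.messageIds) {
            out.append("<div class=\"email-entry\"><h1>")
               .append(escape(id))
               .append("</h1></div>");
        }
        for (Folder sub : folder.subFolders) {
            render(sub, out); // recursion mirrors the folder tree
        }
        out.append("</div>");
    }

    /** Convenience wrapper returning the markup as a string. */
    public static String render(Folder folder) {
        StringBuilder out = new StringBuilder();
        render(folder, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Folder root = new Folder("Inbox");
        root.messageIds.add("<a@b>");
        root.subFolders.add(new Folder("Deleted"));
        System.out.println(render(root));
    }
}
```

Because each folder closes its own div only after its children are written, the output stays well-formed however deep the folder tree goes.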





[jira] [Assigned] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-623:
-

Assignee: Hong-Thai Nguyen



[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-05 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920698#comment-13920698
 ] 

Andrew Jackson commented on TIKA-1232:
--

Does anyone have a copy of Acrobat 9.1? That version uses Adobe Extension Level 
5, so we'd need that to get the full set of recent versions. I'll have a dig 
around for suitable files for the versions that aren't covered yet, but most of 
the stuff I have access to is not re-licensable.

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch


 I'd like to identify the PDF version of files. This is not currently reported 
 by the PDFParser, although the information is available via PDFBox. I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.
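
As a rough, library-free illustration of where a PDF's version comes from: a PDF file opens with a header line of the form "%PDF-M.N". The sketch below reads only that header. It does not use PDFBox and is not the attached patch; it also ignores anything stored in the document catalog (such as a version override or the Adobe extension level), so treat it as an assumption-laden simplification. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfHeaderVersion {

    // Matches the version digits in a header such as "%PDF-1.7".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /**
     * Returns the header version (for example "1.4"), or null when no
     * header is found. Only the first 1024 bytes are searched; that
     * window is a lenient choice made for this sketch.
     */
    public static String headerVersion(byte[] data) {
        int limit = Math.min(data.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary data is safe to scan.
        String head = new String(data, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n1 0 obj\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(headerVersion(sample));
    }
}
```

A real parser would go further than this, which is presumably why the issue relies on the information PDFBox already exposes.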





[jira] [Resolved] (TIKA-623) Add support for Outlook PST

2014-03-05 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-623.
---

Resolution: Fixed

Committed in r1574411



[jira] [Commented] (TIKA-1256) Windows 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.

2014-03-05 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920723#comment-13920723
 ] 

Nick Burch commented on TIKA-1256:
--

For container-based formats (like OOXML), you need to use Tika Core and Tika 
Parsers (plus their dependencies) for accurate detection. Mime magic alone 
isn't enough to identify which file type (e.g. .xlsx) is inside the container; 
we need either a hint from the filename or the parsers jar, which provides the 
container detectors.
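
To illustrate why magic bytes alone fall short here: every OOXML file is a ZIP archive, so .xlsx, .docx and .pptx all start with the same bytes. Telling them apart means opening the container. The sketch below does that with the JDK's own ZIP classes by reading [Content_Types].xml. It is not Tika's detector and makes no claim about how Tika implements this; the class name, method names, and the choice to return "application/x-tika-ooxml" as the generic fallback (mirroring the value in the report) are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Illustrative only: the idea behind container-aware detection, not Tika code. */
public class OoxmlSniffer {

    public static final String GENERIC_OOXML = "application/x-tika-ooxml";
    public static final String XLSX =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /** Magic-only check: any ZIP, OOXML or not, starts with 'P' 'K' 3 4. */
    public static boolean looksLikeZip(byte[] data) {
        return data.length >= 4 && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    /** Container-aware check: look inside the ZIP for [Content_Types].xml. */
    public static String detect(byte[] data) {
        if (!looksLikeZip(data)) {
            return "application/octet-stream";
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("[Content_Types].xml")) {
                    String types = new String(readAll(zip), StandardCharsets.UTF_8);
                    // The workbook part's content type identifies a spreadsheet.
                    return types.contains("spreadsheetml.sheet.main+xml")
                            ? XLSX : GENERIC_OOXML;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "application/zip";
    }

    private static byte[] readAll(ZipInputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a tiny in-memory ZIP with one entry, for demonstration. */
    public static byte[] zipWith(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String types = "<Types><Override PartName=\"/xl/workbook.xml\" ContentType=\""
                + XLSX + ".main+xml\"/></Types>";
        byte[] workbook = zipWith("[Content_Types].xml", types);
        System.out.println(looksLikeZip(workbook)); // the magic check cannot say more than "ZIP"
        System.out.println(detect(workbook));
    }
}
```

The magic-only check gives the same answer for any ZIP-based file, which is consistent with the generic type seen when the container detectors are missing from the classpath.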

 Windows 07  excel .xlsx file Tika 1.4 api is detecting wrong mimetype. 
 -

 Key: TIKA-1256
 URL: https://issues.apache.org/jira/browse/TIKA-1256
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Kavitha

 I am using the Tika 1.4 jars in a standalone project. 
 When I run it from Eclipse, Tika detects the correct mimetype. 
 When I build a jar file from my project and run it from the 
 command prompt, it detects the wrong mimetype.
 Here is my code:
 Parser parser = new AutoDetectParser();
 InputStream stream = new FileInputStream(file);
 int writeUnlimited = -1; // -1 disables the write limit
 ContentHandler contentHandler = new BodyContentHandler(writeUnlimited);
 Metadata metadata = new Metadata();
 parser.parse(stream, contentHandler, metadata, new ParseContext());
 mimeType = metadata.get(Metadata.CONTENT_TYPE);
 logger.info("Correct MimeType value for '" + file.getName() + "' file is: " + 
 mimeType);
 Output from Eclipse:
 Correct MimeType value for 'CIQ_83517.xlsx' file is: 
 application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
 Output from the command prompt:
 Correct MimeType value for 'CIQ_83517.xlsx' file is: application/x-tika-ooxml
 I have only Tika 1.4 and its dependent jar files.
 Is this an issue with my code, or with the Tika 1.4 jar?
 I am using Java 1.6.
 Thanks for your help





[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920741#comment-13920741
 ] 

Tim Allison commented on TIKA-1232:
---

That would be great!  Yes, please make sure that your contributions are 
consistent with the Apache License 2.0.  Thank you, 
[~alexandre.madur...@gmail.com] for all of your testing files! 



[jira] [Updated] (TIKA-1256) Windows 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.

2014-03-05 Thread Kavitha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kavitha updated TIKA-1256:
--

Labels: Parser  (was: )



Searching for Tika Jira issues using Lucene

2014-03-05 Thread Michael McCandless
Team,

If you want to search for Tika Jira issues, I just added Tika coverage
into the Lucene dog food server we use for finding Lucene/Solr
issues at http://jirasearch.mikemccandless.com.

I just posted a blog post describing recent changes:

  
http://blog.mikemccandless.com/2014/03/using-lucenes-search-server-to-search.html

Basically I started this as an effort to test Lucene's functionality
in a real application/server (searching for issues), and to eat our
own dog food, but then over time I think it's proven quite useful
and I now use it almost exclusively when I need to find a Lucene issue.

Compared to Jira's built-in search, it behaves more like full-text search; e.g.,
it makes suggestions as you type, produces snippets and highlights, ranks
by blended relevance+recency, etc.  It has facets so you can quickly
drill down/sideways by various metadata.  In the results, you can
click on a snippet to go straight to the specific comment and issue
that it came from.

It uses Lucene's near-real-time indexing + searching, so issue updates
should be visible within ~ 30 seconds or so.

I hope you find it useful too!

Mike McCandless

http://blog.mikemccandless.com


Re: Searching for Tika Jira issues using Lucene

2014-03-05 Thread Oleg Tikhonov
Hi Mike!
Sounds great! Thanks.

Oleg

