[jira] [Updated] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-623:
---
Fix Version/s: 1.6

Add support for Outlook PST
---
Key: TIKA-623
URL: https://issues.apache.org/jira/browse/TIKA-623
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Tran Nam Quang
Fix For: 1.6
Attachments: OutlookPSTParser.java

Hello everyone,

As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/

I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika.

Best regards
Tran Nam Quang

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920692#comment-13920692 ]

Hong-Thai Nguyen edited comment on TIKA-623 at 3/5/14 9:30 AM:
---
java-libpst-0.7 has been uploaded to the OSS Sonatype Nexus repository: https://issues.sonatype.org/browse/OSSRH-8965

If there's no objection, I'll refactor the attached parser so that it produces output like this:

{code}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="Content-Length" content="271360" />
  <meta name="isValid" content="true" />
  <meta name="Content-Type" content="application/vnd.ms-outlook" />
  <title></title>
</head>
<body>
  <div class="email-folder">
    <h1>Début du fichier de données Outlook</h1>
    <div class="email-entry">
      <h1>&lt;530d9cac.5080...@gmail.com&gt;</h1>
      <meta subject="Re: Feature Generators" />
      <meta internetMessageId="&lt;530d9cac.5080...@gmail.com&gt;" />
      <meta descriptorNodeId="2097188" />
      <meta lastModificationTime="1393418263291" />
      <meta senderName="Jörn Kottmann" />
      <meta senderEmailAddress="kottm...@gmail.com" />
      <meta recipients="No recipients table!" />
      <p>mail content</p>
    </div>
    <div class="email-folder">
      <h1>Éléments supprimés</h1>
    </div>
  </div>
  <div class="email-folder">
    <h1>Racine (pour la recherche)</h1>
  </div>
  <div class="email-folder">
    <h1>SPAM Search Folder 2</h1>
  </div>
</body>
</html>
{code}
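The nesting in the proposed output (folders as `email-folder` divs, messages as `email-entry` divs, sub-folders nested inside their parent) can be illustrated with a short recursive walk. This is only a sketch under stated assumptions: `Folder`, `Message`, `escape` and `render` are hypothetical stand-ins invented for the illustration and are not classes from Tika or java-libpst, and only two of the message fields from the sample are emitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: the model classes below are hypothetical, not Tika or java-libpst types. */
class PstXhtmlSketch {

    /** Hypothetical message model carrying a few of the fields shown in the sample output. */
    static final class Message {
        final String internetMessageId;
        final String subject;
        final String body;

        Message(String internetMessageId, String subject, String body) {
            this.internetMessageId = internetMessageId;
            this.subject = subject;
            this.body = body;
        }
    }

    /** Hypothetical folder model: a name, its messages, and its sub-folders. */
    static final class Folder {
        final String name;
        final List<Message> messages = new ArrayList<>();
        final List<Folder> subFolders = new ArrayList<>();

        Folder(String name) {
            this.name = name;
        }
    }

    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        // '&' must be replaced first so the other entities are not escaped twice.
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Appends one folder div, its message divs, then its sub-folders (recursively). */
    static void render(Folder folder, StringBuilder out) {
        out.append("<div class=\"email-folder\">");
        out.append("<h1>").append(escape(folder.name)).append("</h1>");
        for (Message m : folder.messages) {
            out.append("<div class=\"email-entry\">");
            out.append("<h1>").append(escape(m.internetMessageId)).append("</h1>");
            out.append("<meta subject=\"").append(escape(m.subject)).append("\" />");
            out.append("<p>").append(escape(m.body)).append("</p>");
            out.append("</div>");
        }
        for (Folder child : folder.subFolders) {
            render(child, out);
        }
        out.append("</div>");
    }

    public static void main(String[] args) {
        Folder root = new Folder("Inbox");
        root.messages.add(new Message("<id@example.com>", "Re: Feature Generators", "mail content"));
        root.subFolders.add(new Folder("Deleted Items"));
        StringBuilder out = new StringBuilder();
        render(root, out);
        System.out.println(out);
    }
}
```

A real parser would presumably stream these elements through a SAX content handler rather than build a string; the string builder is used here only to keep the example dependency-free.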
[jira] [Assigned] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen reassigned TIKA-623:
---
Assignee: Hong-Thai Nguyen
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920698#comment-13920698 ]

Andrew Jackson commented on TIKA-1232:
---
Does anyone have a copy of Acrobat 9.1? That version uses Adobe Extension Level 5, so we'd need that to get the full set of recent versions.

I'll have a dig around for suitable files for the versions that aren't covered yet, but most of the stuff I have access to is not re-licensable.

Add PDF version to PDFParser output
---
Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch

I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.

--
This message was sent by Atlassian JIRA (v6.2#6252)
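For readers unfamiliar with where a PDF's version comes from, the base version is declared in the `%PDF-M.N` header comment at the start of the file. The sketch below reads only that header using the Java standard library; it is not the attached patch and does not use PDFBox. The class and method names are hypothetical, and scanning the first 1024 bytes is a lenient assumption of this sketch. It does not read the document catalog, so a `/Version` override or an Adobe extension level (such as the Extension Level 5 mentioned above) would not be reported.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: reads the header version, not catalog overrides or extension levels. */
class PdfHeaderVersionSketch {

    // Matches the header comment, e.g. "%PDF-1.4", and captures "1.4".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Returns the header version such as "1.4", or null if no header is found near the start. */
    static String headerVersion(byte[] data) {
        // Assumption of this sketch: only the first 1024 bytes are searched.
        int len = Math.min(data.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n%binary marker line\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(headerVersion(sample));
    }
}
```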
[jira] [Resolved] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-623.
---
Resolution: Fixed

Committed in r1574411.
[jira] [Commented] (TIKA-1256) Windows 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.
[ https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920723#comment-13920723 ]

Nick Burch commented on TIKA-1256:
---
For container-based formats (like OOXML), you need both Tika Core and Tika Parsers (plus their dependencies) for accurate detection. Mime magic alone isn't enough to identify which file type (e.g. .xlsx) is inside the container; we need either a hint from the filename or the parsers jars, which provide the container detectors.

Windows 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.
---
Key: TIKA-1256
URL: https://issues.apache.org/jira/browse/TIKA-1256
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.4
Reporter: Kavitha

I am using the Tika 1.4 jars in a standalone project. When I run it from Eclipse, Tika 1.4 detects the correct mimetype. When I build a jar file from my project and run it from the command prompt, it detects the wrong mimetype. Here is my code:

    Parser parser = new AutoDetectParser();
    InputStream stream = new FileInputStream(file);
    int writeUnlimited = -1;
    ContentHandler contentHandler = new BodyContentHandler(writeUnlimited);
    Metadata metadata = new Metadata();
    parser.parse(stream, contentHandler, metadata, new ParseContext());
    mimeType = metadata.get(Metadata.CONTENT_TYPE);
    logger.info("Correct MimeType value for '" + file.getName() + "' file is: " + mimeType);

Output from Eclipse:

    Correct MimeType value for 'CIQ_83517.xlsx' file is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Output from the command prompt:

    Correct MimeType value for 'CIQ_83517.xlsx' file is: application/x-tika-ooxml

I have only the Tika 1.4 jar and its dependent jar files. Is this an issue with my code, or does the Tika 1.4 jar have an issue? I am using Java 1.6.

Thanks for your help

--
This message was sent by Atlassian JIRA (v6.2#6252)
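To make the point about container formats concrete: an .xlsx file is a zip archive, so the leading bytes only say "zip", and the specific type has to be read from inside the package. The sketch below, using only `java.util.zip`, looks in `[Content_Types].xml` for a `...main+xml` content type and strips the suffix. It is a simplified illustration and not Tika's container detector; the class name, the helper names and the string-search shortcut (rather than real XML parsing) are all assumptions of this sketch, and it assumes the input is already known to be a zip.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative only: a simplified stand-in for container-aware detection, not Tika code. */
class OoxmlSniffSketch {

    private static final String PREFIX = "application/vnd.openxmlformats-officedocument.";
    private static final String SUFFIX = ".main+xml";

    /** Returns a specific OOXML type if the package declares one, else the generic zip type. */
    static String detect(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"[Content_Types].xml".equals(entry.getName())) {
                    continue;
                }
                String xml = new String(readAll(zip), StandardCharsets.UTF_8);
                // Shortcut: scan for double-quoted attribute values instead of parsing the XML.
                int start = xml.indexOf(PREFIX);
                while (start >= 0) {
                    int end = xml.indexOf('"', start);
                    if (end < 0) {
                        break;
                    }
                    String type = xml.substring(start, end);
                    if (type.endsWith(SUFFIX)) {
                        return type.substring(0, type.length() - SUFFIX.length());
                    }
                    start = xml.indexOf(PREFIX, end);
                }
            }
        }
        return "application/zip";
    }

    /** Reads the remainder of the current zip entry into a byte array. */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java OoxmlSniffSketch.java path/to/file.xlsx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(detect(in));
        }
    }
}
```

The two outputs in the report match this distinction: the Eclipse run reports the specific spreadsheet type, while the command-line run reports only the generic `application/x-tika-ooxml` type.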
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920741#comment-13920741 ]

Tim Allison commented on TIKA-1232:
---
That would be great! Yes, please make sure that your contributions are consistent with the Apache License 2.0.

Thank you, [~alexandre.madur...@gmail.com], for all of your testing files!
[jira] [Updated] (TIKA-1256) Windows 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.
[ https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kavitha updated TIKA-1256:
---
Labels: Parser (was: )
Searching for Tika Jira issues using Lucene
Team,

If you want to search for Tika Jira issues, I just added Tika coverage to the Lucene dog food server we use for finding Lucene/Solr issues at http://jirasearch.mikemccandless.com.

I just posted a blog post describing recent changes: http://blog.mikemccandless.com/2014/03/using-lucenes-search-server-to-search.html

Basically I started this as an effort to test Lucene's functionality in a real application/server (searching for issues), and to eat our own dog food, but over time I think it's proven quite useful, and I now use it almost exclusively when I need to find a Lucene issue.

Compared to Jira's built-in search, it's more full-text-like; e.g., it makes suggestions as you type, produces snippets and highlights, ranks by blended relevance + recency, etc. It has facets so you can quickly drill down/sideways by various metadata. In the results, you can click on a snippet to go straight to the specific comment and issue that it came from.

It uses Lucene's near-real-time indexing + searching, so issue updates should be visible within ~30 seconds or so.

I hope you find it useful too!

Mike McCandless
http://blog.mikemccandless.com
Re: Searching for Tika Jira issues using Lucene
Hi Mike!

Sounds great! Thanks.

Oleg

On Wed, Mar 5, 2014 at 6:47 PM, Michael McCandless luc...@mikemccandless.com wrote:

> Team,
>
> If you want to search for Tika Jira issues, I just added Tika coverage into the Lucene dog food server we use for finding Lucene/Solr issues at http://jirasearch.mikemccandless.com.
>
> I just posted a blog post describing recent changes: http://blog.mikemccandless.com/2014/03/using-lucenes-search-server-to-search.html
>
> Basically I started this as an effort to test Lucene's functionality in a real application/server (searching for issues), and to eat our own dog food, but then over time I think it's proven quite useful and I now use it almost exclusively when I need to find a Lucene issue.
>
> Compared to Jira's builtin search, it's more full text like; e.g., makes suggestions as you type, produces snippets and highlights, ranks by blended relevance+recency, etc. It has facets so you can quickly drill down/sideways by various metadata. In the results, you can click on a snippet to go straight to the specific comment and issue that it came from.
>
> It uses Lucene's near-real-time indexing + searching, so issue updates should be visible within ~ 30 seconds or so.
>
> I hope you find it useful too!
>
> Mike McCandless
> http://blog.mikemccandless.com