[jira] [Updated] (TIKA-1256) MS Office 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.
[ https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kavitha updated TIKA-1256:
--------------------------
    Summary: MS Office 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.  (was: Windows 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.)

MS Office 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.
-----------------------------------------------------------------------
                Key: TIKA-1256
                URL: https://issues.apache.org/jira/browse/TIKA-1256
            Project: Tika
         Issue Type: Bug
         Components: mime
   Affects Versions: 1.4
           Reporter: Kavitha
             Labels: Parser

I am using the Tika 1.4 jars in a standalone project. When I run the project from Eclipse, Tika 1.4 detects the correct mimetype. When I build a jar file from my project and run it from the command prompt, it detects the wrong mimetype.

I am attaching my code:

    Parser parser = new AutoDetectParser();
    InputStream stream = new FileInputStream(file);
    int writeUnlimited = -1;
    ContentHandler contentHandler = new BodyContentHandler(writeUnlimited);
    Metadata metadata = new Metadata();
    parser.parse(stream, contentHandler, metadata, new ParseContext());
    mimeType = metadata.get(Metadata.CONTENT_TYPE);
    logger.info("Correct MimeType value for '" + file.getName() + "' file is: " + mimeType);

Output from Eclipse:

    Correct MimeType value for 'CIQ_83517.xlsx' file is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Output from the command prompt:

    Correct MimeType value for 'CIQ_83517.xlsx' file is: application/x-tika-ooxml

I have only Tika 1.4 and its dependent jar files. Is this an issue with my code, or does the Tika 1.4 jar have an issue? I am using Java 1.6.

Thanks for your help

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Resolved] (TIKA-1256) MS Office 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.
[ https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1256.
------------------------------
    Resolution: Not A Problem

Resolving as it's caused by missing jars, and http://tika.apache.org/1.5/detection.html#Container_Aware_Detection has now been updated to make it clearer what's needed.

MS Office 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.
-----------------------------------------------------------------------
                Key: TIKA-1256
                URL: https://issues.apache.org/jira/browse/TIKA-1256
            Project: Tika
         Issue Type: Bug
         Components: mime
   Affects Versions: 1.4
           Reporter: Kavitha
             Labels: Parser
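One way to check for the "missing jars" situation described in the resolution is to list what the runtime classpath actually exposes. The sketch below is only a diagnostic idea, not something taken from the issue: it assumes that Tika discovers its detectors and parsers through the standard `META-INF/services` mechanism, under the service names `org.apache.tika.detect.Detector` and `org.apache.tika.parser.Parser`. The class name `ServiceFileCheck` and the helper `findServiceFiles` are hypothetical names chosen for illustration. It uses only the JDK, so it can be run with exactly the classpath used at the command prompt.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic helper, not part of Tika.
public class ServiceFileCheck {

    /** Returns every classpath copy of META-INF/services/<serviceName>. */
    static List<URL> findServiceFiles(ClassLoader loader, String serviceName)
            throws IOException {
        List<URL> found = new ArrayList<URL>();
        Enumeration<URL> urls =
                loader.getResources("META-INF/services/" + serviceName);
        while (urls.hasMoreElements()) {
            found.add(urls.nextElement());
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ServiceFileCheck.class.getClassLoader();
        // Assumed service names; adjust if your Tika version differs.
        String[] services = {
            "org.apache.tika.detect.Detector",
            "org.apache.tika.parser.Parser"
        };
        for (String service : services) {
            List<URL> files = findServiceFiles(loader, service);
            System.out.println(service + ": " + files.size() + " service file(s)");
            for (URL url : files) {
                System.out.println("  " + url);
            }
        }
    }
}
```

If the Eclipse run lists a service file for the detector and the command-prompt run lists none, the packaged jar or its classpath is probably missing the parsers jar or its service files. That would be consistent with the generic `application/x-tika-ooxml` result, but it is an inference rather than something the issue confirms.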
[jira] [Commented] (TIKA-623) Add support for Outlook PST
[ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923703#comment-13923703 ]

Hong-Thai Nguyen commented on TIKA-623:
---------------------------------------
[~lfcnassif], binary attached is handled with embeddedExtractor. BTW, I agree that we can split each mail to a separate unit.

[~talli...@apache.org], we couldn't fix .pst and .msg (msg is already handled as part of OfficeParser), and feel free to finish properly this issue as you can :)

Add support for Outlook PST
---------------------------
                Key: TIKA-623
                URL: https://issues.apache.org/jira/browse/TIKA-623
            Project: Tika
         Issue Type: New Feature
         Components: parser
           Reporter: Tran Nam Quang
           Assignee: Hong-Thai Nguyen
            Fix For: 1.6
        Attachments: OutlookPSTParser.java

Hello everyone,

As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/

I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika.

Best regards
Tran Nam Quang
[jira] [Updated] (TIKA-1258) Update NetCDF dependency
[ https://issues.apache.org/jira/browse/TIKA-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov updated TIKA-1258:
------------------------------------
    Attachment: update-netcdf-to-4.2.20.patch

Update NetCDF dependency
------------------------
                Key: TIKA-1258
                URL: https://issues.apache.org/jira/browse/TIKA-1258
            Project: Tika
         Issue Type: Improvement
         Components: general, packaging
   Affects Versions: 1.5
        Environment: N/A
           Reporter: Konstantin Gribov
             Labels: maven, netcdf
            Fix For: 1.6
        Attachments: update-netcdf-to-4.2.20.patch

Now tika-parsers depends on edu.ucar:netcdf:4.2-min. Current netcdf version is 4.2.20 and we can update it in trunk.

All tests pass successfully with netcdf updated to 4.2.20 and slf4j-log4j12 to 1.6.1 (since netcdf updated slf4j dependency to 1.6.1).

Patch is attached.
Re: Unconsistent logging in current tika (1.5)
Hello, Nick.

I'll answer in reverse order.

> Can you open a jira for that upgrade? If you can also try it locally, and
> report on the jira if all the unit tests still pass, that'd be a help!

Tested on a local build and opened a ticket for it: https://issues.apache.org/jira/browse/TIKA-1258.

If we don't want to add a logging API to tika-core, then nothing in tika-core should change. We should also drop log4j.properties from tika-core/src/test/resources, because we don't use log4j there. Libraries shouldn't have any logging *setup*, because it can affect the applications that use the library. Except for test dependencies.

Would you accept a patch that brings all logging in tika-parsers, tika-app and tika-server to a consistent system based on slf4j (which is present in each of these modules due to dependencies)? Would you accept it if it excludes jcl (commons-logging) and brings in the jcl-to-slf4j bridge?

I want to improve Tika so that other developers using it have less of a headache configuring Tika dependencies.

--
Best regards,
Konstantin Gribov.

2014-03-07 9:03 GMT+04:00 Nick Burch apa...@gagravarr.org:

> On Fri, 7 Mar 2014, Konstantin Gribov wrote:
>> Tika-core is quite pure (uses only java.util.logging) but tika-parsers
>> uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through
>> netcdf) and log4j 1.2.14 (through slf4j-log4j as test scope dependency).
>> Also some parsers (like pdfbox) log just to stdout/stderr.
>
> I think part of the issue is that many of the libraries that Tika depends
> on have their own chosen logging library / setup. IIRC, the Tika parsers
> often log in a similar manner to the underlying library they use.
>
> That's not to say that we can't tidy things up a bit, but it does restrict
> how much we can do where log messages come from underlying libraries.
>
>> It's confusing. Tika-core uses only JUL.
>
> Tika-Core ideally shouldn't have any external dependencies, so I'm not
> sure what else it can use while maintaining that?
>
>> Tika-parsers uses JCL and log4j (in tests) and depends on slf4j-api.
>> Tika-app uses JCL, configures log4j at runtime (to change verbosity
>> level) and depends on slf4j-log4j12. Tika-server uses only JCL but
>> depends on slf4j-api 1.7.5 (through cxf).
>
> Potentially some of these could be rationalised, though maybe the best we
> can hope for is to ensure they only use whatever their underlying
> dependencies use.
>
>> By the way, I think we also should update edu.ucar:netcdf to 4.2.20 that
>> depends on newer slf4j-api 1.6.1.
>
> Can you open a jira for that upgrade? If you can also try it locally, and
> report on the jira if all the unit tests still pass, that'd be a help!
>
> Thanks
> Nick
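To see which of the logging libraries mentioned in this thread actually end up on an application's classpath, a small probe like the one below may help. It is a rough sketch using only the JDK and is not part of Tika. The class name `LoggingClasspathCheck` and its methods are hypothetical, and the probed class names (`org.apache.commons.logging.LogFactory`, `org.slf4j.LoggerFactory`, `org.apache.log4j.Logger`) are the usual entry points of commons-logging, slf4j-api and log4j 1.x.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper, not part of Tika.
public class LoggingClasspathCheck {

    /** True if the named class can be found without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false,
                    LoggingClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // Class file exists but cannot be linked; treat as unusable.
            return false;
        }
    }

    /** Maps each logging library discussed in the thread to its presence. */
    static Map<String, Boolean> detect() {
        Map<String, Boolean> result = new LinkedHashMap<String, Boolean>();
        result.put("java.util.logging (JUL)",
                isPresent("java.util.logging.Logger"));
        result.put("commons-logging (JCL)",
                isPresent("org.apache.commons.logging.LogFactory"));
        result.put("slf4j-api",
                isPresent("org.slf4j.LoggerFactory"));
        result.put("log4j 1.x",
                isPresent("org.apache.log4j.Logger"));
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Boolean> entry : detect().entrySet()) {
            System.out.println(entry.getKey() + ": "
                    + (entry.getValue() ? "present" : "absent"));
        }
    }
}
```

Run with the same classpath as tika-app or tika-server, this only shows which APIs are visible. It does not show which backend slf4j or JCL will bind to at runtime.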