[jira] [Updated] (TIKA-1256) MS Office 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.

2014-03-07 Thread Kavitha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kavitha updated TIKA-1256:
--

Summary: MS Office 07  excel .xlsx file Tika 1.4 api is detecting wrong 
mimetype.   (was: Windows 07  excel .xlsx file Tika 1.4 api is detecting 
wrong mimetype. )

 MS Office 07  excel .xlsx file Tika 1.4 api is detecting wrong mimetype. 
 ---

 Key: TIKA-1256
 URL: https://issues.apache.org/jira/browse/TIKA-1256
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Kavitha
  Labels: Parser

 I am using the Tika 1.4 jars in a standalone project. 
 When I run it from Eclipse, Tika 1.4 detects the correct mimetype. 
 When I build a jar file from my project and run the standalone project from 
 the command prompt, it detects the wrong mimetype.
 Here is my code: 
 Parser parser = new AutoDetectParser();
 InputStream stream = new FileInputStream(file);
 int writeUnlimited = -1;
 ContentHandler contentHandler = new BodyContentHandler(writeUnlimited);
 Metadata metadata = new Metadata();
 parser.parse(stream, contentHandler, metadata, new ParseContext());
 mimeType = metadata.get(Metadata.CONTENT_TYPE);
 logger.info("Correct MimeType value for '" + file.getName() + "' file is: " + 
 mimeType);
 Output from Eclipse:
 Correct MimeType value for 'CIQ_83517.xlsx' file is: 
 application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
 Output from the command prompt:
 Correct MimeType value for 'CIQ_83517.xlsx' file is: application/x-tika-ooxml
 I have only the Tika 1.4 jar and its dependent jar files.
 Is this an issue with my code, or does the Tika 1.4 jar have a problem?
 I am using Java 1.6.
 Thanks for your help.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1256) MS Office 07 excel .xlsx file Tika 1.4 api is detecting wrong mimetype.

2014-03-07 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1256.
--

Resolution: Not A Problem

Resolving as it's caused by missing jars, and 
http://tika.apache.org/1.5/detection.html#Container_Aware_Detection has now 
been updated to make it clearer what's needed
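
For anyone hitting the same symptom, one quick way to check whether a packaged 
application is missing jars is to probe the runtime classpath for one class 
from each jar that detection relies on. The sketch below is only a diagnostic 
aid and not part of Tika. ClasspathProbe is a hypothetical helper name, and the 
three class names are assumptions chosen as representatives of tika-core, 
tika-parsers and poi-ooxml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reports which classes are loadable from the classpath. */
final class ClasspathProbe {

    /** Returns true if the named class can be located (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but one of its own dependencies is missing.
            return false;
        }
    }

    /** Probes each name in order and records whether it was found. */
    static Map<String, Boolean> probe(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<String, Boolean>();
        for (String name : classNames) {
            result.put(name, Boolean.valueOf(isOnClasspath(name)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed representative classes: tika-core, tika-parsers, poi-ooxml.
        Map<String, Boolean> result = probe(
                "org.apache.tika.Tika",
                "org.apache.tika.parser.pkg.ZipContainerDetector",
                "org.apache.poi.openxml4j.opc.OPCPackage");
        for (Map.Entry<String, Boolean> entry : result.entrySet()) {
            String status = entry.getValue().booleanValue() ? "found   " : "MISSING ";
            System.out.println(status + entry.getKey());
        }
    }
}
```

Running this once from Eclipse and once from the command prompt, then comparing 
the two outputs, should show which jars the standalone build leaves out.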



[jira] [Commented] (TIKA-623) Add support for Outlook PST

2014-03-07 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923703#comment-13923703
 ] 

Hong-Thai Nguyen commented on TIKA-623:
---

[~lfcnassif], binary attachments are handled with embeddedExtractor. BTW, I agree 
that we can split each mail into a separate unit.
[~talli...@apache.org], we couldn't fix .pst and .msg (msg is already handled 
as part of OfficeParser), and feel free to finish this issue properly as you 
can :)

 Add support for Outlook PST
 ---

 Key: TIKA-623
 URL: https://issues.apache.org/jira/browse/TIKA-623
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tran Nam Quang
Assignee: Hong-Thai Nguyen
 Fix For: 1.6

 Attachments: OutlookPSTParser.java


 Hello everyone,
 As you might know, Outlook stores its mails and other stuff in a single PST 
 file. There's a relatively new Java library called java-libpst for reading 
 Outlook PST files. It is licensed under the LGPL and available over here: 
 http://code.google.com/p/java-libpst/
 I have tested the library on Outlook 2000 and Outlook 2003, with good 
 results. It would be great if the library could be integrated into Tika.
 Best regards
 Tran Nam Quang





[jira] [Updated] (TIKA-1258) Update NetCDF dependency

2014-03-07 Thread Konstantin Gribov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov updated TIKA-1258:


Attachment: update-netcdf-to-4.2.20.patch

 Update NetCDF dependency
 

 Key: TIKA-1258
 URL: https://issues.apache.org/jira/browse/TIKA-1258
 Project: Tika
  Issue Type: Improvement
  Components: general, packaging
Affects Versions: 1.5
 Environment: N/A
Reporter: Konstantin Gribov
  Labels: maven, netcdf
 Fix For: 1.6

 Attachments: update-netcdf-to-4.2.20.patch


 tika-parsers currently depends on edu.ucar:netcdf:4.2-min.
 The current netcdf version is 4.2.20, and we can update to it in trunk. All tests 
 pass successfully with netcdf updated to 4.2.20 and slf4j-log4j12 to 1.6.1 
 (since netcdf updated its slf4j dependency to 1.6.1).
 The patch is attached.





Re: Unconsistent logging in current tika (1.5)

2014-03-07 Thread Konstantin Gribov
Hello, Nick.

I'll answer in reverse order.

 Can you open a jira for that upgrade? If you can also try it locally, and
 report on the jira if all the unit tests still pass, that'd be a help!

I tested it on a local build and opened a ticket for it:
https://issues.apache.org/jira/browse/TIKA-1258.

If we don't want to add a logging API to tika-core, then nothing in tika-core
should change. We should also drop log4j.properties from
tika-core/src/test/resources, because tika-core doesn't use log4j.

Libraries shouldn't have any logging *setup*, because it can affect the
applications that use the library. The exception is test dependencies.
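
To illustrate the point about setup, here is a minimal sketch that uses only
java.util.logging (the API tika-core already uses). The library class just
obtains a logger and emits records, and the application alone decides on
handlers and levels. All names in it (LibraryLoggingSketch, QuietLibrary,
CollectingHandler, countWords) are hypothetical and not taken from Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical sketch: the library logs, only the application configures. */
final class LibraryLoggingSketch {

    /** Library side: obtains a logger and emits records, nothing else. */
    static final class QuietLibrary {
        private static final Logger LOG =
                Logger.getLogger(QuietLibrary.class.getName());

        static int countWords(String input) {
            LOG.fine("counting words in " + input.length() + " characters");
            String trimmed = input.trim();
            if (trimmed.isEmpty()) {
                LOG.warning("empty input, returning 0");
                return 0;
            }
            return trimmed.split("\\s+").length;
        }
    }

    /** Application side: a handler that keeps published records in memory. */
    static final class CollectingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<LogRecord>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                records.add(record);
            }
        }

        @Override
        public void flush() {
            // Nothing is buffered, so there is nothing to flush.
        }

        @Override
        public void close() {
            // No resources are held.
        }
    }

    public static void main(String[] args) {
        // The application chooses the handler and the verbosity, not the library.
        Logger libraryLogger = Logger.getLogger(QuietLibrary.class.getName());
        CollectingHandler handler = new CollectingHandler();
        libraryLogger.addHandler(handler);
        libraryLogger.setLevel(Level.FINE);

        int words = QuietLibrary.countWords("tika detects mime types");
        System.out.println(words + " words, "
                + handler.records.size() + " log record(s) captured");
    }
}
```

Because QuietLibrary never touches handlers, levels or properties files, an
application that prefers another backend can route or silence its output
without the library getting in the way.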

Would you accept a patch that brings all logging in tika-parsers, tika-app and
tika-server to a consistent system based on slf4j (which is present in each
of these modules due to dependencies)? Would you accept it if it excludes
jcl (commons-logging) and brings in the jcl-to-slf4j bridge?

I want to improve tika so that other developers using it have fewer
headaches configuring tika dependencies.

-- 
Best regards,
Konstantin Gribov.


2014-03-07 9:03 GMT+04:00 Nick Burch apa...@gagravarr.org:

 On Fri, 7 Mar 2014, Konstantin Gribov wrote:

 Tika-core is quite pure (uses only java.util.logging) but tika-parsers
 uses commons-logging 1.1.1 (through pdfbox), slf4j-api 1.5.6 (through
 netcdf) and log4j 1.2.14 (through slf4j-log4j as test scope dependency).
 Also, some parsers (like pdfbox) log just to stdout/stderr.


 I think part of the issue is that many of the libraries that Tika depends
 on have their own chosen logging library / setup. IIRC, the Tika parsers
 often log in a similar manner to the underlying library they use.

 That's not to say that we can't tidy things up a bit, but it does restrict
 how much we can do where log messages come from underlying libraries


  It's confusing.

 Tika-core uses only JUL.


 Tika-Core ideally shouldn't have any external dependencies, so I'm not sure
 what else it can use while maintaining that?


  Tika-parsers uses JCL and log4j (in tests) and depends on slf4j-api.
 Tika-app uses JCL, configures log4j at runtime (to change verbosity level)
 and depends on slf4j-log4j12.
 Tika-server uses only JCL but depends on slf4j-api 1.7.5 (through cxf).


 Potentially some of these could be rationalised, though maybe the best we
 can hope for is to ensure they only use whatever their underlying
 dependencies use


  By the way, I think we also should update edu.ucar:netcdf to 4.2.20 that
 depends on newer slf4j-api 1.6.1.


 Can you open a jira for that upgrade? If you can also try it locally, and
 report on the jira if all the unit tests still pass, that'd be a help!

 Thanks
 Nick