[jira] [Commented] (TIKA-1292) Inconsistent priorities in bundled tika-mimetypes.xml

2014-05-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008652#comment-14008652
 ] 

Hudson commented on TIKA-1292:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #3 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/3/])
TIKA-1292 If there is more than one mime magic which matches at the highest 
priority, keep track and then try to pick based on filename or type hint later 
(nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596612)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java
* 
/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* 
/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
Set an explicit priority on the OLE2 match, remove two MS Word matches which 
were OLE2 ones in disguise, and add an intermediate staroffice parent on the 
staroffice types. Helps with TIKA-1292 testing (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596611)
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Add a disabled unit test for TIKA-1292, which when working will ensure that if 
we have two matching magics at the same priority, the name is used to 
specialise if possible, first defined if not (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596593)
* 
/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* 
/tika/trunk/tika-core/src/test/resources/org/apache/tika/mime/custom-mimetypes.xml
Container formats with specific, low-false-positive magic matches need a 
slightly higher priority, so that they don't accidentally end up being matched 
based on the contents of the container near the start of the file. Partly 
solves TIKA-1292. This closes #6 github pull request (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596590)
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
Add some notes on entries, to help people maintaining the file know what to do, 
related to TIKA-1292 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596586)
* 
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
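
The tie-breaking rule described in the first commit message above (several 
magics matching at the highest priority, then a choice based on the file name, 
otherwise the first one defined) can be sketched roughly as follows. This is 
only an illustration: MagicTieBreakSketch, Match and pick are hypothetical 
names, not Tika classes, and matching a bare extension is a simplifying 
assumption that stands in for Tika's glob patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these types exist in Tika.
final class MagicTieBreakSketch {

    /** A magic rule that matched the start of the file (simplified model). */
    static final class Match {
        final String type;      // e.g. "application/zip"
        final int priority;     // e.g. 40
        final String extension; // e.g. ".zip", stands in for a glob

        Match(String type, int priority, String extension) {
            this.type = type;
            this.priority = priority;
            this.extension = extension;
        }
    }

    /**
     * Picks among the matches with the highest priority. If several tie,
     * the one whose extension fits the file name wins; if none fits (or no
     * name is known), the first one in definition order is returned.
     */
    static String pick(List<Match> matchesInDefinitionOrder, String fileName) {
        int best = Integer.MIN_VALUE;
        for (Match m : matchesInDefinitionOrder) {
            best = Math.max(best, m.priority);
        }
        List<Match> top = new ArrayList<>();
        for (Match m : matchesInDefinitionOrder) {
            if (m.priority == best) {
                top.add(m);
            }
        }
        if (top.isEmpty()) {
            return "application/octet-stream"; // nothing matched at all
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Match m : top) {
                if (lower.endsWith(m.extension)) {
                    return m.type; // the name specialises the tie
                }
            }
        }
        return top.get(0).type; // fall back to the first defined
    }

    public static void main(String[] args) {
        List<Match> matches = new ArrayList<>();
        matches.add(new Match("text/html", 40, ".html"));
        matches.add(new Match("application/zip", 40, ".zip"));
        System.out.println(pick(matches, "bundle.zip")); // application/zip
        System.out.println(pick(matches, null));         // text/html
    }
}
```

The sketch ignores type hierarchies (for example JAR as a subtype of ZIP), 
which the real detector would also have to consider.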


 Inconsistent priorities in bundled tika-mimetypes.xml
 -

 Key: TIKA-1292
 URL: https://issues.apache.org/jira/browse/TIKA-1292
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.5
Reporter: Cservenak, Tamas
 Fix For: 1.6


 It seems that mime-type priorities are a bit inconsistent in the tika-core 
 bundled tika-mimetypes.xml.
 A few examples:
 * 
 [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
  vs 
 [application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]:
  both are similar container archive formats (structured, having entries) 
 with distinct file extensions (zip vs 7z globs), yet their priorities are 
 40 and 50 respectively.
 * 
 [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
  vs 
 [text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]:
  these are unrelated MIME types that share the same priority of 40. But ZIP 
 files can be stored uncompressed (meaning entries are mostly concatenated, and 
 their content, if plain text, is readable). Hence an uncompressed ZIP (or 
 any subtype like JAR) file that contains HTML files might be detected as 
 HTML, which is wrong. 
 And this is what happens in Nexus, which uses Tika under the hood for content 
 validation, basically using the MIME magic detection provided by Tika Detector: 
 the Java JAR {{com.intellij:annotations:7.0.3}} 
 ([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is 
 being detected as {{text/html}} instead of the expected 
 {{application/java-archive}}.
 The reason is as follows: the JAR file is stored in uncompressed ZIP format, 
 and among a few annotations it also contains one HTML file entry (the license, 
 I guess). Since both MIME types have the same priority (40), I guess Tika 
 randomly chooses {{text/html}}.
 Original Nexus issue:
 https://issues.sonatype.org/browse/NEXUS-6560
 The Nexus issue links a GH pull request that solves the problem for us (by 
 raising the {{application/zip}} priority to 41).
 But by inspecting the bundled tika-mimetypes.xml we spotted other likely 
 priority inconsistencies, like that of zip vs 7z mentioned above.
 Note: this happens when using 

[jira] [Commented] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-05-26 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008704#comment-14008704
 ] 

Hong-Thai Nguyen commented on TIKA-1308:


A virtual file system may be a solution if you're on Java 7. The NIO APIs with 
FileSystemProvider [1] allow you to define or inject a virtual file system 
(e.g. Commons VFS [2]).

[1] 
http://docs.oracle.com/javase/7/docs/api/java/nio/file/spi/FileSystemProvider.html
[2] http://commons.apache.org/proper/commons-vfs/filesystems.html
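
As a small illustration of the FileSystemProvider mechanism from [1], the 
sketch below lists the URI schemes of the providers the running JVM knows 
about. It uses only the Java 7 NIO API; the class and method names are made 
up for this example, and it does not show whether Tika's temp-file handling 
could actually be routed through such a provider.

```java
import java.nio.file.spi.FileSystemProvider;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or Commons VFS.
final class ListFileSystemProviders {

    /** Returns the URI schemes of all file system providers visible to this JVM. */
    static List<String> installedSchemes() {
        List<String> schemes = new ArrayList<>();
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            schemes.add(provider.getScheme());
        }
        return schemes;
    }

    public static void main(String[] args) {
        // The default provider ("file") is always present; others depend on the runtime.
        System.out.println(installedSchemes());
    }
}
```

A custom in-memory provider would show up in this list once it is registered 
with the JVM, and code written against java.nio.file.Path could then address 
it by its scheme.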






 Support in memory parse mode(don't create temp file): to support run Tika in 
 GAE
 

 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: yuanyun.cn
  Labels: gae
 Fix For: 1.6


 I am trying to use Tika in GAE and wrote a simple servlet to extract 
 metadata from a JPEG:
 String urlStr = req.getParameter("imageUrl");
 byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
 ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
 Metadata metadata = new Metadata();
 BodyContentHandler ch = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(bais, ch, metadata, new ParseContext());
 bais.close();
 This fails with an exception:
 Caused by: java.lang.SecurityException: Unable to create temporary file
   at java.io.File.createTempFile(File.java:1986)
   at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 I checked the code: 
 org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
 Metadata, ParseContext) creates a temp file from the input stream.
 I can understand why Tika creates a temp file from the stream: so that it can 
 parse the data multiple times.
 But as GAE and other cloud platforms are getting more popular, would it be 
 possible to avoid creating the temp file? Instead we could copy the original 
 stream into a byte array stream, so Tika can still parse it multiple times.
 -- This would put a limit on the file size, as Tika keeps the whole file in 
 memory, but it would make Tika work in GAE and maybe on other cloud servers.
 We could add a parameter to parser.parse to indicate whether to parse in 
 memory only.
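
 The byte array idea above could look roughly like the following sketch. 
 InMemoryBuffer, readFully and newStream are hypothetical names invented for 
 this example and are not part of Tika; the maxBytes limit is an assumption 
 added to reflect the file size concern mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Tika.
final class InMemoryBuffer {
    private final byte[] data;

    private InMemoryBuffer(byte[] data) {
        this.data = data;
    }

    /** Copies the whole stream into memory, failing if it exceeds maxBytes. */
    static InMemoryBuffer readFully(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("Input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return new InMemoryBuffer(out.toByteArray());
    }

    /** Returns a fresh stream over the buffered bytes; can be called repeatedly. */
    InputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    int size() {
        return data.length;
    }
}
```

 Each call to newStream() starts again at the first byte, so a parser that 
 needs several passes could read the data more than once without a temp file.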
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)