[jira] [Commented] (TIKA-3308) SVG file without xml declaration tag is detected as text/plain

2022-11-17 Thread Tsuyoshi Yoshizawa (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635630#comment-17635630
 ] 

Tsuyoshi Yoshizawa commented on TIKA-3308:
--

I have the same issue.

> Checking for {{http://www.w3.org/2000/svg"}} with a decent 
> priority should be fine, but I'm not sure we'd want to look for just {{

> SVG file without xml declaration tag is detected as text/plain
> --
>
> Key: TIKA-3308
> URL: https://issues.apache.org/jira/browse/TIKA-3308
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.25
> Reporter: Anas Hammani
> Priority: Minor
> Attachments: logo-luma.svg
>
>
> The SVG file attached to the issue is interpreted as *text/plain* by
> {code:java}
> tika.detect(filePath){code}
>  
> If I add 
> {code:java}
>   {code}
> at the beginning of the file, then Tika detects it as "image/svg+xml".
>  
> When I read the documentation, I see that the XML declaration is not 
> necessary for a file to be well-formed:
> [https://www.w3.org/TR/REC-xml/#sec-prolog-dtd]
>  
> It would be great if Tika could detect a file as SVG without the prolog.
>  
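
The check discussed in the quoted comment (looking at the root element and the SVG namespace rather than at the XML declaration) could be sketched roughly as below. This is only an illustration using the JDK's StAX parser; it is not how Tika's MIME detection is implemented, and the class and method names (SvgRootSniffer, looksLikeSvg) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika.
class SvgRootSniffer {
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    /** Returns true if the first element is <svg> in the SVG namespace. */
    static boolean looksLikeSvg(byte[] data) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Avoid loading DTDs or external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(data));
            try {
                while (reader.hasNext()) {
                    // Only the root element matters; the XML declaration is optional.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return "svg".equals(reader.getLocalName())
                                && SVG_NS.equals(reader.getNamespaceURI());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not parseable as XML, so not SVG for the purposes of this sketch.
            return false;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] noProlog = "<svg xmlns=\"http://www.w3.org/2000/svg\"/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeSvg(noProlog));
    }
}
```

Requiring the namespace as well as the element name is the stricter variant; matching on the element name alone would be more permissive and could misclassify unrelated XML.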



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-1717) Tika throws exception on detecting content-type of a zip file

2016-04-12 Thread Tsuyoshi Yoshizawa (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236840#comment-15236840
 ] 

Tsuyoshi Yoshizawa commented on TIKA-1717:
--

Commons Compress 1.11 has already been released and includes the fix for COMPRESS-321.
I hope Tika will also release a new version that uses Commons Compress 1.11.

> Tika throws exception on detecting content-type of a zip file
> -
>
> Key: TIKA-1717
> URL: https://issues.apache.org/jira/browse/TIKA-1717
> Project: Tika
> Issue Type: Bug
> Reporter: Martin Petricek
>
> When trying to detect the content type of a zip file with Tika 1.10 
> like this:
> {code}
> byte[] content = ... // whole zip file.
> String name = "TR_01.ZIP";
> Tika tika = new Tika();
> return tika.detect(content, name);
> {code}
> it throws an exception:
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 13
>   at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199)
>   at org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromCentralDirectoryData(X7875_NewUnix.java:220)
>   at org.apache.commons.compress.archivers.zip.ExtraFieldUtils.parse(ExtraFieldUtils.java:174)
>   at org.apache.commons.compress.archivers.zip.ZipArchiveEntry.setCentralDirectoryExtra(ZipArchiveEntry.java:476)
>   at org.apache.commons.compress.archivers.zip.ZipFile.readCentralDirectoryEntry(ZipFile.java:575)
>   at org.apache.commons.compress.archivers.zip.ZipFile.populateFromCentralDirectory(ZipFile.java:492)
>   at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:216)
>   at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:192)
>   at org.apache.commons.compress.archivers.zip.ZipFile.<init>(ZipFile.java:153)
>   at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:141)
>   at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
>   at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>   at org.apache.tika.Tika.detect(Tika.java:155)
>   at org.apache.tika.Tika.detect(Tika.java:183)
>   at org.apache.tika.Tika.detect(Tika.java:223)
> {code}
> The zip file contains two .jpg images and is not a "special" (JAR, 
> OpenOffice, ...) zip file.
> Unfortunately, the contents of the zip file are confidential, so I cannot 
> attach it to this ticket as it is, although I can provide the parameters 
> supplied to
> org.apache.commons.compress.archivers.zip.X7875_NewUnix.parseFromLocalFileData(X7875_NewUnix.java:199)
> as caught by the debugger:
> {code}
> data = {byte[13]@2103}
>  0 = 85
>  1 = 84
>  2 = 5
>  3 = 0
>  4 = 7
>  5 = -112
>  6 = -108
>  7 = 51
>  8 = 85
>  9 = 117
>  10 = 120
>  11 = 0
>  12 = 0
> offset = 13
> length = 0
> {code}
> ... it seems the method tries to read more bytes than are actually available 
> in the buffer.
> Note that 7zip and unzip can both extract the file without even a warning, 
> so it does not seem to be a corrupted file.
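
The debugger dump quoted above can be decoded by hand. A ZIP extra field is a sequence of blocks, each starting with a 2-byte little-endian header ID followed by a 2-byte little-endian data size. The sketch below walks the 13 bytes from the ticket and shows that the 0x7875 (Info-ZIP New Unix) block has a data size of 0 and starts at offset 13, which matches the reported "offset = 13, length = 0". The class and method names (ExtraFieldWalker, walk) are hypothetical and are not Commons Compress API; this only illustrates the bounds check, not the actual fix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Commons Compress.
class ExtraFieldWalker {
    /** Header ID plus the offset and length of one block's payload. */
    static final class Block {
        final int headerId;
        final int offset;
        final int length;

        Block(int headerId, int offset, int length) {
            this.headerId = headerId;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Reads an unsigned 16-bit little-endian value. */
    static int u16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    /** Splits an extra field into blocks, rejecting payloads that overrun the buffer. */
    static List<Block> walk(byte[] data) {
        List<Block> blocks = new ArrayList<>();
        int pos = 0;
        // Fewer than 4 trailing bytes cannot form a header and are ignored here.
        while (pos + 4 <= data.length) {
            int id = u16(data, pos);
            int len = u16(data, pos + 2);
            int start = pos + 4;
            if (start + len > data.length) {
                throw new IllegalArgumentException(
                        "block 0x" + Integer.toHexString(id) + " overruns the buffer");
            }
            blocks.add(new Block(id, start, len));
            pos = start + len;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The 13 bytes reported in the ticket.
        byte[] data = {85, 84, 5, 0, 7, -112, -108, 51, 85, 117, 120, 0, 0};
        for (Block b : walk(data)) {
            System.out.println(String.format(
                    "0x%04x offset=%d length=%d", b.headerId, b.offset, b.length));
        }
        // prints:
        // 0x5455 offset=4 length=5
        // 0x7875 offset=13 length=0
    }
}
```

With a zero-length payload, any parser that reads a byte at the payload offset without first checking the length will index past the end of the 13-byte array, which is consistent with the ArrayIndexOutOfBoundsException: 13 in the stack trace.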


