[ 
https://issues.apache.org/jira/browse/TIKA-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577508#comment-17577508
 ] 

Tim Allison commented on TIKA-3833:
-----------------------------------

Yes, agreed.  Thank you for sharing the example files.  I'm able to replicate 
this now with your repo and in tika-core.  

When I added the unit tests to tika-parsers-standard-package, that brought in 
the DefaultZipContainerDetector which in turn relies on detection from 
commons-compress.  That bypasses the mime detection you pointed out.  So, for 
those using tika-app and tika-server, this isn't a problem.

However, I agree that we need to fix our mimetypes definitions.
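
To make the overlap concrete, here is a minimal, self-contained sketch of why the two magic definitions quoted below collide. It does not use Tika's classes or reproduce Tika's real matching logic: `MagicOverlapSketch`, `MagicRule`, and the "longer prefix wins on equal priority" tie-break are all hypothetical, introduced only for illustration. The sample header assumes a bzip2 stream written with block size 9, which starts with the bytes `BZh91AY&SY`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MagicOverlapSketch {

    // Hypothetical stand-in for one <magic> entry; not a Tika class.
    static final class MagicRule {
        final String type;
        final int priority;
        final byte[] prefix;

        MagicRule(String type, int priority, byte[] prefix) {
            this.type = type;
            this.priority = priority;
            this.prefix = prefix;
        }

        // True if the header starts with this rule's bytes at offset 0.
        boolean matches(byte[] header) {
            if (header.length < prefix.length) {
                return false;
            }
            for (int i = 0; i < prefix.length; i++) {
                if (header[i] != prefix[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // The two rules from the quoted XML; "\x42\x5a\x68\x39\x31" is "BZh91".
    static final List<MagicRule> RULES = List.of(
            new MagicRule("application/x-bzip", 40, ascii("BZh")),
            new MagicRule("application/x-bzip2", 40, ascii("BZh91")));

    // Every type whose magic matches the header.
    static List<String> matchingTypes(byte[] header) {
        List<String> types = new ArrayList<>();
        for (MagicRule rule : RULES) {
            if (rule.matches(header)) {
                types.add(rule.type);
            }
        }
        return types;
    }

    // Assumed tie-break: higher priority first, then the longer prefix.
    static String pickMostSpecific(byte[] header) {
        MagicRule best = null;
        for (MagicRule rule : RULES) {
            if (!rule.matches(header)) {
                continue;
            }
            if (best == null
                    || rule.priority > best.priority
                    || (rule.priority == best.priority
                        && rule.prefix.length > best.prefix.length)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }

    public static void main(String[] args) {
        byte[] header = ascii("BZh91AY&SY");
        // Both rules match, and both carry priority 40.
        System.out.println(matchingTypes(header));
        // With the assumed tie-break, the more specific type is chosen.
        System.out.println(pickMostSpecific(header));
    }
}
```

Under this toy model, raising the priority of application/x-bzip2, as the reporter suggests, would also resolve the tie. Note that the five-byte pattern only covers streams whose fourth byte is `9`; a header such as `BZh1...` matches the `BZh` rule alone.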

> bzip2 MIME type is detected as bzip instead when using tika-core
> ----------------------------------------------------------------
>
>                 Key: TIKA-3833
>                 URL: https://issues.apache.org/jira/browse/TIKA-3833
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>            Reporter: Eduardas Kazakas
>            Priority: Major
>         Attachments: tika-bug.zip
>
>
> Hello, I'm having a bit of a problem when using the tika-core module (v2.4.1).
> I am trying to detect the MIME type of a bzip2 file and, instead of
> application/x-bzip2, I am getting application/x-bzip. I believe it has
> something to do with the mime-type definitions in the
> tika-mimetypes.xml file.
> {code:java}
> <mime-type type="application/x-bzip">
>   <magic priority="40">
>     <match value="BZh" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz"/>
>   <glob pattern="*.tbz"/>
> </mime-type>
> <mime-type type="application/x-bzip2">
>   <sub-class-of type="application/x-bzip"/>
>   <_comment>Bzip 2 UNIX Compressed File</_comment>
>   <magic priority="40">
>     <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz2"/>
>   <glob pattern="*.tbz2"/>
>   <glob pattern="*.boz"/>
> </mime-type>{code}
> The priority for both is set to 40. I believe that the priority of
> application/x-bzip2 should be higher, because the string value "BZh" and
> the first three bytes of the hex value, "\x42\x5a\x68", are equal:
> \x42\x5a\x68 = BZh.
> Am I missing something here? Does this look like a bug, or does it
> work as intended? Can I provide some sort of hint to the
> default detector?
> A small example in Scala:
> {code:java}
> import org.apache.tika.config.TikaConfig
> import org.apache.tika.detect.DefaultProbDetector
> import org.apache.tika.metadata.{Metadata, TikaCoreProperties}
> import java.io.{BufferedInputStream, File, FileInputStream}
> object AAA {
>   def main(args: Array[String]): Unit = {
>     val config = TikaConfig.getDefaultConfig
>     val file = new File("/home/ekazakas/test.csv.bz2")
>     val detector = new DefaultProbDetector()
>     val mediaType = detector.detect(new BufferedInputStream(new FileInputStream(file)), new Metadata)
>     val mimeType = config.getMimeRepository.forName(mediaType.toString)
>     println(mimeType)
>   }
> } {code}
> This prints `application/x-bzip` instead of `application/x-bzip2`.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
