[ https://issues.apache.org/jira/browse/TIKA-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577529#comment-17577529 ]
Tim Allison commented on TIKA-3833:
-----------------------------------

Just pushed the updates and added your example files to a unit test in tika-core. On your test example:

{noformat}
application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2
====================================
application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2
====================================
application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2 application/x-bzip2
{noformat}

> bzip2 MIME type is detected as bzip instead when using tika-core
> ----------------------------------------------------------------
>
>                 Key: TIKA-3833
>                 URL: https://issues.apache.org/jira/browse/TIKA-3833
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>            Reporter: Eduardas Kazakas
>            Priority: Major
>         Attachments: tika-bug.zip
>
>
> Hello, I'm having a bit of a problem when using the tika-core module (v2.4.1).
> I am trying to detect the MIME type of a bzip2 file and, instead of
> application/x-bzip2, I am getting application/x-bzip. I believe it has
> something to do with the mime-type definitions in the
> tika-mimetypes.xml file.
> {code:java}
> <mime-type type="application/x-bzip">
>   <magic priority="40">
>     <match value="BZh" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz"/>
>   <glob pattern="*.tbz"/>
> </mime-type>
> <mime-type type="application/x-bzip2">
>   <sub-class-of type="application/x-bzip"/>
>   <_comment>Bzip 2 UNIX Compressed File</_comment>
>   <magic priority="40">
>     <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz2"/>
>   <glob pattern="*.tbz2"/>
>   <glob pattern="*.boz"/>
> </mime-type>
> {code}
> The priority for both of these is set to 40. I believe that the priority of
> application/x-bzip2 should be higher, because the string value "BZh" and
> the leading hex bytes "\x42\x5a\x68" are equal: \x42\x5a\x68 = BZh.
> Maybe I am missing something here? Does this look like a bug, or does this
> work as intended? Maybe I can provide some sort of hint for the
> default detector?
> A small example in Scala:
> {code:java}
> import org.apache.tika.config.TikaConfig
> import org.apache.tika.detect.DefaultProbDetector
> import org.apache.tika.metadata.{Metadata, TikaCoreProperties}
> import java.io.{BufferedInputStream, File, FileInputStream}
> object AAA {
>   def main(args: Array[String]): Unit = {
>     val config = TikaConfig.getDefaultConfig
>     val file = new File("/home/ekazakas/test.csv.bz2")
>     val detector = new DefaultProbDetector()
>     val mediaType = detector.detect(new BufferedInputStream(new FileInputStream(file)), new Metadata)
>     val mimeType = config.getMimeRepository.forName(mediaType.toString)
>     println(mimeType)
>   }
> }
> {code}
> This prints `application/x-bzip` instead of `application/x-bzip2`.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
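
For readers following the magic-overlap argument in the quoted report, here is a minimal standalone sketch in plain Java (no Tika dependency). It only illustrates that both magic values from the quoted tika-mimetypes.xml excerpt match the first bytes of a typical bzip2 stream, so equal priorities leave the result ambiguous. The class name MagicOverlap, the helper names, and the "prefer the longer magic" tie-breaker are hypothetical and are not Tika's actual resolution logic or the fix that was pushed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MagicOverlap {
    // Magic values copied from the tika-mimetypes.xml excerpt quoted in the issue.
    static final byte[] BZIP_MAGIC = "BZh".getBytes(StandardCharsets.US_ASCII);
    static final byte[] BZIP2_MAGIC = {0x42, 0x5a, 0x68, 0x39, 0x31}; // "BZh91"

    // True if data begins with magic at offset 0.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Hypothetical tie-breaker (an assumption, not Tika's algorithm):
    // when several magics match, prefer the one with the longest pattern.
    static String pickMostSpecific(byte[] data) {
        String best = "application/octet-stream";
        int bestLength = -1;
        if (startsWith(data, BZIP_MAGIC) && BZIP_MAGIC.length > bestLength) {
            best = "application/x-bzip";
            bestLength = BZIP_MAGIC.length;
        }
        if (startsWith(data, BZIP2_MAGIC) && BZIP2_MAGIC.length > bestLength) {
            best = "application/x-bzip2";
            bestLength = BZIP2_MAGIC.length;
        }
        return best;
    }

    public static void main(String[] args) {
        // First bytes of a bzip2 stream written with block size 9.
        byte[] header = "BZh91AY&SY".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWith(header, BZIP_MAGIC));  // prints "true"
        System.out.println(startsWith(header, BZIP2_MAGIC)); // prints "true"
        System.out.println(pickMostSpecific(header));        // prints "application/x-bzip2"
    }
}
```

Because both patterns match the same header and both carry priority 40, some extra rule (a higher priority on the subtype, or a specificity rule like the one sketched here) is needed to select application/x-bzip2 consistently.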