Eduardas Kazakas created TIKA-3833:
--------------------------------------

             Summary: bzip2 MIME type is detected as bzip instead when using tika-core
                 Key: TIKA-3833
                 URL: https://issues.apache.org/jira/browse/TIKA-3833
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 2.4.1
            Reporter: Eduardas Kazakas

Hello,

I'm having a problem with the tika-core module (v2.4.1). When I try to detect the MIME type of a bzip2 file, I get application/x-bzip instead of application/x-bzip2. I believe it has something to do with the mime-type definitions in the tika-mimetypes.xml file:

  <mime-type type="application/x-bzip">
    <magic priority="40">
      <match value="BZh" type="string" offset="0"/>
    </magic>
    <glob pattern="*.bz"/>
    <glob pattern="*.tbz"/>
  </mime-type>

  <mime-type type="application/x-bzip2">
    <sub-class-of type="application/x-bzip"/>
    <_comment>Bzip 2 UNIX Compressed File</_comment>
    <magic priority="40">
      <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
    </magic>
    <glob pattern="*.bz2"/>
    <glob pattern="*.tbz2"/>
    <glob pattern="*.boz"/>
  </mime-type>

Both magic entries have priority 40. I believe the priority of application/x-bzip2 should be higher, because the string value "BZh" is identical to the first three bytes of the hex value: \x42\x5a\x68 = BZh. Any file that matches the bzip2 pattern therefore also matches the bzip pattern.

Maybe I am missing something here? Is this a bug, or does it work as intended? Maybe I can provide some sort of hint to the default detector?

A small example in Scala:

{code:java}
import org.apache.tika.config.TikaConfig
import org.apache.tika.detect.DefaultProbDetector
import org.apache.tika.metadata.{Metadata, TikaCoreProperties}

import java.io.{BufferedInputStream, File, FileInputStream}

object AAA {
  def main(args: Array[String]): Unit = {
    val config = TikaConfig.getDefaultConfig
    val file = new File("/home/ekazakas/test.csv.bz2")
    val detector = new DefaultProbDetector()

    // Detect from the stream content only; no file name is passed in the metadata.
    val mediaType = detector.detect(new BufferedInputStream(new FileInputStream(file)), new Metadata)
    val mimeType = config.getMimeRepository.forName(mediaType.toString)

    println(mimeType)
  }
}
{code}

This prints `application/x-bzip` instead of `application/x-bzip2`.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
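
To illustrate the overlap between the two magic values quoted in the report, here is a minimal sketch in plain Scala with no Tika dependency. The object name `MagicOverlap`, the helper `matchesAtStart`, and the two trailing header bytes are hypothetical, made up for this illustration. The sketch only compares byte prefixes at offset 0; it does not reproduce Tika's detection or priority logic.

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays

object MagicOverlap {
  // Magic values copied from the tika-mimetypes.xml excerpt in the report.
  val bzipMagic: Array[Byte] = "BZh".getBytes(StandardCharsets.US_ASCII)
  val bzip2Magic: Array[Byte] = Array(0x42, 0x5a, 0x68, 0x39, 0x31).map(_.toByte)

  // True when `header` begins with `magic` at offset 0.
  // Hypothetical helper for this sketch, not a Tika API.
  def matchesAtStart(header: Array[Byte], magic: Array[Byte]): Boolean =
    header.length >= magic.length && Arrays.equals(header.take(magic.length), magic)

  def main(args: Array[String]): Unit = {
    // Assumed sample header: the bzip2 magic followed by two more bytes.
    val header = bzip2Magic ++ Array[Byte](0x41, 0x59)

    println(matchesAtStart(header, bzipMagic))                 // true
    println(matchesAtStart(header, bzip2Magic))                // true
    println(new String(bzip2Magic, StandardCharsets.US_ASCII)) // BZh91
  }
}
```

Both checks succeed for the same header, which is why equal priorities leave the choice between the two types open.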