[ https://issues.apache.org/jira/browse/TIKA-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577486#comment-17577486 ]
Eduardas Kazakas commented on TIKA-3833:
----------------------------------------

[~tallison] I have attached a small test application. Interestingly enough, with the default bzip2 compression level Tika detects application/x-bzip2 correctly, but if I change the compression level, detection falls back to application/x-bzip.

The attached archive contains several bzip2 files produced using the bzip2 and lbzip2 commands on Ubuntu.
{code:java}
- bzip2 -z test-file-1.csv
- lbzip2 -z test-file-2.csv
- bzip2 -z empty-file.csv
- bzip2 -z small-file.csv
- lbzip2 -8 -z lbzip2-8-file.csv
- bzip2 -8 -z bzip2-8-file.csv{code}
The hexdump of the archives generated with the default compression level shows this "header":
{code:java}
0000000 5a42 3968{code}
while the hexdump of the archives generated with compression level 8 shows this:
{code:java}
0000000 5a42 3868 {code}
I also compressed a random text file with compression level 3, and this was the hexdump:
{code:java}
0000000 5a42 3368 {code}
So I assume that the values 39, 38 and 33 (hex for ASCII '9', '8' and '3') reflect the compression level of the bzip2 file, which means that the priority fix won't solve our issue. I believe the magic string should be modified, or additional magic strings should be added, so that matching takes this compression level digit into account.

> bzip2 MIME type is detected as bzip instead when using tika-core
> ----------------------------------------------------------------
>
>                 Key: TIKA-3833
>                 URL: https://issues.apache.org/jira/browse/TIKA-3833
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>            Reporter: Eduardas Kazakas
>            Priority: Major
>         Attachments: tika-bug.zip
>
>
> Hello, I'm having a bit of a problem when using the tika-core module (v2.4.1).
> I am trying to detect the MIME type of a bzip2 file and, instead of
> application/x-bzip2, I am getting application/x-bzip. I believe it has
> something to do with the mime-type definitions in the
> tika-mimetypes.xml file.
> {code:java}
> <mime-type type="application/x-bzip">
>   <magic priority="40">
>     <match value="BZh" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz"/>
>   <glob pattern="*.tbz"/>
> </mime-type>
> <mime-type type="application/x-bzip2">
>   <sub-class-of type="application/x-bzip"/>
>   <_comment>Bzip 2 UNIX Compressed File</_comment>
>   <magic priority="40">
>     <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz2"/>
>   <glob pattern="*.tbz2"/>
>   <glob pattern="*.boz"/>
> </mime-type>{code}
> The priority for both of these is set to 40. I believe that the priority of
> application/x-bzip2 should be higher, because the string value "BZh" and
> the hex value prefix "\x42\x5a\x68" are equal: \x42\x5a\x68 = BZh.
> Maybe I am missing something here? Does this look like a bug, or does it
> work as intended? Maybe I can provide some sort of hint for the
> default detector?
> A small example in Scala:
> {code:java}
> import org.apache.tika.config.TikaConfig
> import org.apache.tika.detect.DefaultProbDetector
> import org.apache.tika.metadata.{Metadata, TikaCoreProperties}
> import java.io.{BufferedInputStream, File, FileInputStream}
>
> object AAA {
>   def main(args: Array[String]): Unit = {
>     val config = TikaConfig.getDefaultConfig
>     val file = new File("/home/ekazakas/test.csv.bz2")
>     val detector = new DefaultProbDetector()
>     val mediaType = detector.detect(new BufferedInputStream(new FileInputStream(file)), new Metadata)
>     val mimeType = config.getMimeRepository.forName(mediaType.toString)
>     println(mimeType)
>   }
> } {code}
> This prints `application/x-bzip` instead of `application/x-bzip2`.
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
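
The header observation in the comment can be checked with a small sketch. On a little-endian machine, hexdump prints 16-bit words with the bytes swapped, so `5a42 3968` corresponds to the bytes 42 5a 68 39, i.e. "BZh9", and the existing x-bzip2 magic `\x42\x5a\x68\x39\x31` ("BZh91") can therefore only match files whose fourth byte is '9'. The code below assumes, as the hexdumps suggest, that the fourth byte is always an ASCII digit from '1' to '9'. `Bzip2Header` and `level` are hypothetical names used for illustration and are not part of Tika.

```scala
object Bzip2Header {
  // Returns the level digit (1-9) when the bytes start with "BZh" followed
  // by an ASCII digit '1'..'9'; returns None otherwise.
  // Assumption: the fourth byte encodes the level chosen with -1 .. -9.
  def level(header: Array[Byte]): Option[Int] =
    if (header.length >= 4 &&
        header(0) == 'B'.toByte &&
        header(1) == 'Z'.toByte &&
        header(2) == 'h'.toByte &&
        header(3) >= '1'.toByte &&
        header(3) <= '9'.toByte)
      Some(header(3) - '0')
    else
      None

  def main(args: Array[String]): Unit = {
    // The three headers seen in the hexdumps, written in file byte order.
    for (magic <- Seq("BZh9", "BZh8", "BZh3")) {
      println(s"$magic -> ${level(magic.getBytes("US-ASCII"))}")
    }
  }
}
```

A magic definition that accepts any digit in that position, rather than only '9', would cover all three headers; whether that is the right change for tika-mimetypes.xml is for the Tika maintainers to decide.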