[jira] [Commented] (TIKA-3833) bzip2 MIME type is detected as bzip instead when using tika-core

Eduardas Kazakas (Jira) Tue, 09 Aug 2022 08:37:09 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577486#comment-17577486
 ]


Eduardas Kazakas commented on TIKA-3833:
----------------------------------------

[~tallison] I have attached a small test application. Interestingly, when the 
file is compressed at the default bzip2 compression level, Tika detects 
application/x-bzip2 correctly, but when I change the compression level, 
detection falls back to application/x-bzip.

The attached archive contains several bzip2 files produced using bzip2 and 
lbzip2 commands on Ubuntu.


{code:java}
- bzip2 -z test-file-1.csv
- lbzip2 -z test-file-2.csv
- bzip2 -z empty-file.csv
- bzip2 -z small-file.csv
- lbzip2 -8 -z lbzip2-8-file.csv
- bzip2 -8 -z bzip2-8-file.csv{code}
The hexdump of the archives generated with the default compression level 
shows this "header":
{code:java}
0000000 5a42 3968{code}
while the hexdump of the archives generated with compression level = 8 
shows this:
{code:java}
0000000 5a42 3868  {code}
I also compressed a random text file with compression level = 3, and this 
was the hexdump:
{code:java}
0000000 5a42 3368 {code}
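So I assume that the hex values 39, 38 and 33 (the ASCII characters 9, 8 and 
3) reflect the compression level of the bzip2 file. Note that hexdump prints 
byte-swapped 16-bit words, so 5a42 3968 is the byte sequence 42 5a 68 39, 
i.e. "BZh9". That means the priority fix won't solve our issue: the existing 
magic string \x42\x5a\x68\x39\x31 only matches files written at level 9. I 
believe the magic string should be modified, or additional magic strings 
should be added, so that matching takes this compression level digit into 
account.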
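
To illustrate the kind of check a broader match would need, here is a minimal 
Java sketch. It assumes the fourth header byte is always an ASCII digit from 1 
to 9, as the dumps above suggest. The class Bzip2Header and the method levelOf 
are hypothetical names used only for illustration; they are not part of Tika.

```java
// Hypothetical helper, not part of Tika: inspects the first four bytes of a file.
final class Bzip2Header {

    // Returns the level digit (1-9) when the prefix is "BZh" followed by an
    // ASCII digit '1'..'9'; returns -1 for anything else.
    // Assumption: the digit after "BZh" is the only part that varies by level.
    static int levelOf(byte[] prefix) {
        if (prefix == null || prefix.length < 4) {
            return -1;
        }
        if (prefix[0] != 'B' || prefix[1] != 'Z' || prefix[2] != 'h') {
            return -1;
        }
        int digit = prefix[3] - '0';
        return (digit >= 1 && digit <= 9) ? digit : -1;
    }

    public static void main(String[] args) {
        // Byte sequences corresponding to the three hexdumps above.
        byte[] level9 = {0x42, 0x5a, 0x68, 0x39};
        byte[] level8 = {0x42, 0x5a, 0x68, 0x38};
        byte[] level3 = {0x42, 0x5a, 0x68, 0x33};

        System.out.println(levelOf(level9)); // prints 9
        System.out.println(levelOf(level8)); // prints 8
        System.out.println(levelOf(level3)); // prints 3
    }
}
```

A magic definition that accepts any of these digits, rather than only 
"BZh9" followed by \x31, would cover all three sample headers.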

> bzip2 MIME type is detected as bzip instead when using tika-core
> ----------------------------------------------------------------
>
>                 Key: TIKA-3833
>                 URL: https://issues.apache.org/jira/browse/TIKA-3833
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>            Reporter: Eduardas Kazakas
>            Priority: Major
>         Attachments: tika-bug.zip
>
>
> Hello, I'm having a bit of a problem when using the tika-core module (v2.4.1).
> I am trying to detect the MIME type of a bzip2 file and, instead of
> application/x-bzip2, I am getting application/x-bzip. I believe it has
> something to do with the mime-type definitions in the
> tika-mimetypes.xml file.
> {code:java}
> <mime-type type="application/x-bzip">
>   <magic priority="40">
>     <match value="BZh" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz"/>
>   <glob pattern="*.tbz"/>
> </mime-type>
> <mime-type type="application/x-bzip2">
>   <sub-class-of type="application/x-bzip"/>
>   <_comment>Bzip 2 UNIX Compressed File</_comment>
>   <magic priority="40">
>     <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.bz2"/>
>   <glob pattern="*.tbz2"/>
>   <glob pattern="*.boz"/>
> </mime-type>{code}
> The priority for both is set to 40. I believe that the priority of
> application/x-bzip2 should be higher, because the string value "BZh" and
> the hex value prefix "\x42\x5a\x68" are equal: \x42\x5a\x68 = BZh.
> Maybe I am missing something here? Does this look like a bug, or does it
> work as intended? Maybe I can provide some sort of hint for the
> default detector?
> A small example in Scala:
> {code:java}
> import org.apache.tika.config.TikaConfig
> import org.apache.tika.detect.DefaultProbDetector
> import org.apache.tika.metadata.{Metadata, TikaCoreProperties}
> import java.io.{BufferedInputStream, File, FileInputStream}
> object AAA {
>   def main(args: Array[String]): Unit = {
>     val config = TikaConfig.getDefaultConfig
>     val file = new File("/home/ekazakas/test.csv.bz2")
>     val detector = new DefaultProbDetector()
>     val mediaType = detector.detect(
>       new BufferedInputStream(new FileInputStream(file)), new Metadata)
>     val mimeType = config.getMimeRepository.forName(mediaType.toString)
>     println(mimeType)
>   }
> } {code}
> This prints `application/x-bzip` instead of `application/x-bzip2`.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
