[ 
https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830614#comment-17830614
 ] 

ASF GitHub Bot commented on TIKA-4220:
--------------------------------------

tballison merged PR #1687:
URL: https://github.com/apache/tika/pull/1687




> Commons-compress too lenient on headless tar detection
> ------------------------------------------------------
>
>                 Key: TIKA-4220
>                 URL: https://issues.apache.org/jira/browse/TIKA-4220
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> In recent regression tests for TIKA-4218, we noticed a fairly major change: 
> an increased rate of false positives in headless tar detection from 
> commons-compress.
> I think for now we should copy/paste/fork the headless tar detection and 
> improve it, revert it, or possibly remove it for our 2.9.2 release.
> On this ticket, I'll look into what changed recently in headless tar 
> detection in commons-compress and experiment with fixing it.
> One challenge is that our magic bytes detection happens _after_ our custom 
> detectors, which means that we can't put a low confidence on what comes out 
> of our custom detectors and let the magic detection fix it. We could 
> implement an x-tar special case, but I really don't like that.
> Let's see what we can do...
> The counts below are the number of files identified as A (in Tika 2.9.1) 
> -> B (in tika-2.9.2-pre-rc1).
> application/octet-stream -> application/x-tar           826
> multipart/appledouble -> application/x-tar              701
> image/x-tga -> application/x-tar                        322
> image/vnd.microsoft.icon -> application/x-tar           312
> application/vnd.iccprofile -> application/x-tar         221
> video/mp4 -> application/x-tar                          177
> audio/mpeg -> application/x-tar                          59
> video/x-m4v -> application/x-tar                         59
> application/x-font-printer-metric -> application/x-tar   36
> audio/mp4 -> application/x-tar                           25
> application/x-tex-tfm -> application/x-tar               18
> image/x-pict -> application/x-tar                        15
> image/png -> application/x-tar                            8
> text/plain; charset=ISO-8859-1 -> application/x-tar       8
> application/x-endnote-style -> application/x-tar          7
> application/x-font-ttf -> application/x-tar               6
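
For illustration only: a headless tar (one without the "ustar" magic) can still be sanity-checked through the header checksum, which in the standard tar layout is an octal number stored in 8 bytes at offset 148 of the first 512-byte block. The sketch below shows that check in isolation. The class and method names (HeadlessTarCheck, looksLikeTarHeader, parseOctal) are hypothetical, and this is not the code merged in PR #1687 nor the commons-compress implementation.

```java
// Hypothetical sketch: not Tika's or commons-compress's actual detection code.
final class HeadlessTarCheck {
    private static final int RECORD_SIZE = 512;   // size of one tar header block
    private static final int CHKSUM_OFFSET = 148; // offset of the checksum field
    private static final int CHKSUM_LEN = 8;      // length of the checksum field

    /** Returns true only if the first 512 bytes carry a self-consistent checksum. */
    static boolean looksLikeTarHeader(byte[] header) {
        if (header == null || header.length < RECORD_SIZE) {
            return false;
        }
        long stored = parseOctal(header, CHKSUM_OFFSET, CHKSUM_LEN);
        if (stored < 0) {
            return false; // checksum field is empty or not octal
        }
        long unsignedSum = 0;
        long signedSum = 0;
        for (int i = 0; i < RECORD_SIZE; i++) {
            byte b = header[i];
            // The checksum field itself is counted as if it held spaces.
            if (i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN) {
                b = (byte) ' ';
            }
            unsignedSum += b & 0xff;
            signedSum += b;
        }
        // Accept the signed sum too, since some old writers summed signed bytes.
        return stored == unsignedSum || stored == signedSum;
    }

    /** Parses an octal field; returns -1 if it has no digits or trailing junk. */
    static long parseOctal(byte[] buf, int off, int len) {
        int i = off;
        int end = off + len;
        while (i < end && buf[i] == ' ') {
            i++; // skip leading spaces
        }
        long value = 0;
        int digits = 0;
        while (i < end && buf[i] >= '0' && buf[i] <= '7') {
            value = (value << 3) + (buf[i] - '0');
            digits++;
            i++;
        }
        while (i < end) {
            if (buf[i] != 0 && buf[i] != ' ') {
                return -1; // only NUL or space may follow the digits
            }
            i++;
        }
        return digits == 0 ? -1 : value;
    }
}
```

A matching checksum is a necessary condition rather than proof of a tar file, so a stricter detector would likely combine it with checks on other header fields.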



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
