[ https://issues.apache.org/jira/browse/TIKA-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-4220.
-------------------------------
    Fix Version/s: 3.0.0
                   2.9.3
       Resolution: Fixed

Many thanks to [~ggregory] and {{commons-compress}}!

> Commons-compress too lenient on headless tar detection
> ------------------------------------------------------
>
>                 Key: TIKA-4220
>                 URL: https://issues.apache.org/jira/browse/TIKA-4220
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 3.0.0, 2.9.3
>
>
> On recent regression tests on TIKA-4218, we noticed a fairly major change
> with an increased rate of false positives on headless tar detection from
> commons-compress.
>
> I think for now we should copy/paste/fork the headless tar detection and
> improve it/revert it or possibly remove it for our 2.9.2 release.
>
> On this ticket, I'll look into what changed recently in headless tar
> detection in commons-compress and experiment with fixing it.
>
> One challenge is that our magic bytes detection happens _after_ our custom
> detectors, which means that we can't put a low confidence on what comes out
> of our custom detectors and let the magic detection fix it. We could
> implement an x-tar special case, but I really don't like that.
>
> Let's see what we can do...
>
> The numbers below represent the number of files identified as A (in tika
> 2.9.1) -> B (in tika-2.9.2-pre-rc1).
>
> application/octet-stream -> application/x-tar            826
> multipart/appledouble -> application/x-tar               701
> image/x-tga -> application/x-tar                         322
> image/vnd.microsoft.icon -> application/x-tar            312
> application/vnd.iccprofile -> application/x-tar          221
> video/mp4 -> application/x-tar                           177
> audio/mpeg -> application/x-tar                           59
> video/x-m4v -> application/x-tar                          59
> application/x-font-printer-metric -> application/x-tar    36
> audio/mp4 -> application/x-tar                            25
> application/x-tex-tfm -> application/x-tar                18
> image/x-pict -> application/x-tar                         15
> image/png -> application/x-tar                             8
> text/plain; charset=ISO-8859-1 -> application/x-tar        8
> application/x-endnote-style -> application/x-tar           7
> application/x-font-ttf -> application/x-tar                6

--
This message was sent by Atlassian Jira
(v8.20.10#820010)