daniel schmidt created TIKA-2591:
------------------------------------

             Summary: Some tiffs (Big Endian with fax compression) are showing up as x-tar
                 Key: TIKA-2591
                 URL: https://issues.apache.org/jira/browse/TIKA-2591
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 1.16
         Environment: Tika, running in a java application and a unit-test 
(windows and mac environments)
            Reporter: daniel schmidt
             Fix For: 1.18


I have found that a certain TIFF that we manage is now detected as 
application/x-tar by Tika, where it was previously detected as image/tiff. 

Observe this code in the detect method of ArchiveStreamFactory (Apache Commons 
Compress):

        // COMPRESS-117 - improve auto-recognition
        if (signatureLength >= TAR_HEADER_SIZE) {
            TarArchiveInputStream tais = null;
            try {
                tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
                // COMPRESS-191 - verify the header checksum
                if (tais.getNextTarEntry().isCheckSumOK()) {
                    return TAR;
                }
            } catch (final Exception e) { // NOPMD // NOSONAR
                // can generate IllegalArgumentException as well
                // as IOException
                // autodetection, simply not a TAR
                // ignored
            } finally {
                IOUtils.closeQuietly(tais);
            }
        }

What I find is that most TIFFs fail with an exception when they get to 
tais.getNextTarEntry() (i.e. they fall into the "simply not a TAR" case). 
However, this TIFF does NOT fail here. This somewhat makes sense, as the 
internal structure of a fax-compressed TIFF has a tar-like structure.
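
For context, the header check that the snippet above relies on can be sketched 
roughly as follows. In the tar format, each 512-byte header stores an octal 
checksum in the 8 bytes at offset 148, computed as the plain sum of all header 
bytes with the checksum field itself counted as spaces. The class and method 
names below are hypothetical, and this is a simplified illustration (unsigned 
sum only), not the Commons Compress implementation. It is only meant to show 
that the check is a simple additive sum, which non-tar data can in principle 
satisfy by coincidence.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Commons Compress or Tika.
final class TarChecksumSketch {
    static final int HEADER_SIZE = 512;   // size of one tar header block
    static final int CHKSUM_OFFSET = 148; // start of the checksum field
    static final int CHKSUM_LENGTH = 8;   // length of the checksum field

    /** Sums all header bytes as unsigned values, counting the checksum field as spaces. */
    static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LENGTH;
            sum += inField ? 0x20 : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Parses the octal checksum stored in the header, or returns -1 if none is present. */
    static long storedChecksum(byte[] header) {
        long value = 0;
        boolean seenDigit = false;
        for (int i = CHKSUM_OFFSET; i < CHKSUM_OFFSET + CHKSUM_LENGTH; i++) {
            byte b = header[i];
            if (b >= '0' && b <= '7') {
                value = value * 8 + (b - '0');
                seenDigit = true;
            } else if (seenDigit) {
                break;             // NUL or space terminates the number
            } else if (b != ' ') {
                return -1;         // anything but leading spaces means no checksum
            }
        }
        return seenDigit ? value : -1;
    }

    /** True when the stored checksum matches the computed one. */
    static boolean checksumOk(byte[] header) {
        return header.length >= HEADER_SIZE
                && storedChecksum(header) == computeChecksum(header);
    }

    /** Builds a minimal header containing only a short entry name and a valid checksum. */
    static byte[] sampleHeader(String name) {
        byte[] header = new byte[HEADER_SIZE];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, nameBytes.length);
        // Six octal digits followed by NUL and space fill the 8-byte field.
        String octal = String.format("%06o", computeChecksum(header));
        byte[] field = (octal + "\0 ").getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(field, 0, header, CHKSUM_OFFSET, field.length);
        return header;
    }

    public static void main(String[] args) {
        byte[] header = sampleHeader("a.txt");
        System.out.println("valid header passes: " + checksumOk(header));
        header[0] ^= 1; // corrupt one byte of the name
        System.out.println("corrupted header passes: " + checksumOk(header));
    }
}
```

Any 512-byte prefix whose bytes 148 to 155 happen to parse as the octal sum of 
the rest would pass a check of this kind, regardless of the file's real format.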

Note that the CompositeDetector class does eventually see the file recognized 
as a proper TIFF as it loops through its detectors in its detect method. The 
image/tiff result comes from the MimeTypes class, which is one of the 
implementations of the Detector interface:

    public MediaType detect(InputStream input, Metadata metadata)
            throws IOException {
        MediaType type = MediaType.OCTET_STREAM;
        for (Detector detector : getDetectors()) {
            //short circuit via OverrideDetector
            //can't rely on ordering because subsequent detector may
            //change Override's to a specialization of Override's
            if (detector instanceof OverrideDetector &&
                    metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
                return detector.detect(input, metadata);
            }
            MediaType detected = detector.detect(input, metadata);
            if (registry.isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

However, since image/tiff isn't a specialization of application/x-tar, the loop 
does not replace the already-detected application/x-tar with image/tiff.

My fix was to add "<sub-class-of type="application/x-tar"/>" to the definition 
of image/tiff in the tika-mimetypes.xml file.
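
To make the effect of that change concrete, here is a minimal, self-contained 
sketch of the selection loop using plain strings instead of Tika's MediaType 
and registry classes. Everything in it (the class name, addSuperType, the 
default parent rule) is a hypothetical stand-in chosen for illustration. It 
assumes the tar result is reported before the TIFF result, and that every type 
without an explicit parent falls back to application/octet-stream.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the detector loop; not Tika's actual classes.
final class DetectorLoopSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Explicit child -> parent relations, like sub-class-of entries.
    private final Map<String, String> parents = new HashMap<>();

    void addSuperType(String type, String parent) {
        parents.put(type, parent);
    }

    /** Explicit parent if registered, otherwise octet-stream (assumed default). */
    String superTypeOf(String type) {
        String explicit = parents.get(type);
        if (explicit != null) {
            return explicit;
        }
        return OCTET_STREAM.equals(type) ? null : OCTET_STREAM;
    }

    /** True when b is a strict ancestor of a. Assumes the relations contain no cycles. */
    boolean isSpecializationOf(String a, String b) {
        for (String t = superTypeOf(a); t != null; t = superTypeOf(t)) {
            if (t.equals(b)) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the loop: keep a result only if it specializes the current type. */
    String detect(List<String> detectorResults) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");

        DetectorLoopSketch registry = new DetectorLoopSketch();
        // Without the relation, the later image/tiff result is discarded.
        System.out.println(registry.detect(results)); // application/x-tar

        // With image/tiff declared a sub-class of application/x-tar, it wins.
        registry.addSuperType("image/tiff", "application/x-tar");
        System.out.println(registry.detect(results)); // image/tiff
    }
}
```

In this sketch the outcome depends on the order in which results arrive: if 
image/tiff were reported first, it would be kept even without the extra 
relation, because application/x-tar is not a specialization of image/tiff.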

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
