[ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379137#comment-16379137 ]
daniel schmidt commented on TIKA-2591:
--------------------------------------

It is a bit of a "useful hack", as they say. But it's also kind of weird: the code is written to depend on TarArchiveInputStream throwing an exception to signal "not a tar". In this case, for these images, "success" is essentially failure?

It does seem odd to declare tiff a sub-type of tar, but that is where the code led me, since Tika first sees the file as a tar and only later sees it as a .tiff.

Another option I considered was guarding the construction of TarArchiveInputStream with a conditional that checks the header for the TIFF magic numbers ("II" = 49 49 2A 00, "MM" = 4D 4D 00 2A). They are there, and you can check them and go to the "simply not a tar" case without even throwing an exception. That also seemed a little goofy, but it also worked.

    try {
        tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
        // COMPRESS-191 - verify the header checksum
        if (tais.getNextTarEntry().isCheckSumOK()) {
            return TAR;
        }
    } catch (final Exception e) { // NOPMD // NOSONAR
        // can generate IllegalArgumentException as well
        // as IOException
        // autodetection, simply not a TAR
        // ignored
    }

> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2591
>                 URL: https://issues.apache.org/jira/browse/TIKA-2591
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>   Affects Versions: 1.16
>         Environment: Tika, running in a java application and a unit-test
>                      (windows and mac environments)
>            Reporter: daniel schmidt
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.18
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting
> application/x-tar in Tika where it previously reported as a tiff
> (image/tiff).
> Observe this code in ArchiveStreamFactory, detect method.
>     // COMPRESS-117 - improve auto-recognition
>     if (signatureLength >= TAR_HEADER_SIZE) {
>         TarArchiveInputStream tais = null;
>         try {
>             tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>             // COMPRESS-191 - verify the header checksum
>             if (tais.getNextTarEntry().isCheckSumOK()) {
>                 return TAR;
>             }
>         } catch (final Exception e) { // NOPMD // NOSONAR
>             // can generate IllegalArgumentException as well
>             // as IOException
>             // autodetection, simply not a TAR
>             // ignored
>         } finally {
>             IOUtils.closeQuietly(tais);
>         }
>     }
>
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail
> with an exception (i.e. fall into the "simply not a tar" case). However, this
> tiff actually does NOT fail here. This somewhat makes sense, as the internal
> structure of a fax-compressed tiff has a tar-like structure.
> Note, the CompositeDetector class eventually does recognize it as a proper
> tiff as it loops through its detectors in its detect method. It is detected
> as tiff in the MimeTypes class, which is one of the implementations of the
> Detector interface.
>
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&
>                     metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
>     }
>
> However, since image/tiff isn't a specialization of application/x-tar, the
> loop does not replace the type with tiff.
> My fix was to add a "<sub-class-of type="application/x-tar"/>" to the
> definition for image/tiff in the tika-mimetypes.xml file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
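
For reference, the header guard described in the comment above (checking for the TIFF magic numbers before attempting the TAR probe) might look roughly like the following. This is only a minimal sketch: the class name TiffHeaderGuard and the method looksLikeTiff are hypothetical and are not part of Tika or Commons Compress. It checks only the two classic TIFF signatures quoted in the comment.

```java
// Hypothetical helper, for illustration only; not an existing Tika or
// Commons Compress class.
final class TiffHeaderGuard {

    private TiffHeaderGuard() {
        // static utility, no instances
    }

    /**
     * Returns true if the buffer starts with one of the two classic TIFF
     * signatures: "II" 2A 00 (little endian) or "MM" 00 2A (big endian).
     */
    static boolean looksLikeTiff(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        boolean littleEndian = header[0] == 0x49 && header[1] == 0x49
                && header[2] == 0x2A && header[3] == 0x00;
        boolean bigEndian = header[0] == 0x4D && header[1] == 0x4D
                && header[2] == 0x00 && header[3] == 0x2A;
        return littleEndian || bigEndian;
    }
}
```

Under that assumption, the TAR probe in the detect method would only run when looksLikeTiff(tarHeader) returns false, so a TIFF header would go straight to the "simply not a TAR" outcome without relying on an exception.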