[ https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390247#comment-16390247 ]

Hudson commented on TIKA-2591:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1453 (See [https://builds.apache.org/job/Tika-trunk/1453/])
TIKA-2591 -- Add workaround to identify TIFFs that might confuse (tallison: [https://github.com/apache/tika/commit/462ee4744fd426cfdb12539435627b25e789c912])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
* (edit) CHANGES.txt
* (add) tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipContainerDetectorTest.java


> Some tiffs (Big Endian with fax compression) are showing up as x-tarr
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2591
>                 URL: https://issues.apache.org/jira/browse/TIKA-2591
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.16
>         Environment: Tika, running in a java application and a unit-test
> (windows and mac environments)
>            Reporter: daniel schmidt
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.18, 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting
> application/x-tar in Tika where it previously reported as a tiff
> (image/tiff).
> Observe this code in ArchiveStreamFactory, detect method.
>     // COMPRESS-117 - improve auto-recognition
>     if (signatureLength >= TAR_HEADER_SIZE) {
>         TarArchiveInputStream tais = null;
>         try {
>             tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>             // COMPRESS-191 - verify the header checksum
>             if (tais.getNextTarEntry().isCheckSumOK()) {
>                 return TAR;
>             }
>         } catch (final Exception e) { // NOPMD // NOSONAR
>             // can generate IllegalArgumentException as well
>             // as IOException
>             // autodetection, simply not a TAR
>             // ignored
>         } finally {
>             IOUtils.closeQuietly(tais);
>         }
>     }
> What I find is that most TIFs, when they get to tais.getNextTarEntry(), fail
> with an exception (i.e., fall into the "simply not a tar" case).
> However, this tiff actually does NOT fail here. This somewhat makes sense, as
> a fax-compressed tiff has a tar-like internal structure.
> Note, the CompositeDetector class eventually does recognize it as a proper
> tiff as it loops through its detectors in its detect method. It is detected
> as tiff in the MimeTypes class, which is one of the implementations of the
> Detector interface.
>
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&
>                     metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
>     }
> However, since image/tiff isn't a specialization of application/x-tar, it does
> not replace the type with tiff.
> My fix was to add a "<sub-class-of type="application/x-tar"/>" to the
> definition for image/tiff in the tika-mimetypes.xml file.
>
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
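
For readers following the reporter's reasoning, here is a minimal, dependency-free sketch of the "only replace with a specialization" loop quoted in the description. It is not Tika code: the class DetectionLoopSketch, the plain-string media types, and the parent map standing in for the media type registry are all hypothetical simplifications, and it assumes the tar result arrives before the tiff result. It only illustrates why a later image/tiff result cannot displace an earlier application/x-tar result unless tiff is declared a sub-class of tar. Note that, per the Hudson comment above, the integrated commit edited ZipContainerDetector.java; this sketch models the reporter's proposed sub-class-of change, not that commit.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the detection loop; not the Tika implementation.
public class DetectionLoopSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Assumption: every type without an explicit parent falls back to
    // application/octet-stream, which itself has no parent.
    static String supertype(String type, Map<String, String> parents) {
        if (parents.containsKey(type)) {
            return parents.get(type);
        }
        return OCTET_STREAM.equals(type) ? null : OCTET_STREAM;
    }

    // True only if 'ancestor' appears strictly above 'type' in the parent chain.
    static boolean isSpecializationOf(String type, String ancestor,
                                      Map<String, String> parents) {
        String current = supertype(type, parents);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = supertype(current, parents);
        }
        return false;
    }

    // Mirrors the quoted loop: keep the running type unless the next
    // detector's answer is a specialization of it.
    static String detect(List<String> detectorResults, Map<String, String> parents) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(detected, type, parents)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // Assumed order: the tar guess is seen first, the tiff guess second.
        List<String> results = Arrays.asList("application/x-tar", "image/tiff");

        Map<String, String> parents = new HashMap<>();
        System.out.println(detect(results, parents)); // tar wins

        // The reporter's proposed change: declare tiff a sub-class of tar.
        parents.put("image/tiff", "application/x-tar");
        System.out.println(detect(results, parents)); // tiff wins
    }
}
```

Under these assumptions the first call keeps application/x-tar and the second returns image/tiff, which matches the behaviour the reporter describes before and after the sub-class-of change.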