[ 
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391994#comment-16391994
 ] 

Hudson commented on TIKA-2591:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #9 (See 
[https://builds.apache.org/job/tika-branch-1x/9/])
TIKA-2591 -- prevent AIOOBE when haystack shorter than needle (tallison: 
[https://github.com/apache/tika/commit/4d75a32c8e98e333b6142c3a38ec57c4f00bd78a])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java


> Some tiffs (Big Endian with fax compression) are showing up as x-tar
> --------------------------------------------------------------------
>
>                 Key: TIKA-2591
>                 URL: https://issues.apache.org/jira/browse/TIKA-2591
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.16
>         Environment: Tika, running in a java application and a unit-test 
> (windows and mac environments)
>            Reporter: daniel schmidt
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.18, 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain TIFF that we manage is now reported as 
> application/x-tar by Tika, where it was previously reported as 
> image/tiff. 
> Consider this code in the detect method of ArchiveStreamFactory:
>         // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(
>                         new ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
>         }
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), 
> fail with an exception (i.e. they fall into the "simply not a TAR" case). 
> However, this TIFF does NOT fail here. This somewhat makes sense, as the 
> internal structure of a fax-compressed TIFF has a tar-like structure.
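
To illustrate why a non-tar file can slip through this kind of test, here is a
minimal, self-contained sketch of a tar header checksum check. It is not the
Commons Compress implementation: the class and method names are hypothetical,
and the octal parsing is deliberately simplified. It only assumes the standard
tar layout (512-byte header, 8-byte checksum field at offset 148, checksum
computed as the sum of all header bytes with the checksum field counted as
spaces). The bytes of the reporter's TIFF are not known here, so the
"TIFF-like" buffer below is made up and is rejected.

```java
import java.nio.charset.StandardCharsets;

public class TarChecksumSketch {
    static final int HEADER_SIZE = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LEN = 8;

    /** Sums all header bytes as unsigned values, counting the checksum field as spaces. */
    static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inChecksumField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inChecksumField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Simplified parse: accumulates the octal digits in the checksum field, skipping other bytes. */
    static long storedChecksum(byte[] header) {
        long value = 0;
        for (int i = CHKSUM_OFFSET; i < CHKSUM_OFFSET + CHKSUM_LEN; i++) {
            byte b = header[i];
            if (b >= '0' && b <= '7') {
                value = value * 8 + (b - '0');
            }
        }
        return value;
    }

    /** True when the buffer is long enough and the stored value matches the computed sum. */
    static boolean isChecksumOk(byte[] header) {
        return header.length >= HEADER_SIZE
                && computeChecksum(header) == storedChecksum(header);
    }

    /** Builds an otherwise empty header holding only a name and a matching checksum. */
    static byte[] buildHeader(String name) {
        byte[] header = new byte[HEADER_SIZE];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, nameBytes.length);
        // The computed sum ignores the checksum field, so it can be written afterwards.
        byte[] digits = String.format("%06o", computeChecksum(header))
                .getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(digits, 0, header, CHKSUM_OFFSET, digits.length);
        header[CHKSUM_OFFSET + 6] = 0;
        header[CHKSUM_OFFSET + 7] = ' ';
        return header;
    }

    public static void main(String[] args) {
        byte[] valid = buildHeader("hello.txt");
        System.out.println(isChecksumOk(valid));    // prints true

        // Made-up buffer starting with the big-endian TIFF magic "MM\0*".
        byte[] tiffLike = new byte[HEADER_SIZE];
        tiffLike[0] = 'M';
        tiffLike[1] = 'M';
        tiffLike[3] = 42;
        System.out.println(isChecksumOk(tiffLike)); // prints false
    }
}
```

Because the test is only a byte sum compared against a few octal digits, any
512-byte prefix whose bytes happen to add up to the value found at offset 148
would pass it, whatever the file really is.
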
> Note that the CompositeDetector class does eventually see it recognized as 
> a proper TIFF as it loops through its detectors in its detect method. It is 
> detected as TIFF by the MimeTypes class, which is one of the 
> implementations of the Detector interface:
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector
>                     && metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
>     }
> However, since image/tiff isn't a specialization of application/x-tar, the 
> loop does not replace the type with image/tiff.
> My fix was to add <sub-class-of type="application/x-tar"/> to the 
> definition of image/tiff in the tika-mimetypes.xml file.
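
The effect of the specialization test in that loop can be reproduced with a
small toy model. This is only a sketch, not Tika code: the class, the
plain-string media types and the child-to-parent map are hypothetical stand-ins
for MediaType and the registry, and it assumes that every type without a
declared parent descends from application/octet-stream. It shows the two cases
described above: without a declared parent for image/tiff the earlier
application/x-tar result sticks, and with the sub-class-of relation the later
image/tiff result replaces it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpecializationSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Parent of a type in the toy registry; octet-stream is the root and has none. */
    static String parentOf(Map<String, String> parents, String type) {
        if (type.equals(OCTET_STREAM)) {
            return null;
        }
        return parents.getOrDefault(type, OCTET_STREAM);
    }

    /** True when ancestor appears somewhere above child in the parent chain. */
    static boolean isSpecializationOf(Map<String, String> parents, String child, String ancestor) {
        String current = parentOf(parents, child);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parentOf(parents, current);
        }
        return false;
    }

    /** Same shape as the quoted loop: keep a result only if it specializes the current type. */
    static String detect(List<String> detectorResults, Map<String, String> parents) {
        String type = OCTET_STREAM;
        for (String detected : detectorResults) {
            if (isSpecializationOf(parents, detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        // One detector answers x-tar first, a later one answers tiff.
        List<String> results = List.of("application/x-tar", "image/tiff");

        Map<String, String> noRelation = new HashMap<>();
        System.out.println(detect(results, noRelation));   // prints application/x-tar

        Map<String, String> withSubClass = new HashMap<>();
        withSubClass.put("image/tiff", "application/x-tar");
        System.out.println(detect(results, withSubClass)); // prints image/tiff
    }
}
```

In this model the outcome also depends on detector order: if the image/tiff
result came first, the later application/x-tar result would not replace it
either.
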
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
