[ 
https://issues.apache.org/jira/browse/TIKA-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378750#comment-16378750
 ] 

ASF GitHub Bot commented on TIKA-2591:
--------------------------------------

danschmidt1 opened a new pull request #226: fix for TIKA-2591 contributed by 
dan.schmi...@gmail.com
URL: https://github.com/apache/tika/pull/226
 
 
   I have found that a certain tiff that we manage is now reporting 
application/x-tar in Tika where it previously reported as a tiff (image/tiff). 
   
   Observe this code in ArchiveStreamFactory, detect method.
   
           // COMPRESS-117 - improve auto-recognition
           if (signatureLength >= TAR_HEADER_SIZE) {
               TarArchiveInputStream tais = null;
               try {
                   tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
                   // COMPRESS-191 - verify the header checksum
                   if (tais.getNextTarEntry().isCheckSumOK()) {
                       return TAR;
                   }
               } catch (final Exception e) { // NOPMD // NOSONAR
                   // can generate IllegalArgumentException as well
                   // as IOException
                   // autodetection, simply not a TAR
                   // ignored
               } finally {
                   IOUtils.closeQuietly(tais);
               }

   What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail 
with an exception (i.e., they fall into the "simply not a TAR" case). However, this 
TIFF does NOT fail here, so the checksum check passes and the method returns TAR. 
This somewhat makes sense, as a fax-compressed TIFF apparently has a tar-like 
internal structure.
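
   To make the failure mode concrete, here is a minimal sketch of a tar header
checksum test written against the ustar header layout (512-byte header, 8-byte
octal checksum field at offset 148, checksum computed with that field counted
as spaces). It is NOT the Commons Compress code; the class and method names are
hypothetical, and the octal parsing is deliberately lenient. It only illustrates
that the decision rests on a handful of bytes, so an unrelated file whose bytes
at that offset happen to encode the right sum would be accepted.

```java
class TarChecksumSketch {
    static final int HEADER_SIZE = 512;   // ustar header block size
    static final int CHKSUM_OFFSET = 148; // start of the checksum field
    static final int CHKSUM_LENGTH = 8;   // length of the checksum field

    /** Unsigned sum of the header bytes, counting the checksum field as spaces. */
    static long computeChecksum(byte[] header) {
        long sum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LENGTH;
            sum += inField ? ' ' : (header[i] & 0xFF);
        }
        return sum;
    }

    /** Parses the stored octal checksum; returns -1 if the field has no octal digits. */
    static long storedChecksum(byte[] header) {
        long value = 0;
        int digits = 0;
        for (int i = CHKSUM_OFFSET; i < CHKSUM_OFFSET + CHKSUM_LENGTH; i++) {
            byte b = header[i];
            if (b >= '0' && b <= '7') {
                value = value * 8 + (b - '0');
                digits++;
            } else if (digits > 0) {
                break; // first non-octal byte after the digits ends the number
            }
            // non-octal bytes before the first digit are skipped (lenient assumption)
        }
        return digits == 0 ? -1 : value;
    }

    /** True when the stored checksum matches the computed one. */
    static boolean looksLikeTar(byte[] header) {
        return header.length >= HEADER_SIZE
                && storedChecksum(header) == computeChecksum(header);
    }

    public static void main(String[] args) {
        // A zero-filled block that starts with the big-endian TIFF magic "MM\0*".
        byte[] block = new byte[HEADER_SIZE];
        block[0] = 'M';
        block[1] = 'M';
        block[2] = 0;
        block[3] = 42;
        System.out.println("looks like tar: " + looksLikeTar(block));
    }
}
```

   A real TIFF only trips a check like this when the image data at offset 148
happens to look like a matching octal number, which would explain why most
TIFFs are rejected and only a few are not.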
   
   Note that the file is still recognized as a proper TIFF while the 
CompositeDetector class loops through its detectors in its detect method: the 
MimeTypes class, which is one of the implementations of the Detector interface, 
detects it as image/tiff.
   
    
   
       public MediaType detect(InputStream input, Metadata metadata)
               throws IOException {
           MediaType type = MediaType.OCTET_STREAM;
           for (Detector detector : getDetectors()) {
               //short circuit via OverrideDetector
               //can't rely on ordering because subsequent detector may
               //change Override's to a specialization of Override's
               if (detector instanceof OverrideDetector &&
                       metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
                   return detector.detect(input, metadata);
               }
               MediaType detected = detector.detect(input, metadata);
               if (registry.isSpecializationOf(detected, type)) {
                   type = detected;
               }
           }
           return type;

   However, since image/tiff isn't a specialization of application/x-tar, the 
loop does not replace the type with image/tiff.
   
   My fix was to add a "<sub-class-of type="application/x-tar"/>" element to the 
definition for image/tiff in the tika-mimetypes.xml file.
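
   The effect of that change on the loop above can be sketched with a toy
registry. This is a hypothetical stand-in, not Tika's MediaTypeRegistry: types
are plain strings, SpecializationSketch and its methods are invented for
illustration, and every type is assumed to fall back to
application/octet-stream as its root.

```java
import java.util.HashMap;
import java.util.Map;

class SpecializationSketch {
    static final String ROOT = "application/octet-stream";

    // child type -> declared parent type (what a sub-class-of entry would express)
    private final Map<String, String> parents = new HashMap<>();

    void addSuperType(String child, String parent) {
        parents.put(child, parent);
    }

    // Declared parent, or the root for any other type; the root itself has no parent.
    private String parentOf(String type) {
        return parents.getOrDefault(type, type.equals(ROOT) ? null : ROOT);
    }

    /** True if base appears among the ancestors of candidate. */
    boolean isSpecializationOf(String candidate, String base) {
        String current = parentOf(candidate);
        // The hop limit guards against accidental cycles in the declared parents.
        for (int hops = 0; current != null && hops < 32; hops++) {
            if (current.equals(base)) {
                return true;
            }
            current = parentOf(current);
        }
        return false;
    }

    /** Same selection rule as the detect loop: keep a result only if it narrows the current type. */
    String resolve(String... detectedInOrder) {
        String type = ROOT;
        for (String detected : detectedInOrder) {
            if (isSpecializationOf(detected, type)) {
                type = detected;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        SpecializationSketch registry = new SpecializationSketch();
        System.out.println(registry.resolve("application/x-tar", "image/tiff"));
        registry.addSuperType("image/tiff", "application/x-tar");
        System.out.println(registry.resolve("application/x-tar", "image/tiff"));
    }
}
```

   Without the extra parent entry, the tar result wins because image/tiff does
not narrow it; with the entry, image/tiff counts as a specialization of
application/x-tar and replaces it.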
   
   The file I found this in contains HIPAA-protected information, so I can't upload it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Some tiffs (Big Endian with fax compression) are showing up as x-tar
> --------------------------------------------------------------------
>
>                 Key: TIKA-2591
>                 URL: https://issues.apache.org/jira/browse/TIKA-2591
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.16
>         Environment: Tika, running in a java application and a unit-test 
> (windows and mac environments)
>            Reporter: daniel schmidt
>            Priority: Major
>              Labels: newbie
>             Fix For: 1.18
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have found that a certain tiff that we manage is now reporting 
> application/x-tar in Tika where it previously reported as a tiff 
> (image/tiff). 
> Observe this code in ArchiveStreamFactory, detect method.
>   // COMPRESS-117 - improve auto-recognition
>         if (signatureLength >= TAR_HEADER_SIZE) {
>             TarArchiveInputStream tais = null;
>             try {
>                 tais = new TarArchiveInputStream(new ByteArrayInputStream(tarHeader));
>                 // COMPRESS-191 - verify the header checksum
>                 if (tais.getNextTarEntry().isCheckSumOK()) {
>                     return TAR;
>                 }
>             } catch (final Exception e) { // NOPMD // NOSONAR
>                 // can generate IllegalArgumentException as well
>                 // as IOException
>                 // autodetection, simply not a TAR
>                 // ignored
>             } finally {
>                 IOUtils.closeQuietly(tais);
>             }
> What I find is that most TIFFs, when they get to tais.getNextTarEntry(), fail 
> with an exception (i.e., they fall into the "simply not a TAR" case). However, this 
> TIFF does NOT fail here, so the checksum check passes and the method returns TAR. 
> This somewhat makes sense, as a fax-compressed TIFF apparently has a tar-like 
> internal structure.
> Note that the file is still recognized as a proper TIFF while the 
> CompositeDetector class loops through its detectors in its detect method: the 
> MimeTypes class, which is one of the implementations of the Detector 
> interface, detects it as image/tiff.
>  
>     public MediaType detect(InputStream input, Metadata metadata)
>             throws IOException {
>         MediaType type = MediaType.OCTET_STREAM;
>         for (Detector detector : getDetectors()) {
>             //short circuit via OverrideDetector
>             //can't rely on ordering because subsequent detector may
>             //change Override's to a specialization of Override's
>             if (detector instanceof OverrideDetector &&
>                     metadata.get(TikaCoreProperties.CONTENT_TYPE_OVERRIDE) != null) {
>                 return detector.detect(input, metadata);
>             }
>             MediaType detected = detector.detect(input, metadata);
>             if (registry.isSpecializationOf(detected, type)) {
>                 type = detected;
>             }
>         }
>         return type;
> However, since image/tiff isn't a specialization of application/x-tar, the 
> loop does not replace the type with image/tiff.
> My fix was to add a "<sub-class-of type="application/x-tar"/>" element to the 
> definition for image/tiff in the tika-mimetypes.xml file.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
