[ 
https://issues.apache.org/jira/browse/TIKA-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145299#comment-13145299
 ] 

PNS commented on TIKA-697:
--------------------------

Detection of Unix AR archive types (see http://en.wikipedia.org/wiki/Ar_(Unix)) 
is very simple and can indeed be done by checking for the 8 "magic" bytes 
(0x21, 0x3C, 0x61, 0x72, 0x63, 0x68, 0x3E, 0x0A), i.e. the ASCII string 
"!<arch>" followed by a line feed.
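
As a rough illustration of the magic-byte check on its own, outside Tika, 
here is a minimal sketch. The class name ArMagic and the method name 
startsWithArMagic are made up for this example and are not part of Tika; the 
sketch also assumes the stream supports mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class ArMagic {
    // The 8-byte AR signature: the ASCII string "!<arch>" plus a line feed.
    static final byte[] AR_HEADER = {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

    /**
     * Returns true if the stream starts with the AR signature.
     * The stream position is restored before returning.
     */
    static boolean startsWithArMagic(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(AR_HEADER.length);
        try {
            for (byte expected : AR_HEADER) {
                // read() returns -1 at end of stream, which never matches
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "!<arch>\ndebian-binary".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithArMagic(new ByteArrayInputStream(sample)));
    }
}
```
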

What needs to be changed in the Tika code is at least the TextDetector.detect() 
method, so that it returns an AR media type if the first 8 bytes of the archive 
are the AR signature.

The AR MediaType needs to be added in class org.apache.tika.mime.MediaType and 
it will probably be a custom one, since apparently there is no IANA-registered 
MIME type for AR (see http://en.wikipedia.org/wiki/List_of_archive_formats and 
http://www.iana.org/assignments/media-types/index.html).

Assuming the existence of a statement like

        public static final MediaType APPLICATION_AR = application("x-ar");

in class org.apache.tika.mime.MediaType, the following is a quick 
implementation of the proposed changes in the TextDetector.detect() method:

        // Code immediately after the static initialization block of the
        // IS_CONTROL_BYTE[] array

        // Unix AR signature: the ASCII string "!<arch>" plus a line feed
        private static final byte[] AR_HEADER = new byte[]
                             {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

        @Override
        public MediaType detect(InputStream input, Metadata metadata)
        throws IOException {
                if (input == null) {
                        return MediaType.OCTET_STREAM;
                }

                input.mark(NUMBER_OF_BYTES_TO_TEST);
                // Kept local (not a field) so that concurrent detect() calls
                // on a shared detector instance cannot interfere
                boolean checkArHeader = true;
                try {
                        for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
                                int ch = input.read();
                                if (ch == -1) {
                                        if (i > 0) {
                                                return MediaType.TEXT_PLAIN;
                                        } else {
                                                // See https://issues.apache.org/jira/browse/TIKA-483
                                                return MediaType.OCTET_STREAM;
                                        }
                                } else if (ch < IS_CONTROL_BYTE.length && IS_CONTROL_BYTE[ch]) {
                                        return MediaType.OCTET_STREAM;
                                } else if (checkArHeader) {
                                        // See https://issues.apache.org/jira/browse/TIKA-697
                                        if (i >= AR_HEADER.length || AR_HEADER[i] != ch) {
                                                // Mismatch: stop comparing against the signature
                                                checkArHeader = false;
                                        } else if (i == AR_HEADER.length - 1) {
                                                // All 8 signature bytes matched
                                                return MediaType.APPLICATION_AR;
                                        }
                                }
                        }
                        return MediaType.TEXT_PLAIN;
                } finally {
                        input.reset();
                }
        }

Essentially, the additions are just the new MediaType.APPLICATION_AR constant, 
the 2 new variables (AR_HEADER, checkArHeader) and the "else if 
(checkArHeader)" control block.

I have tested the above with numerous combinations of files and it works as 
expected.
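
For anyone who wants to try the logic without patching Tika, a check along 
these lines should work. This is only a standalone sketch and not Tika code: 
the class name ArSniffer, the plain strings standing in for MediaType 
constants, the 512-byte limit and the control-byte table are all assumptions 
made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the patched detector; not part of Tika.
class ArSniffer {
    static final int NUMBER_OF_BYTES_TO_TEST = 512; // assumed limit

    // Assumed table: bytes below 0x20 count as control bytes, except
    // tab, line feed, form feed, carriage return and escape.
    static final boolean[] IS_CONTROL_BYTE = new boolean[0x20];
    static {
        Arrays.fill(IS_CONTROL_BYTE, true);
        IS_CONTROL_BYTE[0x09] = false;
        IS_CONTROL_BYTE[0x0A] = false;
        IS_CONTROL_BYTE[0x0C] = false;
        IS_CONTROL_BYTE[0x0D] = false;
        IS_CONTROL_BYTE[0x1B] = false;
    }

    static final byte[] AR_HEADER = {0x21, 0x3c, 0x61, 0x72, 0x63, 0x68, 0x3e, 0x0a};

    static String detect(InputStream input) throws IOException {
        if (input == null) {
            return "application/octet-stream";
        }
        input.mark(NUMBER_OF_BYTES_TO_TEST);
        boolean checkArHeader = true;
        try {
            for (int i = 0; i < NUMBER_OF_BYTES_TO_TEST; i++) {
                int ch = input.read();
                if (ch == -1) {
                    // Empty input is treated as binary, anything else as text
                    return i > 0 ? "text/plain" : "application/octet-stream";
                } else if (ch < IS_CONTROL_BYTE.length && IS_CONTROL_BYTE[ch]) {
                    return "application/octet-stream";
                } else if (checkArHeader) {
                    if (i >= AR_HEADER.length || AR_HEADER[i] != ch) {
                        checkArHeader = false;
                    } else if (i == AR_HEADER.length - 1) {
                        return "application/x-ar";
                    }
                }
            }
            return "text/plain";
        } finally {
            input.reset();
        }
    }

    static InputStream ascii(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(detect(ascii("!<arch>\ndebian-binary")));
        System.out.println(detect(ascii("just some text\n")));
        System.out.println(detect(new ByteArrayInputStream(new byte[] {0x00, 0x01})));
    }
}
```

The cases worth covering are a real AR signature, ordinary text, input with 
control bytes, empty input, and a truncated or nearly matching signature, 
which should fall back to text/plain.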

                
> Tika reports the content type of AR archives as "text/plain"
> ------------------------------------------------------------
>
>                 Key: TIKA-697
>                 URL: https://issues.apache.org/jira/browse/TIKA-697
>             Project: Tika
>          Issue Type: Bug
>         Environment: Linux (CentOS 5.6)
>            Reporter: PNS
>            Priority: Trivial
>
> The Tika.detect(InputStream) method returns "text/plain" for AR archives 
> created with the Linux "Create Archive" option of Nautilus (available via 
> right-clicking on a file).
> The Apache Commons Compress "autodetection" code of the ArchiveStreamFactory 
> looks at the first 12 bytes of the stream and correctly identifies the type 
> as AR.


        
