Hello,

I'm troubleshooting an issue with certain mp4 video files and Tika. After a
bunch of digging, it appears to be related to the ISO standard recorded in the
mp4 file. An mp4 with an ISO of 14496-12:2003 is detected as video/quicktime,
but an mp4 with an ISO of 14496-14 is detected as video/mp4, which is what I
was expecting for both files.
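One way to confirm which brand each file carries, independently of Tika, is to read the first box of the file directly. The sketch below assumes that the "ISO" value mentioned above corresponds to the major brand stored in the leading ftyp box (which the -brand mp42 ffmpeg flag used further down suggests), and that the ftyp box is the first box in the file. The class name FtypBrand and the method readMajorBrand are made up for illustration and are not part of Tika.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: prints the major brand of each file given.
public class FtypBrand {

    /**
     * Returns the 4-character major brand, or null if the stream
     * does not start with an ftyp box (assumption: ftyp comes first).
     */
    public static String readMajorBrand(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] type = new byte[4];
        byte[] brand = new byte[4];
        try {
            data.readInt();        // box size (big-endian), not needed here
            data.readFully(type);  // box type, expected to be "ftyp"
            data.readFully(brand); // major brand, for example "isom" or "mp42"
        } catch (EOFException e) {
            return null;           // fewer than 12 bytes available
        }
        if (!"ftyp".equals(new String(type, StandardCharsets.US_ASCII))) {
            return null;
        }
        return new String(brand, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                System.out.println(name + ": " + readMajorBrand(in));
            }
        }
    }
}
```

On Java 11 or later this can be run as a single source file, for example: java FtypBrand.java output.mp4 output_mp42.mp4. If the two files print different brands, that would support the idea that the detection difference comes from the brand and not from the audio or video streams.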

I've tried troubleshooting this based on the following Stack Overflow
discussion, using the answer by Dasch33:
https://stackoverflow.com/questions/48021617/why-does-apache-tika-with-an-get-mp4-file-respond-with-a-content-type-of-video

Their test code does return a different result: it reports the problematic
file as application/mp4. If I understand how mp4 files are handled, this
should only occur if the file is empty, has neither video nor audio, or is
subtitles only. However, the files I've been given to test with contain audio
and video. Furthermore, I generated mp4 videos with ffmpeg using the older and
the newer ISO standard, and I can replicate the issue with them.

Below is some example test code that I used to replicate the issue.

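import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mp4.MP4Parser;
import org.apache.tika.sax.BodyContentHandler;

public class Main {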
    public static void main(String[] args) throws IOException {
        //create mp4 that's detected as video/quicktime - ISO 14496-12:2003
        //ffmpeg -t 60 -f lavfi -i color=c=black:s=640x480 -c:v libx264 -tune stillimage -pix_fmt yuv420p output.mp4

        //convert to mp4 that's detected as video/mp4 - ISO 14496-14
        //ffmpeg -i output.mp4 -c copy -map 0 -brand mp42 output_mp42.mp4

        System.out.println("Tika test running...");

        File fileDisplaysAsVideoQuickTime = new File("output.mp4");
        File fileDisplayAsVideoMP4 = new File("output_mp42.mp4");

        printData(fileDisplaysAsVideoQuickTime);
        System.out.println("\n\n");
        System.out.println("--------------------------------------------------");
        printData(fileDisplayAsVideoMP4);
    }

    public static void printData(final File file) {
        String result;

        //Current detection code from prod
        try (final TikaInputStream tikaIS = TikaInputStream.get(new FileInputStream(file))) {

            final Metadata metadata = new Metadata();
            final Detector detector = new DefaultDetector(MimeTypes.getDefaultMimeTypes());

            result = detector.detect(tikaIS, metadata).toString();

            System.out.println(result);
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }

        //Detection code based on stackoverflow example
        //https://stackoverflow.com/questions/48021617/why-does-apache-tika-with-an-get-mp4-file-respond-with-a-content-type-of-video
        try (final TikaInputStream tikaIS = TikaInputStream.get(new FileInputStream(file))) {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();

            ParseContext pcontext = new ParseContext();

            //lower-case variable name so it does not shadow the MP4Parser class
            MP4Parser mp4Parser = new MP4Parser();
            mp4Parser.parse(tikaIS, handler, metadata, pcontext);

            System.out.println("Contents of the document: " + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();

            for(String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Maven dependencies:

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.22</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.22</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parser-audiovideo-module</artifactId>
            <version>2.9.2</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.15.1</version>
        </dependency>

Any insights or documentation that I should read would be greatly appreciated.

Thanks,
David
