Hello, I'm troubleshooting an issue with certain mp4 video files and Tika. After a bunch of digging, it appears to be related to whichever ISO standard is set for the mp4 file. An mp4 with an ISO of 14496-12:2003 is detected as video/quicktime, but an mp4 with an ISO of 14496-14 is detected as video/mp4, which is what I was expecting for both files.
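For reference, here is a minimal sketch for checking what each file declares, independently of Tika. As far as I understand the container format, an mp4 normally starts with an ftyp box: a 4-byte big-endian size, the four characters "ftyp", a 4-byte major brand, a 4-byte minor version, and then a list of 4-byte compatible brands. The class and method names (FtypDump, readBrands) are made up for this sketch, and it assumes the ftyp box comes first in the file and uses a plain 32-bit size.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: dumps the brands from a leading ftyp box.
public class FtypDump {

    /** Returns the major brand followed by the compatible brands. */
    public static List<String> readBrands(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        // Assumption: the first box is ftyp and its size fits the 32-bit size field.
        long boxSize = data.readInt() & 0xFFFFFFFFL;
        String boxType = readFourCc(data);
        if (!"ftyp".equals(boxType)) {
            throw new IOException("First box is '" + boxType + "', not 'ftyp'");
        }
        List<String> brands = new ArrayList<>();
        brands.add(readFourCc(data)); // major brand
        data.readInt();               // minor version, not needed here
        // The rest of the box is a list of 4-byte compatible brands.
        for (long offset = 16; offset + 4 <= boxSize; offset += 4) {
            brands.add(readFourCc(data));
        }
        return brands;
    }

    private static String readFourCc(DataInputStream data) throws IOException {
        byte[] code = new byte[4];
        data.readFully(code);
        return new String(code, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                List<String> brands = readBrands(in);
                System.out.println(name + ": major=" + brands.get(0)
                        + " compatible=" + brands.subList(1, brands.size()));
            }
        }
    }
}
```

On Java 11 or later this can be run directly as `java FtypDump.java output.mp4 output_mp42.mp4`, which makes it easy to compare the two test files side by side.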
I've tried troubleshooting this based on the following discussion on stackoverflow, using the answer by Dasch33.

https://stackoverflow.com/questions/48021617/why-does-apache-tika-with-an-get-mp4-file-respond-with-a-content-type-of-video

Using their test code does return a different result. It shows the problematic file as application/mp4. If I'm understanding how mp4 files are handled, this should only occur if the file is empty, has no video or audio, or is subtitles only. However, the files I've been provided to test with contain both audio and video. Further, I've generated mp4 videos using ffmpeg with the older and the newer ISO standard, and I can replicate the issue.

Below is the example test code that I used to replicate the issue.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MimeTypes;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.mp4.MP4Parser;
    import org.apache.tika.sax.BodyContentHandler;

    public class Main {

        public static void main(String[] args) throws IOException {
            // Create an mp4 that's detected as video/quicktime - ISO 14496-12:2003
            // ffmpeg -t 60 -f lavfi -i color=c=black:s=640x480 -c:v libx264 -tune stillimage -pix_fmt yuv420p output.mp4
            // Convert to an mp4 that's detected as video/mp4 - ISO 14496-14
            // ffmpeg -i output.mp4 -c copy -map 0 -brand mp42 output_mp42.mp4
            System.out.println("Tika test running...");
            File fileDisplaysAsVideoQuickTime = new File("output.mp4");
            File fileDisplayAsVideoMP4 = new File("output_mp42.mp4");

            printData(fileDisplaysAsVideoQuickTime);
            System.out.println("\n\n");
            System.out.println("--------------------------------------------------");
            printData(fileDisplayAsVideoMP4);
        }

        public static void printData(final File file) {
            // Current detection code from prod
            try (final TikaInputStream tikaIS = TikaInputStream.get(new FileInputStream(file))) {
                final Metadata metadata = new Metadata();
                final Detector detector = new DefaultDetector(MimeTypes.getDefaultMimeTypes());
                final String result = detector.detect(tikaIS, metadata).toString();
                System.out.println(result);
                System.out.println();
            } catch (Exception e) {
                e.printStackTrace();
            }

            // Detection code based on the stackoverflow example
            // https://stackoverflow.com/questions/48021617/why-does-apache-tika-with-an-get-mp4-file-respond-with-a-content-type-of-video
            try (final TikaInputStream tikaIS = TikaInputStream.get(new FileInputStream(file))) {
                BodyContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                ParseContext pcontext = new ParseContext();
                MP4Parser mp4Parser = new MP4Parser();
                mp4Parser.parse(tikaIS, handler, metadata, pcontext);

                System.out.println("Contents of the document: " + handler.toString());
                System.out.println("Metadata of the document:");
                // Print every metadata field the parser populated
                for (String name : metadata.names()) {
                    System.out.println(name + ": " + metadata.get(name));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Maven dependencies

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.22</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.22</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parser-audiovideo-module</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.15.1</version>
    </dependency>

Any insights or documentation that I should read would be greatly appreciated.

Thanks,
David
