I had forgotten about this, too: https://github.com/apache/tika/blob/777543d0ac2051bc2dce7b719a22c94019919ffb/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L253
We try to update the media type during the parse in the above section.

On Mon, Apr 29, 2024 at 10:28 AM Tim Allison <[email protected]> wrote:

> I agree with Nick.
>
> You can better understand the magic based algorithms we're using for
> detection by searching for mp4 and quicktime in this file:
> https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
>
> A middle ground is to have the MP4 parser update the file type during the
> parse. This is not as clean as having a devoted detector, but it does
> prevent having to parse the file twice -- once for detection and once for
> parsing.
>
> We do this with Adobe Illustrator files and PDFs -- we need to open the
> file as a PDDocument (which requires a full parse) and then look for
> specifically Illustrator components.
>
> We do have code that specifically looks for items in the user data box (at
> parse time, not detection time) that would identify quicktime:
> https://github.com/apache/tika/blob/ffedad80199b43aaa32fe9308abc0535f0140b16/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/boxes/TikaUserDataBox.java#L82
>
> We don't currently do anything with that value. :(
>
> If you can share example files, I could take a look at improving the
> magic? But I think Nick's point about that being a whack-a-mole exercise
> when what we need is container based detection is spot-on.
>
> Best,
>
> Tim
>
> On Sat, Apr 27, 2024 at 6:31 AM Nick Burch <[email protected]> wrote:
>
>> On Fri, 26 Apr 2024, Mauler, David wrote:
>> > I'm in the process of troubleshooting an issue with certain mp4 video
>> > files and tika. After a bunch of digging, it appears to be related to
>> > whatever ISO is set for the mp4 file. An mp4 with an ISO of
>> > 14496-12:2003 will be detected as video/quicktime but an mp4 with an ISO
>> > of 14496-14 is detected as video/mp4 which is what I was expecting for
>> > both files.
>>
>> Depends where in the file the type box lives. At the moment, we only have
>> mime-magic based detection for the Quicktime / MP4 family of formats. If
>> the right box in the container is at the start we're ok, if it comes later
>> we can't tell with just a mime magic signature
>>
>> What we really need is a container-aware detector for the file format,
>> similar to what we have for Zip files, and for the Ogg family. That would
>> properly process the file in a format-aware way, checking for the contents
>> to correctly identify the type.
>>
>> The long-standing issue is
>> https://issues.apache.org/jira/browse/TIKA-2935
>> - do you have a few days of spare coding time you could put towards this,
>> and/or a bit of budget to sponsor someone to?
>>
>> Thanks
>> Nick
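
To make Nick's point about the type box concrete, here is a rough sketch of what "container-aware" means compared with a fixed-offset magic match: walk the top-level boxes until an ftyp box turns up, wherever it sits, then read its major brand. This is not Tika code. FtypSniffer, majorBrand and guessMediaType are made-up names, the brand-to-type mapping is only an illustrative guess covering a few brands, and it assumes the bytes are already in memory, which a real detector would avoid.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of Tika.
public class FtypSniffer {

    /** Walks top-level boxes and returns the major brand of the first ftyp box, if any. */
    public static Optional<String> majorBrand(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data); // big-endian by default, as the format requires
        int pos = 0;
        while (pos + 8 <= data.length) {
            long size = buf.getInt(pos) & 0xFFFFFFFFL;           // 32-bit box size
            String type = new String(data, pos + 4, 4, StandardCharsets.US_ASCII);
            int header = 8;
            if (size == 1) {                                      // 64-bit size follows the type
                if (pos + 16 > data.length) {
                    return Optional.empty();
                }
                size = buf.getLong(pos + 8);
                header = 16;
            } else if (size == 0) {                               // box runs to end of file
                size = data.length - pos;
            }
            if (size < header) {
                return Optional.empty();                          // malformed box
            }
            if ("ftyp".equals(type)) {
                if (pos + header + 4 > data.length) {
                    return Optional.empty();
                }
                return Optional.of(
                        new String(data, pos + header, 4, StandardCharsets.US_ASCII));
            }
            if (size > data.length - pos) {
                return Optional.empty();                          // box claims more than we have
            }
            pos += (int) size;                                    // skip to the next top-level box
        }
        return Optional.empty();
    }

    /** Illustrative mapping only; a real detector would need many more brands. */
    public static String guessMediaType(byte[] data) {
        String brand = majorBrand(data).orElse("");
        switch (brand) {
            case "qt  ":
                return "video/quicktime";
            case "isom":
            case "mp41":
            case "mp42":
                return "video/mp4";
            default:
                return "application/octet-stream";
        }
    }
}
```

Because the loop skips over whatever boxes come before ftyp, the answer does not depend on where the type box lives, which is exactly what a magic signature at a fixed offset cannot give us.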
