I agree with Nick.

You can get a better sense of the magic-based detection rules we're using
by searching for mp4 and quicktime in this file:
https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

A middle ground is to have the MP4 parser update the file type during the
parse. This is not as clean as having a devoted detector, but it does
prevent having to parse the file twice -- once for detection and once for
parsing.

We do this with Adobe Illustrator files and PDFs -- we need to open the
file as a PDDocument (which requires a full parse) and then look
specifically for Illustrator components.

We do have code that specifically looks for items in the user data box (at
parse time, not detection time) that would identify QuickTime:
https://github.com/apache/tika/blob/ffedad80199b43aaa32fe9308abc0535f0140b16/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/boxes/TikaUserDataBox.java#L82

We don't currently do anything with that value. :(

If you can share example files, I could take a look at improving the magic.
But I think Nick's point is spot-on: tuning the magic is a whack-a-mole
exercise when what we really need is container-based detection.
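
To make the container-based idea concrete, here is a rough sketch of walking
the top-level boxes until an ftyp box turns up, instead of expecting it at a
fixed offset. This is not Tika code: the class and method names (FtypSniffer,
findMajorBrand, guessType), the maxBoxes limit, and the small brand-to-type
table are all made up for illustration, and a real detector would need to
look at compatible brands and much more.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;

// Hypothetical sketch, not part of Tika.
class FtypSniffer {

    // Illustrative subset of major brands; an assumption, not a complete list.
    private static final Set<String> MP4_BRANDS = Set.of("isom", "mp41", "mp42");

    /** Returns the major brand of the first top-level ftyp box, or null if none is found. */
    static String findMajorBrand(InputStream in, int maxBoxes) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] fourcc = new byte[4];
        try {
            for (int i = 0; i < maxBoxes; i++) {
                // Box header: 32-bit big-endian size, then a 4-byte type.
                long size = data.readInt() & 0xFFFFFFFFL;
                data.readFully(fourcc);
                String type = new String(fourcc, StandardCharsets.ISO_8859_1);
                long headerLen = 8;
                if (size == 1) {
                    // A size of 1 means a 64-bit size follows the type.
                    size = data.readLong();
                    headerLen = 16;
                }
                if ("ftyp".equals(type)) {
                    // The major brand is the first 4 bytes of the ftyp payload.
                    data.readFully(fourcc);
                    return new String(fourcc, StandardCharsets.ISO_8859_1);
                }
                if (size == 0 || size < headerLen) {
                    // Size 0 means the box runs to end of file; smaller than
                    // the header means the data is malformed. Stop either way.
                    return null;
                }
                skipFully(data, size - headerLen);
            }
        } catch (EOFException e) {
            // Ran out of data before finding an ftyp box.
        }
        return null;
    }

    /** Skips exactly n bytes, or throws EOFException if the stream ends first. */
    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) {
                    throw new EOFException();
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }

    /** Maps a major brand to a media type using the illustrative table above. */
    static String guessType(String majorBrand) {
        if ("qt  ".equals(majorBrand)) {
            return "video/quicktime";
        }
        if (majorBrand != null && MP4_BRANDS.contains(majorBrand)) {
            return "video/mp4";
        }
        return "application/octet-stream";
    }
}
```

Calling FtypSniffer.guessType(FtypSniffer.findMajorBrand(stream, 16)) on a
file whose ftyp box comes after, say, a free box would still see the brand,
which is the case fixed-offset magic can't handle.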

Best,

        Tim

On Sat, Apr 27, 2024 at 6:31 AM Nick Burch <[email protected]> wrote:

> On Fri, 26 Apr 2024, Mauler, David wrote:
> > I'm in the process of troubleshooting an issue with certain mp4 video
> > files and tika. After a bunch of digging, it appears to be related to
> > whatever ISO is set for the mp4 file. An mp4 with an ISO of
> > 14496-12:2003 will be detected as video/quicktime but an mp4 with an ISO
> > of 14496-14 is detected as video/mp4 which is what I was expecting for
> > both files.
>
> Depends where in the file the type box lives. At the moment, we only have
> mime-magic based detection for the Quicktime / MP4 family of formats. If
> the right box in the container is at the start we're ok, if it comes later
> we can't tell with just a mime magic signature
>
> What we really need is a container-aware detector for the file format,
> similar to what we have for Zip files, and for the Ogg family. That would
> properly process the file in a format-aware way, checking for the contents
> to correctly identify the type.
>
> The long-standing issue is https://issues.apache.org/jira/browse/TIKA-2935
> - do you have a few days of spare coding time you could put towards this,
> and/or a bit of budget to sponsor someone to?
>
> Thanks
> Nick
>
