I had forgotten about this, too:
https://github.com/apache/tika/blob/777543d0ac2051bc2dce7b719a22c94019919ffb/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L253

In the section linked above, the MP4 parser already tries to update the media type during the parse.
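
To make the idea concrete, here is a rough, standalone sketch of what looking past the first few bytes could involve: walk the top-level boxes until an ftyp box turns up, then map its major brand to a media type. This is not Tika code. The class name FtypSniffer, both method names, and the small brand table are made up for illustration, and a real detector would also need to consider compatible brands and stream mark/reset.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: finds the 'ftyp' box even when
// it is not the first box, and guesses a media type from its major brand.
public class FtypSniffer {

    // Returns a media type guess, or null if no usable 'ftyp' box is found.
    static String guessType(InputStream in) {
        DataInputStream data = new DataInputStream(in);
        try {
            while (true) {
                // Box header: 32-bit big-endian size, then a 4-character type.
                long size = data.readInt() & 0xFFFFFFFFL;
                byte[] typeBytes = new byte[4];
                data.readFully(typeBytes);
                String type = new String(typeBytes, StandardCharsets.US_ASCII);
                long headerLen = 8;
                if (size == 1) {
                    // A size of 1 means a 64-bit size follows the type.
                    size = data.readLong();
                    headerLen = 16;
                }
                if ("ftyp".equals(type)) {
                    byte[] brand = new byte[4];
                    data.readFully(brand);
                    return brandToType(new String(brand, StandardCharsets.US_ASCII));
                }
                if (size < headerLen) {
                    // Size 0 (box runs to end of file) or a malformed header.
                    return null;
                }
                long toSkip = size - headerLen;
                while (toSkip > 0) {
                    int n = data.skipBytes((int) Math.min(toSkip, Integer.MAX_VALUE));
                    if (n <= 0) {
                        return null; // ran out of data before the box ended
                    }
                    toSkip -= n;
                }
            }
        } catch (IOException e) {
            // Includes EOFException: no 'ftyp' box before the stream ended.
            return null;
        }
    }

    // Illustrative brand table only; the real list of brands is much longer.
    static String brandToType(String majorBrand) {
        switch (majorBrand) {
            case "qt  ":
                return "video/quicktime";
            case "isom":
            case "mp41":
            case "mp42":
                return "video/mp4";
            default:
                return null;
        }
    }

    // Builds a minimal box (size + type + payload) for trying the sketch out.
    static byte[] box(String type, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(8 + payload.length);
        buf.putInt(8 + payload.length);
        buf.put(type.getBytes(StandardCharsets.US_ASCII));
        buf.put(payload);
        return buf.array();
    }
}
```

Something along these lines could sit in a dedicated detector or feed the parse-time update; which of the two is preferable is the open question in the thread below.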

On Mon, Apr 29, 2024 at 10:28 AM Tim Allison <[email protected]> wrote:

> I agree with Nick.
>
> You can better understand the magic based algorithms we're using for
> detection by searching for mp4 and quicktime in this file:
> https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
>
> A middle ground is to have the MP4 parser update the file type during the
> parse. This is not as clean as having a devoted detector, but it does
> prevent having to parse the file twice -- once for detection and once for
> parsing.
>
> We do this with Adobe Illustrator files and PDFs -- we need to open the
> file as a PDDocument (which requires a full parse) and then look for
> specifically Illustrator components.
>
> We do have code that specifically looks for items in the user data box (at
> parse time, not detection time) that would identify quicktime:
> https://github.com/apache/tika/blob/ffedad80199b43aaa32fe9308abc0535f0140b16/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/boxes/TikaUserDataBox.java#L82
>
> We don't currently do anything with that value. :(
>
> If you can share example files, I could take a look at improving the
> magic? But I think Nick's point about that being a whack-a-mole exercise
> when what we need is container based detection is spot-on.
>
> Best,
>
>         Tim
>
> On Sat, Apr 27, 2024 at 6:31 AM Nick Burch <[email protected]> wrote:
>
>> On Fri, 26 Apr 2024, Mauler, David wrote:
>> > I'm in the process of troubleshooting an issue with certain mp4 video
>> > files and tika. After a bunch of digging, it appears to be related to
>> > whatever ISO is set for the mp4 file. An mp4 with an ISO of
>> > 14496-12:2003 will be detected as video/quicktime but an mp4 with an ISO
>> > of 14496-14 is detected as video/mp4 which is what I was expecting for
>> > both files.
>>
>> Depends where in the file the type box lives. At the moment, we only have
>> mime-magic based detection for the Quicktime / MP4 family of formats. If
>> the right box in the container is at the start we're ok, if it comes later
>> we can't tell with just a mime magic signature.
>>
>> What we really need is a container-aware detector for the file format,
>> similar to what we have for Zip files, and for the Ogg family. That would
>> properly process the file in a format-aware way, checking for the contents
>> to correctly identify the type.
>>
>> The long-standing issue is https://issues.apache.org/jira/browse/TIKA-2935
>> - do you have a few days of spare coding time you could put towards this,
>> and/or a bit of budget to sponsor someone to?
>>
>> Thanks
>> Nick
>>
>
