Hi all,

Thank you for your replies. Unfortunately, I don't have the cycles to work on a 
proper implementation. But if I do have some personal time, I may look into 
what's needed and see if I can wrap my head around it.

That said, Tim, I do have two samples I generated with ffmpeg. How would you 
like me to send them to you?


Dave

________________________________
From: Tim Allison <[email protected]>
Sent: Monday, April 29, 2024 10:42 AM
To: [email protected] <[email protected]>
Cc: Muruganandam, Srinivasan <[email protected]>
Subject: Re: Unexpected behavior when inspecting mp4 files with different ISO



I had forgotten about this, too: 
https://github.com/apache/tika/blob/777543d0ac2051bc2dce7b719a22c94019919ffb/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java#L253

We try to update the media type during the parse in the above section.

On Mon, Apr 29, 2024 at 10:28 AM Tim Allison <[email protected]> wrote:
I agree with Nick.

You can better understand the magic-based algorithms we're using for detection 
by searching for mp4 and quicktime in this file: 
https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

A middle ground is to have the MP4 parser update the file type during the 
parse. This is not as clean as having a dedicated detector, but it does avoid 
having to parse the file twice -- once for detection and once for parsing.

We do this with Adobe Illustrator files and PDFs -- we need to open the file as 
a PDDocument (which requires a full parse) and then look specifically for 
Illustrator components.

We do have code that specifically looks for items in the user data box (at 
parse time, not detection time) that would identify quicktime: 
https://github.com/apache/tika/blob/ffedad80199b43aaa32fe9308abc0535f0140b16/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/src/main/java/org/apache/tika/parser/mp4/boxes/TikaUserDataBox.java#L82

We don't currently do anything with that value. :(

If you can share example files, I could take a look at improving the magic. But 
I think Nick's point about that being a whack-a-mole exercise when what we need 
is container-based detection is spot-on.

Best,

        Tim

On Sat, Apr 27, 2024 at 6:31 AM Nick Burch <[email protected]> wrote:
On Fri, 26 Apr 2024, Mauler, David wrote:
> I'm in the process of troubleshooting an issue with certain mp4 video
> files and tika. After a bunch of digging, it appears to be related to
> whatever ISO is set for the mp4 file. An mp4 with an ISO of
> 14496-12:2003 will be detected as video/quicktime but an mp4 with an ISO
> of 14496-14 is detected as video/mp4 which is what I was expecting for
> both files.

Depends where in the file the type box lives. At the moment, we only have
mime-magic based detection for the Quicktime / MP4 family of formats. If
the right box in the container is at the start, we're ok; if it comes later,
we can't tell with just a mime magic signature.

What we really need is a container-aware detector for the file format,
similar to what we have for Zip files, and for the Ogg family. That would
properly process the file in a format-aware way, checking for the contents
to correctly identify the type.
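
To illustrate the container-aware idea, here is a minimal sketch in Python. It is not Tika's code and not a proposed implementation for TIKA-2935. It walks the top-level boxes of an ISO base media file until it finds the ftyp box, wherever that sits, and reads the major and compatible brands. The function names and the small brand-to-type table are assumptions made up for this example; a real detector would need a much fuller table and more defensive handling.

```python
import io
import struct

# Illustrative brand table: an assumption for this sketch, not Tika's mapping.
BRAND_TO_MIME = {
    b"qt  ": "video/quicktime",
    b"isom": "video/mp4",
    b"mp41": "video/mp4",
    b"mp42": "video/mp4",
}


def find_ftyp(stream):
    """Walk top-level boxes; return (major_brand, compatible_brands) or None."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return None  # ran out of data without seeing an ftyp box
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:
            # A size of 1 means a 64-bit box size follows the type field.
            ext = stream.read(8)
            if len(ext) < 8:
                return None
            size = struct.unpack(">Q", ext)[0]
            header_len = 16
        if box_type == b"ftyp":
            # A size of 0 means the box runs to the end of the file.
            payload = stream.read() if size == 0 else stream.read(size - header_len)
            if len(payload) < 8:
                return None
            major = payload[0:4]
            # Bytes 4..8 hold the minor version; brands follow in 4-byte units.
            compatible = [payload[i:i + 4] for i in range(8, len(payload) - 3, 4)]
            return major, compatible
        if size == 0 or size < header_len:
            return None  # last box in the file, or a malformed size
        stream.seek(size - header_len, io.SEEK_CUR)  # skip this box's payload


def guess_mime(stream):
    """Map the first recognised brand (major brand first) to a media type."""
    found = find_ftyp(stream)
    if found is None:
        return None
    major, compatible = found
    for brand in [major] + compatible:
        if brand in BRAND_TO_MIME:
            return BRAND_TO_MIME[brand]
    return None
```

Calling `guess_mime(open(path, "rb"))` on a sample file would report which branch of the illustrative table its brands fall into, which can help show why two mp4 files end up with different types.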

The long-standing issue is 
https://issues.apache.org/jira/browse/TIKA-2935
- do you have a few days of spare coding time you could put towards this,
and/or a bit of budget to sponsor someone to?

Thanks
Nick
