Let's move discussion here: https://issues.apache.org/jira/browse/TIKA-3991
@Richard, if you'd like access to our JIRA, see:
https://selfserve.apache.org/jira-account.html

On Wed, Mar 22, 2023 at 10:22 AM Tim Allison <talli...@apache.org> wrote:
>
> Thank you, Richard, for raising this. In looking at these file
> formats, it looks like crw is based on ciff, cr2 is based on tiff and
> cr3 is based on quicktime.
>
> For some file formats we do, application/x-this-app; version=1.0,
> application/x-thisapp; version=2.0. For others, we create separate
> main mimes as you've done:
>
> image/x-raw-canon
> image/x-raw-canon2
> image/x-raw-canon3
>
> I think we want to keep cr2 as subtype of TIFF and cr3 as subtype of
> mpeg/quicktime so that those parsers automatically correctly pick them
> up.
>
> Another option would be to subtype cr2 to crw and cr3 to crw, but then
> add cr2 to a supported format in our TIFFParser and cr3 to our
> mpeg/quicktime parser.
>
> Perhaps Nick might chime in on how we want to handle this.
>
> I think we should improve our detection of these at the very least. I
> found some examples for cr2, if we can get examples for crw and cr3,
> that'd be helpful. The dropfiles link isn't working for me at the
> moment. :(
>
> Some useful links (I want to document these for me. You probably
> already know them!)
> [0] https://exiftool.org/canon_raw.html
> http://fileformats.archiveteam.org/wiki/CR2
> https://github.com/lclevy/canon_cr3
>
> On Wed, Mar 22, 2023 at 5:12 AM Richard Toolan
> <richard.too...@synchronoss.com> wrote:
> >
> > Hello,
> >
> > We’ve noticed that Tika is incorrectly detecting the file .cr3 as
> > video/quicktime, other raw files are detected as image/tiff (including the
> > .cr3’s predecessor the .cr2). I’ve uploaded a sample file here
> > https://dropfiles.org/j8CS4Snr (that was taken from this review
> > https://www.photographyblog.com/reviews/canon_eos_r10_review#google_vignette)
> >
> > When we add a custom-mimetypes.xml file with a mime-type entry like this:
> >
> > <mime-type type="image/x-raw-canon">
> >   <_comment>Canon raw image</_comment>
> >   <sub-class-of type="image/tiff"/>
> >   <glob pattern="*.crw"/>
> >   <glob pattern="*.cr2"/>
> >   <glob pattern="*.cr3"/>
> > </mime-type>
> >
> > The .cr3 file is still identified as video/quicktime but when we add the
> > below configuration Tika matches it to something close to what we want:
> >
> > <mime-type type="image/x-raw-canon3">
> >   <_comment>Canon raw image</_comment>
> >   <sub-class-of type="video/quicktime"/>
> >   <glob pattern="*.cr3"/>
> > </mime-type>
> >
> > But this won’t give us our desired output as we’re hoping to group all
> > Canon raw images under the same mime-type.
> >
> > Do you have any ideas how to get this working?
> >
> > We’re using tika-core 2.7.0 in a Java 8 project.
> >
> > Thank you,
> >
> > Richard
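
For reference while detection is being discussed: the three containers mentioned in this thread (ciff, tiff, quicktime) can usually be told apart by their leading bytes. The sketch below is not Tika code and does not use the Tika API. It is a standalone Java 8 illustration, and the class name CanonRawSniffer, the methods sniff and mimeFor, and the choice to report every variant as image/x-raw-canon are all assumptions made for this example. The byte signatures are taken from the format notes linked in the thread ("HEAPCCDR" at offset 6 for CRW, "CR" at offset 8 after a little-endian TIFF header for CR2, an ftyp box with major brand "crx " for CR3) and should be checked against real sample files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CanonRawSniffer {

    // Hypothetical single type for all three variants (the grouping asked for in the thread).
    static final String CANON_RAW = "image/x-raw-canon";

    /** Returns "crw", "cr2" or "cr3" for a recognised header, otherwise null. */
    public static String sniff(byte[] head) {
        if (head == null) {
            return null;
        }
        // CRW (CIFF): byte order "II", header length, then "HEAPCCDR" at offset 6.
        if (matches(head, 0, "II") && matches(head, 6, "HEAPCCDR")) {
            return "crw";
        }
        // CR2: little-endian TIFF header "II*\0", then "CR" at offset 8.
        if (matches(head, 0, "II*\0") && matches(head, 8, "CR")) {
            return "cr2";
        }
        // CR3: ISO base media "ftyp" box whose major brand is "crx ".
        if (matches(head, 4, "ftyp") && matches(head, 8, "crx ")) {
            return "cr3";
        }
        return null;
    }

    /** Maps any recognised Canon raw variant to the one shared (assumed) type. */
    public static String mimeFor(byte[] head) {
        return sniff(head) == null ? null : CANON_RAW;
    }

    // True when the ASCII bytes of 'ascii' appear in 'data' starting at 'offset'.
    private static boolean matches(byte[] data, int offset, String ascii) {
        byte[] expected = ascii.getBytes(StandardCharsets.US_ASCII);
        if (data.length < offset + expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CanonRawSniffer <file>");
            return;
        }
        // Only the first 16 bytes are needed; a shorter file leaves the rest zeroed.
        byte[] head = new byte[16];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) > 0) {
                filled += n;
            }
        }
        System.out.println(sniff(head) + " -> " + mimeFor(head));
    }
}
```

Running it against a file prints the variant and the shared type, or "null -> null" when the header is not recognised. Whether the equivalent checks belong in magic entries in the mime types file or in parser-level handling is the open question above.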