Let's move discussion here: https://issues.apache.org/jira/browse/TIKA-3991

@Richard, if you'd like access to our JIRA, see:
https://selfserve.apache.org/jira-account.html

On Wed, Mar 22, 2023 at 10:22 AM Tim Allison <talli...@apache.org> wrote:
>
> Thank you, Richard, for raising this.  Looking at these file
> formats, it appears that crw is based on CIFF, cr2 on TIFF, and
> cr3 on QuickTime.
>
> For some file formats we use a single type with a version parameter,
> e.g. application/x-this-app; version=1.0 and
> application/x-this-app; version=2.0.  For others, we create separate
> main mimes as you've done:
>
> image/x-raw-canon
> image/x-raw-canon2
> image/x-raw-canon3
>
> I think we want to keep cr2 as a subtype of TIFF and cr3 as a subtype
> of mpeg/quicktime so that those parsers pick them up automatically.
>
> Another option would be to make cr2 and cr3 subtypes of crw, but then
> add cr2 as a supported type in our TIFFParser and cr3 as a supported
> type in our mpeg/quicktime parser.
>
> Perhaps Nick might chime in on how we want to handle this.
>
> I think we should improve our detection of these at the very least.  I
> found some examples for cr2; if we can get examples for crw and cr3,
> that'd be helpful.  The dropfiles link isn't working for me at the
> moment. :(
>
> Some useful links (I want to document these for me.  You probably
> already know them!)
> [0] https://exiftool.org/canon_raw.html
> http://fileformats.archiveteam.org/wiki/CR2
> https://github.com/lclevy/canon_cr3
>
> On Wed, Mar 22, 2023 at 5:12 AM Richard Toolan
> <richard.too...@synchronoss.com> wrote:
> >
> > Hello,
> >
> > We’ve noticed that Tika is incorrectly detecting .cr3 files as
> > video/quicktime, whereas other raw files are detected as image/tiff
> > (including the .cr3’s predecessor, the .cr2). I’ve uploaded a sample
> > file here: https://dropfiles.org/j8CS4Snr (taken from this review:
> > https://www.photographyblog.com/reviews/canon_eos_r10_review#google_vignette)
> >
> > When we add a custom-mimetypes.xml file with a mime-type entry like this:
> >
> > <mime-type type="image/x-raw-canon">
> >   <_comment>Canon raw image</_comment>
> >   <sub-class-of type="image/tiff"/>
> >   <glob pattern="*.crw"/>
> >   <glob pattern="*.cr2"/>
> >   <glob pattern="*.cr3"/>
> > </mime-type>
> >
> > The .cr3 file is still identified as video/quicktime, but when we add
> > the configuration below, Tika matches it to something close to what we
> > want:
> >
> > <mime-type type="image/x-raw-canon3">
> >   <_comment>Canon raw image</_comment>
> >   <sub-class-of type="video/quicktime"/>
> >   <glob pattern="*.cr3"/>
> > </mime-type>
> >
> > But this won’t give us our desired output as we’re hoping to group all 
> > Canon raw images under the same mime-type.
> >
> > Do you have any ideas how to get this working?
> >
> > We’re using tika-core 2.7.0 in a Java 8 project.
> >
> > Thank you,
> >
> > Richard
> >
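
For reference, here is a rough sketch of what signature-based detection of the three containers discussed above (CIFF for crw, TIFF for cr2, QuickTime/ISO base media for cr3) might look like in plain Java, with no Tika dependency. The class and method names (CanonRawSniffer, sniff) are made up for illustration. The offsets are my reading of the format descriptions linked in the thread: "HEAPCCDR" at offset 6 for CIFF, "CR" at offset 8 after a little-endian TIFF header for cr2, and an "ftyp" box with major brand "crx " for cr3. It assumes little-endian files only, and it is not how Tika's own detection is implemented.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: classifies a file by its leading bytes.
final class CanonRawSniffer {

    /** Returns "crw", "cr2", "cr3", or "unknown" for the given leading bytes. */
    static String sniff(byte[] head) {
        // CRW (CIFF): byte-order mark "II", 4-byte header length, then "HEAPCCDR".
        if (matches(head, 0, "II") && matches(head, 6, "HEAPCCDR")) {
            return "crw";
        }
        // CR2: little-endian TIFF header "II*\0", IFD offset, then "CR" at offset 8.
        if (matches(head, 0, "II*\0") && matches(head, 8, "CR")) {
            return "cr2";
        }
        // CR3: ISO base media file, "ftyp" box at offset 4 with major brand "crx ".
        if (matches(head, 4, "ftyp") && matches(head, 8, "crx ")) {
            return "cr3";
        }
        return "unknown";
    }

    /** True if the ASCII bytes of {@code ascii} appear in {@code data} at {@code offset}. */
    private static boolean matches(byte[] data, int offset, String ascii) {
        byte[] expected = ascii.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || data.length < offset + expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hand-built headers for illustration; real files would be read from disk.
        byte[] cr2 = {'I', 'I', 0x2A, 0, 0x10, 0, 0, 0, 'C', 'R', 2, 0};
        byte[] cr3 = {0, 0, 0, 0x18, 'f', 't', 'y', 'p', 'c', 'r', 'x', ' '};
        byte[] crw = {'I', 'I', 0x1A, 0, 0, 0, 'H', 'E', 'A', 'P', 'C', 'C', 'D', 'R'};
        System.out.println(sniff(cr2));
        System.out.println(sniff(cr3));
        System.out.println(sniff(crw));
    }
}
```

A check like this only looks at content, so it does not depend on the glob patterns or the sub-class-of hierarchy in custom-mimetypes.xml. That is the reason a plain TIFF or a generic QuickTime file falls through to "unknown" here instead of being grouped with the Canon types.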
