On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
As a first step, I listed the file formats that are widely used in
climate science.
FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)
I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
when I used Tika to obtain the content type of the files (with suffixes
.f, .f90, .m), Tika detected them as text/plain.
Your first step then is probably to try to work out how to identify these
files, and add suitable mime magic for them, if possible. At the same
time, make sure the common file extensions for them are listed against
their mime entries, and make sure we have mime entries for all of these
formats.
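
To make the two ingredients concrete (file extensions plus content-based
"magic"), here is a rough, Tika-free sketch. The class name, the media
type strings and the Fortran keyword pattern are all assumptions made up
for illustration; they are not the entries in tika-mimetypes.xml, and the
lookup order here is not a claim about how Tika itself combines globs and
magic.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of Tika.
final class SourceTypeSketch {

    // Assumed extension-to-type table; the media type names are placeholders.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "f", "text/x-fortran",
            "f77", "text/x-fortran",
            "f90", "text/x-fortran",
            "py", "text/x-python",
            "r", "text/x-rsrc",
            "m", "text/x-matlab",
            "gs", "text/x-grads",
            "ncl", "text/x-ncl",
            "pro", "text/x-idl");

    // Assumed content hint: a line that opens a Fortran program unit.
    private static final Pattern FORTRAN_HINT = Pattern.compile(
            "^\\s*(program|subroutine|module)\\s+\\w+",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Guess a type from the file name first, then from the content. */
    static String detect(String fileName, String content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);

        String byName = BY_EXTENSION.get(ext);
        if (byName != null) {
            return byName;
        }
        // No known extension: fall back to the content hint.
        if (FORTRAN_HINT.matcher(content).find()) {
            return "text/x-fortran";
        }
        // Nothing matched, which is the text/plain result described above.
        return "text/plain";
    }
}
```

Note that extensions such as .m and .pro are shared with other languages,
which is one reason content-based hints are worth having wherever a
reliable pattern exists.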
I'd probably recommend creating one JIRA per format with detection issues,
then use that to track the work to add/expand the mime type, attach a
small sample file, add detection unit tests etc.
Should I build a parser for each file format to get an exact
content-type, as is done for Java with SourceCodeParser?
As Lewis has said, once detection is working, you'll then want to add the
missing parsers. You might find that the current SourceCodeParser could,
with a little bit of work, handle some of these formats itself. Additional
libraries+parsers may well be needed for the others. I'd suggest one JIRA
per format you want a parser for that we lack, then use those to track the
work.
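
As a very rough idea of the lightweight metadata a per-format source
parser might pull out once detection works, here is a small sketch that
counts non-blank lines and comment lines. Everything in it is assumed for
illustration: the class name, the placeholder media type strings and the
single line-comment prefix per format. It is not a description of what
SourceCodeParser actually does.

```java
import java.util.Map;

// Hypothetical illustration only: not part of Tika.
final class SourceStatsSketch {

    // Assumed line-comment prefix per (placeholder) media type.
    private static final Map<String, String> COMMENT_PREFIX = Map.of(
            "text/x-fortran", "!",
            "text/x-python", "#",
            "text/x-matlab", "%",
            "text/x-idl", ";");

    /** Returns {non-blank lines, lines starting with the comment prefix}. */
    static int[] count(String mediaType, String source) {
        String prefix = COMMENT_PREFIX.get(mediaType);
        int nonBlank = 0;
        int comments = 0;
        for (String line : source.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines are not counted
            }
            nonBlank++;
            if (prefix != null && trimmed.startsWith(prefix)) {
                comments++;
            }
        }
        return new int[] {nonBlank, comments};
    }
}
```

A real parser would need more than this (fixed-form Fortran comments and
block comments, for example), so treat it only as a starting point for
the per-format JIRAs.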
Good luck!
Nick