On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
For the first step, I listed the file formats that are widely used in climate science.

FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain the content type of the files (with suffixes .f, .f90, .m), Tika detected these files as text/plain.

Your first step then is probably to try to work out how to identify these files, and add suitable mime magic for them, if possible. At the same time, make sure the common file extensions for them are listed against their mime entries, and make sure we have mime entries for all of these formats.
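
To make the idea concrete, here is a rough standalone sketch of combining content hints ("magic") with an extension lookup. It is not Tika code and does not reflect how Tika implements detection. The media type names, the regular expressions, and the guess_type helper are all placeholders made up for illustration, and the names actually registered in tika-mimetypes.xml may differ.

```python
import re
from pathlib import PurePath

# Hypothetical extension table. The media type names are placeholders and
# may not match what is registered in tika-mimetypes.xml.
EXTENSION_TYPES = {
    ".f": "text/x-fortran",
    ".f77": "text/x-fortran",
    ".f90": "text/x-fortran",
    ".py": "text/x-python",
    ".r": "text/x-r",
    ".m": "text/x-matlab",
    ".gs": "text/x-grads",
    ".ncl": "text/x-ncl",
    ".pro": "text/x-idl",
}

# Hypothetical content hints: patterns looked for near the start of the file.
# Real mime magic would need to be chosen and tested against sample files.
CONTENT_HINTS = [
    (re.compile(r"^\s*(program|subroutine|module)\s+\w+", re.I | re.M),
     "text/x-fortran"),
    (re.compile(r"^\s*function\s+.*=\s*\w+\s*\(", re.M),
     "text/x-matlab"),
]


def guess_type(filename, head_text):
    """Return a media type guess from the file's leading text and its name."""
    # Content hints are tried first, since they do not depend on the name.
    for pattern, media_type in CONTENT_HINTS:
        if pattern.search(head_text):
            return media_type
    # Otherwise fall back to the extension, and finally to plain text.
    ext = PurePath(filename).suffix.lower()
    return EXTENSION_TYPES.get(ext, "text/plain")
```

For example, guess_type("square.m", "function y = square(x)\n") would pick the Matlab placeholder type from the content, while a file with an unknown extension and no matching hint falls through to text/plain, which is the behaviour reported above.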

I'd probably recommend creating one JIRA per format with detection issues, then use that to track the work to add/expand the mime type, attach a small sample file, add detection unit tests etc.

Should I build a parser for each file format to get an exact content-type, as Java has SourceCodeParser?

As Lewis has said, once detection is working, you'll then want to add the missing parsers. You might find that the current SourceCodeParser could, with a little bit of work, handle some of these formats itself. Additional libraries and parsers may well be needed for the others. I'd suggest one JIRA per format you want a parser for that we lack, then use those to track the work.

Good luck!

Nick
