Re: Detection problem: Parsing scientific source codes for geoscientists

Nick Burch Tue, 21 Apr 2015 17:08:09 -0700

On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:

For the first step, I listed the file formats that are widely used in climate science:
FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)
I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain the content type of the files (with suffixes .f, .f90, .m), Tika detected these files as text/plain.

Your first step then is probably to try to work out how to identify these files, and add suitable mime magic for them, if possible. At the same time, make sure the common file extensions for them are listed against their mime entries, and make sure we have mime entries for all of these formats.

I'd probably recommend creating one JIRA per format with detection issues, then use that to track the work to add/expand the mime type, attach a small sample file, add detection unit tests etc.
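
As a rough illustration of the two detection signals mentioned above (file extensions and content-based magic), here is a minimal sketch in Python. It is not Tika's API or its detection algorithm: the mime type names, the GLOBS and HINTS tables, and the detect function are all hypothetical, and checking extensions before content is a simplification chosen for brevity.

```python
import fnmatch
import re

# Hypothetical extension table. The type names are illustrative only
# and are not copied from tika-mimetypes.xml.
GLOBS = {
    "*.f": "text/x-fortran",
    "*.f77": "text/x-fortran",
    "*.f90": "text/x-fortran",
    "*.py": "text/x-python",
    "*.R": "text/x-rsrc",
    "*.m": "text/x-matlab",
    "*.gs": "text/x-grads",
    "*.ncl": "text/x-ncl",
    "*.pro": "text/x-idl",
}

# Hypothetical content hints, standing in for mime magic.
# Real source files may need more robust patterns than these.
HINTS = [
    (re.compile(r"^\s*(program|subroutine|module)\s+\w+", re.I | re.M),
     "text/x-fortran"),
    (re.compile(r"^\s*function\s+.*=\s*\w+\s*\(", re.M),
     "text/x-matlab"),
]


def detect(name, text=""):
    """Return a guessed mime type from the file name, then the content."""
    # 1. Case-sensitive match of the whole name against each glob.
    for pattern, mime in GLOBS.items():
        if fnmatch.fnmatchcase(name, pattern):
            return mime
    # 2. Fall back to content hints when the extension is unknown.
    for regex, mime in HINTS:
        if regex.search(text):
            return mime
    # 3. Nothing matched: report generic text.
    return "text/plain"
```

For example, detect("model.f90") is resolved by the extension table alone, while detect("unknown", "      program hello\n") falls through to the content hints. A small sample file per format, as suggested above, would be the natural input for unit tests of either path.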

Should I build a parser for each file format to get an exact content-type, as Java has SourceCodeParser?

As Lewis has said, once detection is working, you'll then want to add the missing parsers. You might find that the current SourceCodeParser could, with a little bit of work, handle some of these formats itself. Additional libraries+parsers may well be needed for the others. I'd suggest one JIRA per format you want a parser for that we lack, then use those to track the work.


Good luck!

Nick
