RE: Detection problem: Parsing scientific source codes for geoscientists

Oh, Ji-Hyun (329F-Affiliate) Wed, 22 Apr 2015 12:29:55 -0700

Thank you very much for the detailed guidance, Nick.

I was also thinking that I first need to work out how to detect each format.
I will create one JIRA per format so that we can track the work
related to each file format.

Jihyun
________________________________________
From: Nick Burch [[email protected]]
Sent: Tuesday, April 21, 2015 5:06 PM
To: [email protected]
Subject: Re: Detection problem: Parsing scientific source codes for 
geoscientists

On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
> For the first step, I listed the file formats that are widely used in
> climate science.
>
> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
>
> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
> when I used Tika to obtain the content type of the files (with suffixes .f,
> .f90, .m), Tika detected these files as text/plain.

Your first step then is probably to try to work out how to identify these
files, and add suitable mime magic for them, if possible. At the same
time, make sure the common file extensions for them are listed against
their mime entries, and make sure we have mime entries for all of these
formats.

I'd probably recommend creating one JIRA per format with detection issues,
then use that to track the work to add/expand the mime type, attach a
small sample file, add detection unit tests etc.
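
As a rough illustration of the two detection signals discussed above (file
extension and content-based "magic"), here is a minimal standalone sketch in
Python. It is not Tika code: the extension table, the regex hints, and the
type names (text/x-grads, text/x-ncl, and so on) are all hypothetical
placeholders and may not match what tika-mimetypes.xml actually uses.

```python
import re
from pathlib import PurePath

# Hypothetical extension table. The type names are placeholders for
# illustration and are not necessarily those used in tika-mimetypes.xml.
EXTENSION_TYPES = {
    ".f": "text/x-fortran",
    ".f77": "text/x-fortran",
    ".f90": "text/x-fortran",
    ".py": "text/x-python",
    ".r": "text/x-rsrc",
    ".m": "text/x-matlab",
    ".gs": "text/x-grads",
    ".ncl": "text/x-ncl",
    ".pro": "text/x-idl",
}

# Hypothetical content hints, standing in for "mime magic". Real magic for
# source code is hard because there is no fixed header, so these are only
# loose heuristics.
CONTENT_HINTS = [
    (re.compile(r"^\s*(program|subroutine|module)\s+\w+", re.I | re.M),
     "text/x-fortran"),
    (re.compile(r"^#!.*\bpython", re.M), "text/x-python"),
]


def detect(filename, text):
    """Guess a type from the extension first, then from the content."""
    ext = PurePath(filename).suffix.lower()
    if ext in EXTENSION_TYPES:
        return EXTENSION_TYPES[ext]
    for pattern, mime in CONTENT_HINTS:
        if pattern.search(text):
            return mime
    # Nothing matched, so fall back to plain text.
    return "text/plain"
```

For example, detect("model.f90", "") uses the extension alone, while
detect("solver", "      PROGRAM heat\n") has no extension and falls through
to the content hint. Extensions such as .m and .pro are shared with other
formats, which is one reason a content check is worth having in addition to
the extension list.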

> Should I build a parser for each file format to get an exact
> content-type, as Java has SourceCodeParser?

As Lewis has said, once detection is working, you'll then want to add the
missing parsers. You might find that the current SourceCodeParser could,
with a little bit of work, handle some of these formats itself. Additional
libraries+parsers may well be needed for the others. I'd suggest one JIRA
per format you want a parser for that we lack, then use those to track the
work.

Good luck!

Nick
