Thank you very much for the detailed guidance, Nick. I was also thinking that I first need to work out how to detect each format. I will create one JIRA per format so that we can track the work related to each file format.
Jihyun

________________________________________
From: Nick Burch [[email protected]]
Sent: Tuesday, April 21, 2015 5:06 PM
To: [email protected]
Subject: Re: Detection problem: Parsing scientific source codes for geoscientists

On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
> For the first step, I listed the file formats that are widely used in
> climate science.
>
> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
>
> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but
> when I used Tika to obtain the content type of the files (with suffix .f,
> .f90, .m), Tika detected these files as text/plain

Your first step then is probably to try to work out how to identify these
files, and add suitable mime magic for them, if possible. At the same
time, make sure the common file extensions for them are listed against
their mime entries, and make sure we have mime entries for all of these
formats.

I'd probably recommend creating one JIRA per format with detection issues,
then use that to track the work to add/expand the mime type, attach a
small sample file, add detection unit tests etc.

> Should I build a parser for each file format to get an exact
> content-type, as Java has SourceCodeParser?

As Lewis has said, once detection is working, you'll then want to add the
missing parsers. You might find that the current SourceCodeParser could,
with a little bit of work, handle some of these formats itself. Additional
libraries+parsers may well be needed for the others. I'd suggest one JIRA
per format you want a parser for that we lack, then use those to track the
work.

Good luck!

Nick
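
For illustration only, a mime entry of the kind discussed above (file extension globs plus mime magic) might look roughly like the sketch below, written in the same XML shape as the entries in tika-mimetypes.xml. The type name text/x-ncl, the comment text, the magic string and its priority are all assumptions made for this example; none of them come from this thread or from the shipped file, and the magic string in particular would need checking against real sample files before use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <!-- Hypothetical entry: the type name "text/x-ncl" is an assumption,
       not a registered or existing Tika type. -->
  <mime-type type="text/x-ncl">
    <_comment>NCAR Command Language script</_comment>
    <!-- Assumed magic: many NCL scripts start by loading shared
         libraries from $NCARG_ROOT. Low priority, since ordinary
         text could begin the same way. -->
    <magic priority="20">
      <match value="load &quot;$NCARG_ROOT" type="string" offset="0"/>
    </magic>
    <!-- File extension listed against the mime entry -->
    <glob pattern="*.ncl"/>
    <!-- Falls back to plain text handling if no dedicated parser exists -->
    <sub-class-of type="text/plain"/>
  </mime-type>
</mime-info>
```

An entry like this only covers detection. A small sample file and a detection unit test, as suggested above, would still be needed to confirm that the glob and magic behave as intended.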
