I'm looking into whether detection & parsing code from a previous project could be open-sourced.
If that happened, we'd get support for many, many languages - though not GrADS or NCAR. But the infrastructure would be there to easily add support for any missing languages.

-- Ken

> From: Oh, Ji-Hyun (329F-Affiliate)
> Sent: April 21, 2015 10:54:16am PDT
> To: [email protected]
> Subject: Detection problem: Parsing scientific source codes for geoscientists
>
> Hi Tika friends,
>
> I am currently engaged in a project funded by the National Science Foundation.
> Our goal is to develop a research-friendly environment where geoscientists,
> like me, can easily find the source code they need. According to a survey,
> scientists spend a considerable amount of their time processing data
> instead of doing actual science. Based on my experience as a climate
> scientist, there is a set of analysis tools that are used most frequently in
> atmospheric science. Therefore, it could be helpful if these tools could be
> easily shared among scientists. The thing is that the tools are written in
> various scientific languages, so we are trying to provide the metadata of
> source code stored in public repositories to help scientists select source
> code for their own use.
>
> As a first step, I listed the file formats that are widely used in climate
> science.
>
> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
>
> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I
> used Tika to obtain the content type of the files (with suffixes .f, .f90, .m),
> Tika detected these files as text/plain:
>
> ohjihyun% tika -m spctime.f
>
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f
>
> ohjihyun% tika -m wavelet.m
> Content-Encoding: ISO-8859-1
> Content-Length: 5868
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: wavelet.m
>
> I checked that Tika gives the correct content type (text/x-java-source) for a
> Java file:
>
> ohjihyun% tika -m UrlParser.java
> Content-Encoding: ISO-8859-1
> Content-Length: 2178
> Content-Type: text/x-java-source
> LoC: 70
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
> resourceName: UrlParser.java
>
> Should I build a parser for each file format to get an exact content type, as
> Java has SourceCodeParser?
> Thank you in advance for your insightful comments.
>
> Ji-Hyun

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
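
As a rough illustration of the suffix-based part of the detection problem described in the question, here is a minimal Python sketch that maps the listed suffixes to a content type and falls back to text/plain otherwise. This is not Tika's implementation or API. The function name `guess_type`, the table `SUFFIX_TO_TYPE`, and every media type string in it are assumptions made up for the example; they have not been checked against tika-mimetypes.xml.

```python
from pathlib import Path

# Hypothetical suffix-to-type table for the languages listed in the message.
# The media type names are illustrative assumptions, not values taken from
# tika-mimetypes.xml.
SUFFIX_TO_TYPE = {
    ".f": "text/x-fortran",
    ".f77": "text/x-fortran",
    ".f90": "text/x-fortran",
    ".py": "text/x-python",
    ".r": "text/x-rsrc",
    ".m": "text/x-matlab",
    ".gs": "text/x-grads",
    ".ncl": "text/x-ncl",
    ".pro": "text/x-idl",
}


def guess_type(filename, default="text/plain"):
    """Return a content type based only on the file suffix (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    return SUFFIX_TO_TYPE.get(suffix, default)


if __name__ == "__main__":
    for name in ("spctime.f", "wavelet.m", "plot.R", "notes.txt"):
        print(name, "->", guess_type(name))
```

A suffix alone can be ambiguous (for example, .m is also used for Objective-C sources), so a real detector would probably need to look at file content as well.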
