RE: Detection problem: Parsing scientific source codes for geoscientists

Ken Krugler Wed, 22 Apr 2015 07:39:44 -0700

I'm looking into whether detection & parsing code from a previous project could 
be open-sourced.


If that happened, we'd get support for many, many languages - though not GrADS 
or NCL (NCAR Command Language).

But the infrastructure would be there to easily add support for any missing 
languages.
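
To make the idea concrete, here is a minimal sketch of filename-based 
detection for the languages listed in the quoted message below. It is not the 
code from the previous project and it does not use Tika's API; the class name 
SciSourceDetector and the media type strings are assumptions chosen purely 
for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class SciSourceDetector {
    // Hypothetical extension-to-type table. The type names are illustrative
    // assumptions, not registered media types or confirmed Tika type names.
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("f", "text/x-fortran");
        TYPES.put("f77", "text/x-fortran");
        TYPES.put("f90", "text/x-fortran");
        TYPES.put("py", "text/x-python");
        TYPES.put("r", "text/x-rsrc");
        TYPES.put("m", "text/x-matlab");   // ambiguous: .m is also Objective-C
        TYPES.put("gs", "text/x-grads");
        TYPES.put("ncl", "text/x-ncl");
        TYPES.put("pro", "text/x-idl");    // ambiguous: .pro is also Prolog
    }

    // Returns the assumed type for a file name, or text/plain when the
    // extension is missing or unknown.
    public static String detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "text/plain";
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(ext, "text/plain");
    }

    public static void main(String[] args) {
        for (String name : new String[] {"spctime.f", "wavelet.m", "plot.ncl"}) {
            System.out.println(name + " -> " + detect(name));
        }
    }
}
```

Extension lookup alone cannot resolve the ambiguous suffixes noted in the 
comments, so a real detector would likely need to inspect file content as 
well.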

-- Ken

> From: Oh, Ji-Hyun (329F-Affiliate)
> Sent: April 21, 2015 10:54:16am PDT
> To: [email protected]
> Subject: Detection problem: Parsing scientific source codes for geoscientists
> 
> Hi Tika friends,
> 
> I am currently engaged in a project funded by the National Science 
> Foundation. Our goal is to develop a research-friendly environment where 
> geoscientists, like me, can easily find the source code they need. According 
> to a survey, scientists spend a considerable amount of their time processing 
> data instead of doing actual science. Based on my experience as a climate 
> scientist, there is a set of analysis tools that are used most frequently in 
> atmospheric science, so it would be helpful if these tools could be easily 
> shared among scientists. The difficulty is that the tools are written in 
> various scientific languages, so we are trying to provide metadata for 
> source code stored in public repositories to help scientists select source 
> code for their own use.
> 
> As a first step, I listed the file formats that are widely used in climate 
> science.
> 
> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
> 
> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but 
> when I used Tika to obtain the content type of the files (with suffixes .f, 
> .f90, .m), Tika detected them as text/plain:
> 
> ohjihyun% tika -m spctime.f
> 
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f
> 
> ohjihyun% tika -m wavelet.m
> Content-Encoding: ISO-8859-1
> Content-Length: 5868
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: wavelet.m
> 
> I checked that Tika gives the correct content type (text/x-java-source) for 
> a Java file:
> ohjihyun% tika -m UrlParser.java
> Content-Encoding: ISO-8859-1
> Content-Length: 2178
> Content-Type: text/x-java-source
> LoC: 70
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
> resourceName: UrlParser.java
> 
> Should I build a parser for each file format to get an exact content-type, as 
> Java has SourceCodeParser?
> Thank you in advance for your insightful comments.
> 
> Ji-Hyun

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




