Detection problem: Parsing scientific source codes for geoscientists

Oh, Ji-Hyun (329F-Affiliate) Tue, 21 Apr 2015 10:55:47 -0700

Hi Tika friends,

I am currently engaged in a project funded by the National Science Foundation. 
Our goal is to develop a research-friendly environment where geoscientists, 
like me, can easily find the source code they need. According to a survey, 
scientists spend a considerable amount of their time processing data instead 
of doing actual science. Based on my experience as a climate scientist, there 
is a set of analysis tools that are used again and again in atmospheric 
science, so it would be helpful if these tools could be shared easily among 
scientists. The difficulty is that the tools are written in various scientific 
languages, so we are trying to provide metadata for source code stored in 
public repositories to help scientists select code for their own use.


As a first step, I listed the file formats that are widely used in climate 
science.

FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)
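
The list above can be written down as a small lookup table. This is only a minimal sketch in Python using the standard library. The function name guess_language and the case-insensitive matching are assumptions made for illustration, and the table has nothing to do with how Tika itself detects types.

```python
from pathlib import Path

# Extension-to-language table taken from the list above.
# Keys are lower-case; matching ignores case (an assumption), so ".R" and ".r" are treated alike.
LANGUAGE_BY_EXTENSION = {
    ".f": "FORTRAN",
    ".f77": "FORTRAN",
    ".f90": "FORTRAN",
    ".py": "Python",
    ".r": "R",
    ".m": "Matlab",
    ".gs": "GrADS",
    ".ncl": "NCL",
    ".pro": "IDL",
}


def guess_language(filename):
    """Return the language name for a file name, or None if the suffix is not in the table."""
    suffix = Path(filename).suffix.lower()
    return LANGUAGE_BY_EXTENSION.get(suffix)
```

A suffix alone can be ambiguous (.m, for instance, is also used for Objective-C sources), so a table like this is a first guess rather than a reliable detector.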

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when 
I used Tika to obtain the content type of these files (with suffixes .f, .f90, 
.m), it detected them as text/plain:

ohjihyun% tika -m spctime.f

Content-Encoding: ISO-8859-1
Content-Length: 16613
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: spctime.f

ohjihyun% tika -m wavelet.m
Content-Encoding: ISO-8859-1
Content-Length: 5868
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: wavelet.m
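
To compare the detected type across many files, the "Key: value" output shown above can be read into a dictionary. The following is a rough sketch in Python. The helper name parse_tika_metadata is hypothetical, and the sample text is copied from the wavelet.m output above rather than produced by calling Tika.

```python
def parse_tika_metadata(text):
    """Parse 'Key: value' lines into a dict of lists (keys such as X-Parsed-By repeat)."""
    metadata = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank lines and anything that is not 'Key: value'
        metadata.setdefault(key.strip(), []).append(value.strip())
    return metadata


# Sample copied from the wavelet.m output above.
sample = """Content-Encoding: ISO-8859-1
Content-Length: 5868
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: wavelet.m
"""

meta = parse_tika_metadata(sample)
# Drop the "; charset=..." parameter to keep only the media type.
media_type = meta["Content-Type"][0].split(";")[0].strip()
print(media_type)  # prints "text/plain"
```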

I checked that Tika gives the correct content type (text/x-java-source) for a 
Java file:
ohjihyun% tika -m UrlParser.java
Content-Encoding: ISO-8859-1
Content-Length: 2178
Content-Type: text/x-java-source
LoC: 70
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
resourceName: UrlParser.java

Should I build a parser for each file format to get an exact content type, as 
Java has with SourceCodeParser?
Thank you in advance for your insightful comments.

Ji-Hyun
