RE: Detection problem: Parsing scientific source codes for geoscientists

Ken Krugler Wed, 22 Apr 2015 12:59:25 -0700

> From: Oh, Ji-Hyun (329F-Affiliate)
> Sent: April 22, 2015 12:36:28pm PDT
> To: [email protected]
> Subject: RE: Detection problem: Parsing scientific source codes for 
> geoscientists
> 
> Hi Ken,
> Thank you very much for your comment. 
> Could you tell me which previous project you are looking into?


It's the Krugle code search product.

It's being sold as enterprise software, but they might be willing to open-source 
the parsing code.

-- Ken


> ________________________________________
> From: Ken Krugler [[email protected]]
> Sent: Wednesday, April 22, 2015 7:38 AM
> To: [email protected]
> Subject: RE: Detection problem: Parsing scientific source codes for 
> geoscientists
> 
> I'm looking into whether detection & parsing code from a previous project 
> could be open-sourced.
> 
> If that happened, we'd get support for many, many languages - though not 
> GrADS or NCAR.
> 
> But the infrastructure would be there to easily add support for any missing 
> languages.
> 
> -- Ken
> 
>> From: Oh, Ji-Hyun (329F-Affiliate)
>> Sent: April 21, 2015 10:54:16am PDT
>> To: [email protected]
>> Subject: Detection problem: Parsing scientific source codes for geoscientists
>> 
>> Hi Tika friends,
>> 
>> I am currently engaged in a project funded by the National Science 
>> Foundation. Our goal is to develop a research-friendly environment where 
>> geoscientists, like me, can easily find the source code they need. According 
>> to a survey, scientists spend a considerable amount of their time processing 
>> data instead of doing actual science. In my experience as a climate 
>> scientist, there is a set of analysis tools that are used most frequently in 
>> atmospheric science, so it would help if these tools could be easily shared 
>> among scientists. The difficulty is that the tools are written in various 
>> scientific languages, so we are trying to provide metadata for source code 
>> stored in public repositories to help scientists select code for their own 
>> use.
>> 
>> As a first step, I listed the file formats that are widely used in climate 
>> science.
>> 
>> FORTRAN (.f, .f90, .f77)
>> Python (.py)
>> R (.R)
>> Matlab (.m)
>> GrADS (Grid Analysis and Display System) (.gs)
>> NCL (NCAR Command Language) (.ncl)
>> IDL (Interactive Data Language) (.pro)
>> 
>> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but 
>> when I used Tika to obtain the content type of the files (with suffixes .f, 
>> .f90, .m), Tika detected these files as text/plain:
>> 
>> ohjihyun% tika -m spctime.f
>> 
>> Content-Encoding: ISO-8859-1
>> Content-Length: 16613
>> Content-Type: text/plain; charset=ISO-8859-1
>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
>> resourceName: spctime.f
>> 
>> ohjihyun% tika -m wavelet.m
>> Content-Encoding: ISO-8859-1
>> Content-Length: 5868
>> Content-Type: text/plain; charset=ISO-8859-1
>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
>> resourceName: wavelet.m
>> 
>> I confirmed that Tika gives the correct content type (text/x-java-source) 
>> for a Java file:
>> ohjihyun% tika -m UrlParser.java
>> Content-Encoding: ISO-8859-1
>> Content-Length: 2178
>> Content-Type: text/x-java-source
>> LoC: 70
>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>> X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
>> resourceName: UrlParser.java
>> 
>> Should I build a parser for each file format to get an exact content-type, 
>> as Java has SourceCodeParser?
>> Thank you in advance for your insightful comments.
>> 
>> Ji-Hyun
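
The question above mixes two separate steps: detecting a content type and 
parsing the content. As a rough illustration of the detection step only, here 
is a minimal stand-alone sketch of extension-based lookup for the file types 
listed in the original message. It does not use Tika or its API. The function 
name, the table, and every type string in it are hypothetical placeholders, 
not names taken from tika-mimetypes.xml.

```python
from pathlib import PurePath

# Hypothetical extension-to-type table for the languages listed in the thread.
# The type strings are illustrative placeholders, NOT values confirmed to
# exist in tika-mimetypes.xml.
EXTENSION_TYPES = {
    ".f": "text/x-fortran",
    ".f77": "text/x-fortran",
    ".f90": "text/x-fortran",
    ".py": "text/x-python",
    ".r": "text/x-rsrc",
    ".m": "text/x-matlab",   # ambiguous: .m is also used by Objective-C
    ".gs": "text/x-grads",
    ".ncl": "text/x-ncl",
    ".pro": "text/x-idl",    # ambiguous: .pro is also used by other tools
}


def detect_by_extension(filename, default="text/plain"):
    """Return a content type based only on the file name's suffix.

    Falls back to `default` when the suffix is unknown, which mirrors the
    text/plain result shown in the thread for unrecognized files.
    """
    suffix = PurePath(filename).suffix.lower()  # ".R" and ".r" match alike
    return EXTENSION_TYPES.get(suffix, default)


if __name__ == "__main__":
    for name in ("spctime.f", "wavelet.m", "plot.R", "notes.txt"):
        print(name, "->", detect_by_extension(name))
```

A lookup like this only labels the file. Extracting metadata such as the LoC 
field in the Java example would still need a per-language parser, and 
ambiguous suffixes such as .m and .pro would need content-based checks on top 
of the name.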

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
