Wow Ken that would be stellar. Ji-Hyun and I are doing this work
as part of the NSF EarthCube project, working with Yolanda Gil
at USC/ISI:

http://geosoft-earthcube.org/

Our part is Tika + Nutch + Solr over GitHub and geosciences software.
The purpose of Ji-Hyun’s postdoc is to work in that area, so if Krugle
would be willing to do that, it would be awesomeness.

Cheers mate.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Ken Krugler <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, April 22, 2015 at 3:58 PM
To: "[email protected]" <[email protected]>
Subject: RE: Detection problem: Parsing scientific source codes for
geoscientists

>
>> From: Oh, Ji-Hyun (329F-Affiliate)
>> Sent: April 22, 2015 12:36:28pm PDT
>> To: [email protected]
>> Subject: RE: Detection problem: Parsing scientific source codes for
>>geoscientists
>> 
>> Hi Ken,
>> Thank you very much for your comment.
>> Could you tell me what kind of previous project you are looking into?
>
>It's the Krugle code search product.
>
>It's being sold as enterprise software, but they might be willing to
>open-source the parsing code.
>
>-- Ken
>
>
>> ________________________________________
>> From: Ken Krugler [[email protected]]
>> Sent: Wednesday, April 22, 2015 7:38 AM
>> To: [email protected]
>> Subject: RE: Detection problem: Parsing scientific source codes for
>>geoscientists
>> 
>> I'm looking into whether detection & parsing code from a previous
>>project could be open-sourced.
>> 
>> If that happened, we'd get support for many, many languages - though
>>not GrADS or NCAR.
>> 
>> But the infrastructure would be there to easily add support for any
>>missing languages.
>> 
>> -- Ken
>> 
>>> From: Oh, Ji-Hyun (329F-Affiliate)
>>> Sent: April 21, 2015 10:54:16am PDT
>>> To: [email protected]
>>> Subject: Detection problem: Parsing scientific source codes for
>>>geoscientists
>>> 
>>> Hi Tika friends,
>>> 
>>> I am currently engaged in a project funded by the National Science
>>>Foundation. Our goal is to develop a research-friendly environment
>>>where geoscientists, like me, can easily find the source code they
>>>need. According to a survey, scientists spend a considerable amount
>>>of their time processing data rather than doing actual science. Based
>>>on my experience as a climate scientist, there is a set of frequently
>>>and typically used analysis tools in atmospheric science, so it could
>>>be helpful if these tools were easily shared among scientists. The
>>>catch is that the tools are written in various scientific languages,
>>>so we are trying to provide metadata for source code stored in public
>>>repositories to help scientists select code for their own uses.
>>> 
>>> As a first step, I listed the file formats that are widely used in
>>>climate science.
>>> 
>>> FORTRAN (.f, .f90, .f77)
>>> Python (.py)
>>> R (.R)
>>> Matlab (.m)
>>> GrADS (Grid Analysis and Display System) (.gs)
>>> NCL (NCAR Command Language) (.ncl)
>>> IDL (Interactive Data Language) (.pro)
>>> 
>>> I checked that Fortran and Matlab are included in
>>>tika-mimetypes.xml, but when I used Tika to obtain the content type
>>>of files with the suffixes .f, .f90, and .m, Tika detected them as
>>>text/plain:
>>> 
>>> ohjihyun% tika -m spctime.f
>>> 
>>> Content-Encoding: ISO-8859-1
>>> Content-Length: 16613
>>> Content-Type: text/plain; charset=ISO-8859-1
>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
>>> resourceName: spctime.f
>>> 
>>> ohjihyun% tika -m wavelet.m
>>> Content-Encoding: ISO-8859-1
>>> Content-Length: 5868
>>> Content-Type: text/plain; charset=ISO-8859-1
>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
>>> resourceName: wavelet.m
>>> 
>>> I checked that Tika gives the correct content type
>>>(text/x-java-source) for a Java file:
>>> ohjihyun% tika -m UrlParser.java
>>> Content-Encoding: ISO-8859-1
>>> Content-Length: 2178
>>> Content-Type: text/x-java-source
>>> LoC: 70
>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>> X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
>>> resourceName: UrlParser.java
>>> 
>>> Should I build a parser for each file format to get an exact
>>>content type, the way Java has SourceCodeParser?
>>> Thank you in advance for your insightful comments.
>>> 
>>> Ji-Hyun
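
[Editor's note: one lightweight route worth sketching here, under stated
assumptions rather than as a tested fix. Tika merges any
custom-mimetypes.xml found at org/apache/tika/mime on the classpath into
its built-in tika-mimetypes.xml, so glob patterns for the missing
languages can be registered without writing a parser. The type names
below (text/x-grads, text/x-ncl, text/x-idl) are invented for
illustration, not registered MIME types:]

```xml
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <!-- Hypothetical types for languages Tika lacks glob patterns for.
       Each is declared a sub-class of text/plain so text handling
       still applies when no dedicated parser exists. -->
  <mime-type type="text/x-grads">
    <_comment>GrADS (Grid Analysis and Display System) script</_comment>
    <sub-class-of type="text/plain"/>
    <glob pattern="*.gs"/>
  </mime-type>
  <mime-type type="text/x-ncl">
    <_comment>NCAR Command Language script</_comment>
    <sub-class-of type="text/plain"/>
    <glob pattern="*.ncl"/>
  </mime-type>
  <mime-type type="text/x-idl">
    <_comment>Interactive Data Language program</_comment>
    <sub-class-of type="text/plain"/>
    <glob pattern="*.pro"/>
  </mime-type>
</mime-info>
```

[With this file on the classpath, filename-based detection should report
the custom types; a dedicated parser along the lines of
SourceCodeParser would only be needed for language-specific metadata
such as the LoC count shown in the Java example above.]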
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
