RE: Detection problem: Parsing scientific source codes for geoscientists

2015-04-22 Thread Ken Krugler
I'm looking into whether detection & parsing code from a previous project could 
be open-sourced.

If that happened, we'd get support for many, many languages - though not GrADS 
or NCAR.

But the infrastructure would be there to easily add support for any missing 
languages.

-- Ken

 From: Oh, Ji-Hyun (329F-Affiliate)
 Sent: April 21, 2015 10:54:16am PDT
 To: dev@tika.apache.org
 Subject: Detection problem: Parsing scientific source codes for geoscientists
 
[SNIP]

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


RE: Detection problem: Parsing scientific source codes for geoscientists

2015-04-22 Thread Oh, Ji-Hyun (329F-Affiliate)
Hi Lewis,

Thank you for the help :)
I will try the fortran-parser on SourceForge to see how it works. 
But as Nick pointed out, perhaps I could also modify SourceCodeParser for our purposes? 

Ji-Hyun

From: Lewis John Mcgibbney [lewis.mcgibb...@gmail.com]
Sent: Tuesday, April 21, 2015 4:26 PM
To: dev@tika.apache.org
Subject: Re: Detection problem: Parsing scientific source codes for 
geoscientists

[SNIP]

RE: Detection problem: Parsing scientific source codes for geoscientists

2015-04-22 Thread Oh, Ji-Hyun (329F-Affiliate)
Hi Ken,
Thank you very much for your comment. 
Could you tell me what kind of previous project you are looking into? 

Ji-Hyun

From: Ken Krugler [kkrugler_li...@transpac.com]
Sent: Wednesday, April 22, 2015 7:38 AM
To: dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for 
geoscientists

I'm looking into whether detection & parsing code from a previous project could 
be open-sourced.

If that happened, we'd get support for many, many languages - though not GrADS 
or NCAR.

But the infrastructure would be there to easily add support for any missing 
languages.

-- Ken

 From: Oh, Ji-Hyun (329F-Affiliate)
 Sent: April 21, 2015 10:54:16am PDT
 To: dev@tika.apache.org
 Subject: Detection problem: Parsing scientific source codes for geoscientists

[SNIP]


RE: Detection problem: Parsing scientific source codes for geoscientists

2015-04-22 Thread Ken Krugler

 From: Oh, Ji-Hyun (329F-Affiliate)
 Sent: April 22, 2015 12:36:28pm PDT
 To: dev@tika.apache.org
 Subject: RE: Detection problem: Parsing scientific source codes for 
 geoscientists
 
 Hi Ken,
 Thank you very much for your comment. 
 Could you tell me what kind of previous project you are looking into? 

It's the Krugle code search product.

Being sold as enterprise software, but they might be willing to open source the 
parsing code.

-- Ken


 
[SNIP]

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







RE: Detection problem: Parsing scientific source codes for geoscientists

2015-04-22 Thread Oh, Ji-Hyun (329F-Affiliate)
Yes!
I've also explored their website and tried searching for source code 
(http://opensearch.krugle.org). Great!




From: Mattmann, Chris A (3980) [chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, April 22, 2015 1:18 PM
To: dev@tika.apache.org
Subject: Re: Detection problem: Parsing scientific source codes for 
geoscientists

Wow Ken that would be stellar. Ji-Hyun and I are doing this work
as part of the NSF EarthCube project, working with Yolanda Gil
at USC/ISI:

http://geosoft-earthcube.org/

Our part is Tika + Nutch + Solr over GitHub and geosciences software.
The purpose of Ji-Hyun’s postdoc is to work in that area so if Krugle
would be willing to do that, it would be awesomeness.

Cheers mate.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Ken Krugler kkrugler_li...@transpac.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Wednesday, April 22, 2015 at 3:58 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for
geoscientists


 From: Oh, Ji-Hyun (329F-Affiliate)
 Sent: April 22, 2015 12:36:28pm PDT
 To: dev@tika.apache.org
 Subject: RE: Detection problem: Parsing scientific source codes for
geoscientists

 Hi Ken,
 Thank you very much for your comment.
 Could you tell me what kind of previous project you are looking into?

It's the Krugle code search product.

Being sold as enterprise software, but they might be willing to open
source the parsing code.

-- Ken


 
[SNIP]

Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Oh, Ji-Hyun (329F-Affiliate)
Hi Tika friends,

I am currently engaged in a project funded by the National Science Foundation. Our 
goal is to develop a research-friendly environment where geoscientists, like 
me, can easily find the source code they need. According to a survey, scientists 
spend a considerable amount of their time processing data instead of doing 
actual science. Based on my experience as a climate scientist, a set of analysis 
tools is used most frequently in atmospheric science, so it could be helpful if 
these tools could be easily shared among scientists. The catch is that the tools 
are written in various scientific languages, so we are trying to provide 
metadata for source code stored in public repositories to help scientists 
select source code for their own use.

As a first step, I listed the file formats that are widely used in climate 
science.

FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I 
used Tika to obtain the content type of these files (with suffixes .f, .f90, .m), 
Tika detected them as text/plain:

ohjihyun% tika -m spctime.f

Content-Encoding: ISO-8859-1
Content-Length: 16613
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: spctime.f

ohjihyun% tika -m wavelet.m
Content-Encoding: ISO-8859-1
Content-Length: 5868
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: wavelet.m

I checked that Tika gives the correct content type (text/x-java-source) for a 
Java file:
ohjihyun% tika -m UrlParser.java
Content-Encoding: ISO-8859-1
Content-Length: 2178
Content-Type: text/x-java-source
LoC: 70
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
resourceName: UrlParser.java

Should I build a parser for each file format to get an exact content type, as 
Java has with SourceCodeParser?
Thank you in advance for your insightful comments.

Ji-Hyun


Re: Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Nick Burch

On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
As a first step, I listed the file formats that are widely used in 
climate science.


FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but 
when I used Tika to obtain the content type of these files (with suffixes .f, 
.f90, .m), Tika detected them as text/plain


Your first step then is probably to try to work out how to identify these 
files, and add suitable mime magic for them, if possible. At the same 
time, make sure the common file extensions for them are listed against 
their mime entries, and make sure we have mime entries for all of these 
formats.
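
For example (just a sketch - the text/x-ncl type name and the magic string are 
guesses, so adjust to taste), a custom mime entry for NCL, either merged into 
tika-mimetypes.xml or dropped into a custom-mimetypes.xml under 
org/apache/tika/mime on the classpath, could look something like:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <!-- Hypothetical entry for NCAR Command Language scripts -->
  <mime-type type="text/x-ncl">
    <_comment>NCL (NCAR Command Language) script</_comment>
    <glob pattern="*.ncl"/>
    <!-- Optional magic: many (not all) NCL scripts begin by loading the
         standard gsn scripts, so scan the first few hundred bytes for it -->
    <magic priority="50">
      <match value="load &quot;$NCARG_ROOT" type="string" offset="0:256"/>
    </magic>
    <sub-class-of type="text/plain"/>
  </mime-type>
</mime-info>

The glob gives you extension-based detection straight away; the magic only 
matters when the extension is missing or shared with another format (.gs, .m 
and .pro all clash with other languages, so those will likely need magic or 
careful priorities).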


I'd probably recommend creating one JIRA per format with detection issues, 
then use that to track the work to add/expand the mime type, attach a 
small sample file, add detection unit tests etc.
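
A quick way to sanity-check the entries (or the core of a detection unit test) 
is the Tika facade - once a custom mime entry is on the classpath, detection 
should report the specific type instead of text/plain. Just a sketch; the .ncl 
file name below is a placeholder, the others are from your examples:

import java.io.File;

import org.apache.tika.Tika;

public class DetectionCheck {
    public static void main(String[] args) throws Exception {
        // Uses the default TikaConfig, which also loads custom-mimetypes.xml
        Tika tika = new Tika();
        for (String name : new String[] {"spctime.f", "wavelet.m", "plot.ncl"}) {
            // Detection combines the file name (glob) with the leading bytes (magic)
            System.out.println(name + " -> " + tika.detect(new File(name)));
        }
    }
}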


Should I build a parser for each file format to get an exact 
content-type, as Java has SourceCodeParser?


As Lewis has said, once detection is working, you'll then want to add the 
missing parsers. You might find that the current SourceCodeParser could, 
with a little bit of work, handle some of these formats itself. Additional 
libraries+parsers may well be needed for the others. I'd suggest one JIRA 
per format you want a parser for that we lack, then use those to track the 
work.
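
For reference, here's roughly the programmatic equivalent of your tika -m runs - 
AutoDetectParser routes the stream to whichever parser is registered for the 
detected type (TXTParser today for the .f/.m files, a new or extended 
source-code parser once one is wired up). Just a sketch, not tuned code:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataDump {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // The resource name gives the detector the same hint the CLI gets
        metadata.set(Metadata.RESOURCE_NAME_KEY, "spctime.f");
        try (InputStream stream = new FileInputStream("spctime.f")) {
            // -1 removes the default limit on how much text is buffered
            parser.parse(stream, new BodyContentHandler(-1), metadata);
        }
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}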


Good luck!

Nick


Re: Detection problem: Parsing scientific source codes for geoscientists

2015-04-21 Thread Lewis John Mcgibbney
Hi Ji-Hyun,

On Tue, Apr 21, 2015 at 4:15 PM, dev-digest-h...@tika.apache.org wrote:


 FORTRAN (.f, .f90, .f77)
 Python (.py)
 R (.R)
 Matlab (.m)
 GrADS (Grid Analysis and Display System) (.gs)
 NCL (NCAR Command Language) (.ncl)
 IDL (Interactive Data Language) (.pro)


NICE list



 I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when
 I used Tika to obtain the content type of these files (with suffixes .f, .f90, .m),
 Tika detected them as text/plain:

 ohjihyun% tika -m spctime.f

 Content-Encoding: ISO-8859-1
 Content-Length: 16613
 Content-Type: text/plain; charset=ISO-8859-1
 X-Parsed-By: org.apache.tika.parser.DefaultParser
 X-Parsed-By: org.apache.tika.parser.txt.TXTParser
 resourceName: spctime.f


[SNIP]


 Should I build a parser for each file format to get an exact content-type,
 as Java has SourceCodeParser?


As far as I know we have no parser for Fortran documents.
You could try using the following Java project:
http://sourceforge.net/projects/fortran-parser/
It is dual-licensed under the Eclipse and BSD licenses.
Hope this helps.
Lewis