RE: Detection problem: Parsing scientific source codes for geoscientists
Hi Lewis,

Thank you for the help :) I will try the fortran-parser on SourceForge to see how it works. But as Nick pointed out, I might also modify SourceCodeParser for our purpose?

Ji-Hyun

From: Lewis John Mcgibbney [lewis.mcgibb...@gmail.com]
Sent: Tuesday, April 21, 2015 4:26 PM
To: dev@tika.apache.org
Subject: Re: Detection problem: Parsing scientific source codes for geoscientists

Hi Ji-Hyun,

On Tue, Apr 21, 2015 at 4:15 PM, dev-digest-h...@tika.apache.org wrote:

> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)

NICE list

> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain the content type of the files (with suffix .f, .f90, .m), Tika detected these files as text/plain:
>
> ohjihyun% tika -m spctime.f
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f
>
> [SNIP]
>
> Should I build a parser for each file format to get an exact content-type, as Java has SourceCodeParser?

As far as I know we have no parser for Fortran documents. You could try using the following Java project:
http://sourceforge.net/projects/fortran-parser/
It is dual licensed under Eclipse and BSD licenses.

Hope this helps.
Lewis
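The `tika -m` output quoted above is a list of `Key: value` pairs, and the sign that detection fell back to the generic text parser is the pair `Content-Type: text/plain` together with `TXTParser` under `X-Parsed-By`. A minimal Python sketch for checking this over many files, assuming the tool prints one pair per line as it does on a terminal; `parse_tika_metadata` and `fell_back_to_plain_text` are hypothetical helper names, not part of Tika:

```python
from collections import defaultdict


def parse_tika_metadata(text):
    """Parse 'Key: value' lines into a dict mapping each key to a list of values.

    Keys such as X-Parsed-By can repeat, so every value is kept in order.
    """
    metadata = defaultdict(list)
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that do not look like 'Key: value'
            metadata[key.strip()].append(value.strip())
    return dict(metadata)


def fell_back_to_plain_text(metadata):
    """Return True when the reported media type is the generic text/plain."""
    content_type = metadata.get("Content-Type", [""])[0]
    # Drop parameters such as '; charset=ISO-8859-1' before comparing.
    return content_type.split(";")[0].strip() == "text/plain"


# Sample copied from the spctime.f output in the message above.
SAMPLE = """\
Content-Encoding: ISO-8859-1
Content-Length: 16613
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: spctime.f
"""

if __name__ == "__main__":
    meta = parse_tika_metadata(SAMPLE)
    print(meta["resourceName"][0], fell_back_to_plain_text(meta))
```

Keeping repeated keys as lists preserves the full parser chain, which is what distinguishes the TXTParser fallback from the SourceCodeParser case.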
RE: Detection problem: Parsing scientific source codes for geoscientists
Hi Ken,

Thank you very much for your comment. Could you tell me which previous project you are looking into?

Ji-Hyun

From: Ken Krugler [kkrugler_li...@transpac.com]
Sent: Wednesday, April 22, 2015 7:38 AM
To: dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for geoscientists

I'm looking into whether detection parsing code from a previous project could be open-sourced. If that happened, we'd get support for many, many languages - though not GrADS or NCAR. But the infrastructure would be there to easily add support for any missing languages.

-- Ken

From: Oh, Ji-Hyun (329F-Affiliate)
Sent: April 21, 2015 10:54:16am PDT
To: dev@tika.apache.org
Subject: Detection problem: Parsing scientific source codes for geoscientists

[SNIP]

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
RE: Detection problem: Parsing scientific source codes for geoscientists
Yes! I've also explored their website and tried searching for source code (http://opensearch.krugle.org).. great!

From: Mattmann, Chris A (3980) [chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, April 22, 2015 1:18 PM
To: dev@tika.apache.org
Subject: Re: Detection problem: Parsing scientific source codes for geoscientists

Wow Ken that would be stellar. Ji-Hyun and I are doing this work as part of the NSF EarthCube project, working with Yolanda Gil at USC/ISI:
http://geosoft-earthcube.org/
Our part is Tika + Nutch + Solr over GitHub and geosciences software. The purpose of Ji-Hyun’s postdoc is to work in that area so if Krugle would be willing to do that, it would be awesomeness.

Cheers mate.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Ken Krugler <kkrugler_li...@transpac.com>
Reply-To: dev@tika.apache.org
Date: Wednesday, April 22, 2015 at 3:58 PM
To: dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for geoscientists

From: Oh, Ji-Hyun (329F-Affiliate)
Sent: April 22, 2015 12:36:28pm PDT
To: dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for geoscientists

> Hi Ken, Thank you very much for your comment. Could you inform me what kind of previous project are you looking into?

It's the Krugle code search product. It is being sold as enterprise software, but they might be willing to open source the parsing code.

-- Ken

[SNIP]
Detection problem: Parsing scientific source codes for geoscientists
Hi Tika friends,

I am currently engaged in a project funded by the National Science Foundation. Our goal is to develop a research-friendly environment where geoscientists, like me, can easily find the source code they need. According to a survey, scientists spend a considerable amount of their time processing data instead of doing actual science. In my experience as a climate scientist, there is a set of analysis tools that are used most frequently in atmospheric science, so it would be helpful if these tools could be easily shared among scientists. The tools are written in various scientific languages, so we are trying to provide metadata for source code stored in public repositories to help scientists select source code for their own use.

As a first step, I listed the file formats that are widely used in climate science:

FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain the content type of the files (with suffix .f, .f90, .m), Tika detected these files as text/plain:

ohjihyun% tika -m spctime.f
Content-Encoding: ISO-8859-1
Content-Length: 16613
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: spctime.f

ohjihyun% tika -m wavelet.m
Content-Encoding: ISO-8859-1
Content-Length: 5868
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: wavelet.m

I checked that Tika gives the correct content type (text/x-java-source) for a Java file:

ohjihyun% tika -m UrlParser.java
Content-Encoding: ISO-8859-1
Content-Length: 2178
Content-Type: text/x-java-source
LoC: 70
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
resourceName: UrlParser.java

Should I build a parser for each file format to get an exact content-type, as Java has SourceCodeParser?

Thank you in advance for your insightful comments.

Ji-Hyun
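One possible stopgap for the question above is to assign a more specific type from the file extension alone, before any language-specific parser exists. The sketch below is standalone Python and does not use Tika's API. `EXTENSION_TO_TYPE` and `guess_source_type` are hypothetical names, and the media type strings are assumptions for illustration: the ones for GrADS, NCL, and IDL in particular are invented placeholders, not registered types or Tika's own definitions.

```python
from pathlib import PurePath

# Extension -> media type for the languages listed in the message above.
# ASSUMPTION: all type names here are illustrative; the GrADS, NCL, and IDL
# entries are invented placeholders, not registered media types.
EXTENSION_TO_TYPE = {
    ".f": "text/x-fortran",
    ".f90": "text/x-fortran",
    ".f77": "text/x-fortran",
    ".py": "text/x-python",
    ".r": "text/x-rsrc",
    ".m": "text/x-matlab",
    ".gs": "text/x-grads",
    ".ncl": "text/x-ncl",
    ".pro": "text/x-idl",
}


def guess_source_type(filename, default="text/plain"):
    """Guess a media type from the file extension, ignoring case.

    Falls back to `default` when the extension is not in the table.
    """
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix, default)


if __name__ == "__main__":
    for name in ("spctime.f", "wavelet.m", "UrlParser.java"):
        print(name, guess_source_type(name))
```

An extension table is only a first pass. Several of these suffixes are shared with other languages (.m is also used for Objective-C, .pro for Qt project files), so a dependable detector would also need to inspect the file content.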
Hello!
Dear all,

My name is Ji-Hyun Oh. I am a postdoc working with Dr. Chris Mattmann. I am trying to get familiar with Tika so that I can use it to capture geoscience information. Although I am currently taking baby steps with Tika, I hope to be able to contribute to it in the near future.

Thanks!
Ji-Hyun