RE: Detection problem: Parsing scientific source codes for geoscientists
I'm looking into whether detection/parsing code from a previous project could be open-sourced. If that happened, we'd get support for many, many languages - though not GrADS or NCL. But the infrastructure would be there to easily add support for any missing languages.

-- Ken

From: Oh, Ji-Hyun (329F-Affiliate)
Sent: April 21, 2015 10:54:16am PDT
To: dev@tika.apache.org
Subject: Detection problem: Parsing scientific source codes for geoscientists

[SNIP]

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions training
Hadoop, Cascading, Cassandra Solr
RE: Detection problem: Parsing scientific source codes for geoscientists
Hi Lewis,

Thank you for the help :) I will try the fortran-parser project on SourceForge to see how it works. But as Nick pointed out, perhaps I could also modify SourceCodeParser for our purposes?

Ji-Hyun

From: Lewis John Mcgibbney [lewis.mcgibb...@gmail.com]
Sent: Tuesday, April 21, 2015 4:26 PM
To: dev@tika.apache.org
Subject: Re: Detection problem: Parsing scientific source codes for geoscientists

[SNIP]
RE: Detection problem: Parsing scientific source codes for geoscientists
Hi Ken,

Thank you very much for your comment. Could you tell me which previous project you are looking into?

Ji-Hyun

From: Ken Krugler [kkrugler_li...@transpac.com]
Sent: Wednesday, April 22, 2015 7:38 AM
To: dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for geoscientists

[SNIP]
RE: Detection problem: Parsing scientific source codes for geoscientists
From: Oh, Ji-Hyun (329F-Affiliate)
Sent: April 22, 2015 12:36:28pm PDT
To: dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for geoscientists

> Hi Ken,
> Thank you very much for your comment. Could you inform me what kind of previous project are you looking into?

It's the Krugle code search product. Being sold as enterprise software, but they might be willing to open source the parsing code.

-- Ken

[SNIP]
RE: Detection problem: Parsing scientific source codes for geoscientists
Yes! I've also explored their website and tried searching for source code (http://opensearch.krugle.org). Great!

From: Mattmann, Chris A (3980) [chris.a.mattm...@jpl.nasa.gov]
Sent: Wednesday, April 22, 2015 1:18 PM
To: dev@tika.apache.org
Subject: Re: Detection problem: Parsing scientific source codes for geoscientists

Wow Ken that would be stellar. Ji-Hyun and I are doing this work as part of the NSF EarthCube project, working with Yolanda Gil at USC/ISI:

http://geosoft-earthcube.org/

Our part is Tika + Nutch + Solr over GitHub and geosciences software. The purpose of Ji-Hyun’s postdoc is to work in that area so if Krugle would be willing to do that, it would be awesomeness.

Cheers mate.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Ken Krugler kkrugler_li...@transpac.com
Reply-To: dev@tika.apache.org
Date: Wednesday, April 22, 2015 at 3:58 PM
To: dev@tika.apache.org
Subject: RE: Detection problem: Parsing scientific source codes for geoscientists

[SNIP]
Detection problem: Parsing scientific source codes for geoscientists
Hi Tika friends,

I am currently engaged in a project funded by the National Science Foundation. Our goal is to develop a research-friendly environment where geoscientists, like me, can easily find the source code they need. According to a survey, scientists spend a considerable amount of their time processing data instead of doing actual science. Based on my experience as a climate scientist, there is a set of analysis tools that are used most frequently in atmospheric science, so it could be helpful if these tools could be easily shared among scientists. The thing is that the tools are written in various scientific languages, so we are trying to provide metadata for source code stored in public repositories to help scientists select source code for their own use.

As a first step, I listed the file formats that are widely used in climate science:

FORTRAN (.f, .f90, .f77)
Python (.py)
R (.R)
Matlab (.m)
GrADS (Grid Analysis and Display System) (.gs)
NCL (NCAR Command Language) (.ncl)
IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain the content type of these files (with suffixes .f, .f90, .m), Tika detected them as text/plain:

ohjihyun% tika -m spctime.f
Content-Encoding: ISO-8859-1
Content-Length: 16613
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: spctime.f

ohjihyun% tika -m wavelet.m
Content-Encoding: ISO-8859-1
Content-Length: 5868
Content-Type: text/plain; charset=ISO-8859-1
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.txt.TXTParser
resourceName: wavelet.m

I checked that Tika gives the correct content type (text/x-java-source) for a Java file:

ohjihyun% tika -m UrlParser.java
Content-Encoding: ISO-8859-1
Content-Length: 2178
Content-Type: text/x-java-source
LoC: 70
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
resourceName: UrlParser.java

Should I build a parser for each file format to get an exact content type, as Java has SourceCodeParser?

Thank you in advance for your insightful comments.

Ji-Hyun
Re: Detection problem: Parsing scientific source codes for geoscientists
On Tue, 21 Apr 2015, Oh, Ji-Hyun (329F-Affiliate) wrote:
> For the first step, I listed up the file formats that widely used in climate science.
>
> FORTRAN (.f, .f90, f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)
>
> I checked Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain content type of the files (with suffix .f, f90, .m), but Tika detected these files as text/plain

Your first step then is probably to try to work out how to identify these files, and add suitable mime magic for them, if possible. At the same time, make sure the common file extensions for them are listed against their mime entries, and make sure we have mime entries for all of these formats.

I'd probably recommend creating one JIRA per format with detection issues, then use that to track the work to add/expand the mime type, attach a small sample file, add detection unit tests etc.

> Should I build a parser for each file format to get an exact content-type, as Java has SourceCodeParser?

As Lewis has said, once detection is working, you'll then want to add the missing parsers. You might find that the current SourceCodeParser could, with a little bit of work, handle some of these formats itself. Additional libraries+parsers may well be needed for the others. I'd suggest one JIRA per format you want a parser for that we lack, then use those to track the work.

Good luck!

Nick
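
The two detection steps Nick describes (extensions listed against mime entries, plus content-based "magic") could be sketched roughly as below. This is a standalone illustration and not Tika code: the class name, the MIME type strings and the Objective-C content hints are all assumptions made for the example. In Tika itself such rules would live as glob and magic entries in tika-mimetypes.xml rather than in Java.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, NOT part of Tika. All MIME type strings below are
// illustrative assumptions, not a statement of what Tika registers.
final class SourceTypeSketch {

    // Stage 1: extension table, one entry per file suffix from the thread.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put(".f", "text/x-fortran");
        BY_EXTENSION.put(".f77", "text/x-fortran");
        BY_EXTENSION.put(".f90", "text/x-fortran");
        BY_EXTENSION.put(".py", "text/x-python");
        BY_EXTENSION.put(".r", "text/x-rsrc");
        BY_EXTENSION.put(".m", "text/x-matlab");
        BY_EXTENSION.put(".gs", "text/x-grads");
        BY_EXTENSION.put(".ncl", "text/x-ncl");
        BY_EXTENSION.put(".pro", "text/x-idl");
    }

    private SourceTypeSketch() {
    }

    // Returns a guessed content type, falling back to text/plain.
    static String detect(String fileName, String content) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        String ext = dot >= 0 ? lower.substring(dot) : "";

        // Stage 2: content hints for an ambiguous extension. The .m suffix
        // is shared by MATLAB and Objective-C, so the name alone is not enough.
        if (".m".equals(ext) && looksLikeObjectiveC(content)) {
            return "text/x-objcsrc";
        }

        String type = BY_EXTENSION.get(ext);
        return type != null ? type : "text/plain";
    }

    // Crude stand-in for mime magic: look for Objective-C-only tokens.
    private static boolean looksLikeObjectiveC(String content) {
        return content.contains("@interface")
                || content.contains("@implementation")
                || content.contains("#import");
    }
}
```

The same ambiguity applies to .pro, which is also used by other tools (Qt project files, for example), so a per-format sample file and a detection unit test, as suggested above, would be the way to check that each rule behaves as intended.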
Re: Detection problem: Parsing scientific source codes for geoscientists
Hi Ji-Hyun,

On Tue, Apr 21, 2015 at 4:15 PM, dev-digest-h...@tika.apache.org wrote:
> FORTRAN (.f, .f90, f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)

NICE list

> I checked Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain content type of the files (with suffix .f, f90, .m), but Tika detected these files as text/plain:
>
> ohjihyun% tika -m spctime.f
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f

[SNIP]

> Should I build a parser for each file format to get an exact content-type, as Java has SourceCodeParser?

As far as I know we have no parser for Fortran documents. You could try using the following Java project:

http://sourceforge.net/projects/fortran-parser/

It is dual licensed under Eclipse and BSD licenses.

Hope this helps.

Lewis
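
Short of a full Fortran parser, simple metadata such as the LoC value shown in the Java output earlier in the thread can be approximated with a line classifier. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, it is not how SourceCodeParser or the fortran-parser project work, and it ignores details such as continuation lines and trailing comments.

```java
import java.util.List;

// Hypothetical sketch, NOT Tika's SourceCodeParser: counts code, comment
// and blank lines in Fortran source using simple per-line rules.
final class FortranLineStats {
    final int code;
    final int comment;
    final int blank;

    private FortranLineStats(int code, int comment, int blank) {
        this.code = code;
        this.comment = comment;
        this.blank = blank;
    }

    // fixedForm = true for classic column-based source (often .f / .f77),
    // false for free-form source (often .f90). Which applies is an
    // assumption the caller has to make.
    static FortranLineStats count(List<String> lines, boolean fixedForm) {
        int code = 0;
        int comment = 0;
        int blank = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                blank++;
            } else if (isComment(line, fixedForm)) {
                comment++;
            } else {
                code++;
            }
        }
        return new FortranLineStats(code, comment, blank);
    }

    private static boolean isComment(String line, boolean fixedForm) {
        if (fixedForm) {
            // Fixed form: C, c, * or ! in column 1 marks a comment line.
            char first = line.charAt(0);
            return first == 'C' || first == 'c' || first == '*' || first == '!';
        }
        // Free form: a line whose first non-blank character is ! is a comment.
        return line.trim().startsWith("!");
    }
}
```

A line that mixes code with a trailing `!` comment is counted as code here, which seems a reasonable convention for a lines-of-code figure but is a choice of this sketch rather than anything stated in the thread.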