Hi Tika friends,

I am currently working on a project funded by the National Science Foundation. Our goal is to develop a research-friendly environment where geoscientists, like me, can easily find the source code they need. According to a survey, scientists spend a considerable amount of their time processing data instead of doing actual science. In my experience as a climate scientist, a fairly small set of analysis tools is used over and over in atmospheric science, so it would help if these tools could be shared easily among scientists. The difficulty is that the tools are written in a variety of scientific languages. We are therefore trying to provide metadata for source code stored in public repositories, to help scientists select code for their own use.
As a first step, I listed the file formats that are widely used in climate science:

  - FORTRAN (.f, .f90, .f77)
  - Python (.py)
  - R (.R)
  - Matlab (.m)
  - GrADS (Grid Analysis and Display System) (.gs)
  - NCL (NCAR Command Language) (.ncl)
  - IDL (Interactive Data Language) (.pro)

I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when I used Tika to obtain the content type of files with the suffixes .f, .f90, and .m, Tika detected them as text/plain:

    ohjihyun% tika -m spctime.f
    Content-Encoding: ISO-8859-1
    Content-Length: 16613
    Content-Type: text/plain; charset=ISO-8859-1
    X-Parsed-By: org.apache.tika.parser.DefaultParser
    X-Parsed-By: org.apache.tika.parser.txt.TXTParser
    resourceName: spctime.f

    ohjihyun% tika -m wavelet.m
    Content-Encoding: ISO-8859-1
    Content-Length: 5868
    Content-Type: text/plain; charset=ISO-8859-1
    X-Parsed-By: org.apache.tika.parser.DefaultParser
    X-Parsed-By: org.apache.tika.parser.txt.TXTParser
    resourceName: wavelet.m

I also checked that Tika gives the correct content type (text/x-java-source) for a Java file:

    ohjihyun% tika -m UrlParser.java
    Content-Encoding: ISO-8859-1
    Content-Length: 2178
    Content-Type: text/x-java-source
    LoC: 70
    X-Parsed-By: org.apache.tika.parser.DefaultParser
    X-Parsed-By: org.apache.tika.parser.code.SourceCodeParser
    resourceName: UrlParser.java

Should I build a parser for each file format to get an exact content type, as Java has SourceCodeParser?

Thank you in advance for your insightful comments.

Ji-Hyun
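
P.S. To make the goal concrete, here is a minimal sketch in plain Python of the kind of classification I am hoping to get for the extensions listed in this message. It is only an illustration: it does not use Tika, it is not how Tika detects types, and the names LANGUAGE_BY_EXTENSION and guess_language are hypothetical. It also ignores ambiguous extensions (for example, .m is used by languages other than Matlab).

```python
from pathlib import Path

# Extension-to-language table taken from the list in this message.
# Hypothetical names for illustration only; this is not Tika's mechanism.
LANGUAGE_BY_EXTENSION = {
    ".f": "Fortran",
    ".f77": "Fortran",
    ".f90": "Fortran",
    ".py": "Python",
    ".r": "R",
    ".m": "Matlab",
    ".gs": "GrADS",
    ".ncl": "NCL",
    ".pro": "IDL",
}


def guess_language(filename):
    """Return the language name for a file name, or None if it is not in the table."""
    # Compare in lower case so that both "plot.R" and "plot.r" match.
    suffix = Path(filename).suffix.lower()
    return LANGUAGE_BY_EXTENSION.get(suffix)


if __name__ == "__main__":
    for name in ["spctime.f", "wavelet.m", "UrlParser.java"]:
        print(name, "->", guess_language(name))
```

A lookup by suffix alone cannot tell a Matlab .m file from other uses of the same extension, which is part of why I am asking whether a per-language parser is the right approach.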