Hi Lewis,

Thank you for the help :) I will try the fortran-parser project on SourceForge to see how it works. But as Nick pointed out, could I also modify SourceCodeParser for our purpose?
Ji-Hyun
________________________________________
From: Lewis John Mcgibbney [lewis.mcgibb...@gmail.com]
Sent: Tuesday, April 21, 2015 4:26 PM
To: dev@tika.apache.org
Subject: Re: Detection problem: Parsing scientific source codes for geoscientists

Hi Ji-Hyun,

On Tue, Apr 21, 2015 at 4:15 PM, <dev-digest-h...@tika.apache.org> wrote:

> FORTRAN (.f, .f90, .f77)
> Python (.py)
> R (.R)
> Matlab (.m)
> GrADS (Grid Analysis and Display System) (.gs)
> NCL (NCAR Command Language) (.ncl)
> IDL (Interactive Data Language) (.pro)

NICE list

> I checked that Fortran and Matlab are included in tika-mimetypes.xml, but when
> I used Tika to obtain the content type of the files (with suffixes .f, .f90, .m),
> Tika detected these files as text/plain:
>
> ohjihyun% tika -m spctime.f
>
> Content-Encoding: ISO-8859-1
> Content-Length: 16613
> Content-Type: text/plain; charset=ISO-8859-1
> X-Parsed-By: org.apache.tika.parser.DefaultParser
> X-Parsed-By: org.apache.tika.parser.txt.TXTParser
> resourceName: spctime.f
>
> [SNIP]
> Should I build a parser for each file format to get an exact content-type,
> as Java has SourceCodeParser?

As far as I know we have no parser for Fortran documents. You could try using the following Java project:
http://sourceforge.net/projects/fortran-parser/
It is dual licensed under Eclipse and BSD licenses.

Hope this helps.
Lewis
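
To make the question about content types concrete, here is a minimal sketch of a name-based lookup for the file suffixes listed in the thread. It is a hand-rolled illustration in plain Java and is not Tika's detector or API. The class name `SourceTypeGuess`, the method `guess`, and every media type string in the table are assumptions made up for this example and may not match the entries in tika-mimetypes.xml.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a hypothetical suffix-to-type table, not part of Tika.
class SourceTypeGuess {
    // Placeholder media type names; treat every value here as an assumption.
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("f", "text/x-fortran");
        TYPES.put("f77", "text/x-fortran");
        TYPES.put("f90", "text/x-fortran");
        TYPES.put("py", "text/x-python");
        TYPES.put("r", "text/x-rsrc");
        TYPES.put("m", "text/x-matlab");   // .m is also used by Objective-C sources
        TYPES.put("gs", "text/x-grads");
        TYPES.put("ncl", "text/x-ncl");
        TYPES.put("pro", "text/x-idl");
    }

    // Returns the table entry for the file suffix, or text/plain as a fallback.
    static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "text/plain"; // no suffix to look up
        }
        // Lower-case the suffix so that ".R" and ".r" map to the same entry.
        String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(suffix, "text/plain");
    }

    public static void main(String[] args) {
        System.out.println(SourceTypeGuess.guess("spctime.f"));
    }
}
```

A suffix table like this only labels a file. It does not parse the source, so it does not replace a dedicated parser such as the fortran-parser project mentioned above.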