Hey Steve!

Nice to see you show up on the list :-)

Yep, I totally agree. I have a couple of useful additions I'm going to create issues for and contribute back to Tika:

1. A MinODL parser, for ODL files themselves and also used in 2 below;
2. ParseContext properties identifying:
   - groups that are in fact ODL values, that need to be parsed with the MinODL parser (useful for NetCDF and for HDF)
   - what groups to select out (e.g., in HDF, by Path /Group1/SubGroup1/Property, and in NetCDF just by name)

I think the combination of those will help the HDF and NetCDF parsers to become more robust and configurable.

Also, GDAL is high on my priority list. I've already built the Java bindings, but am working through some trickery with GDAL since it doesn't like the fact that Tika isn't file based, and when we use TikaInputStream, it creates a file of arbitrary extension (which ticks off GDAL as it's looking for something specific). I have a work-around in the works, though...

Cheers,
Chris

On May 26, 2011, at 4:20 AM, Steve Aulenbach wrote:

> Hi Chris,
>
> I think your plan to improve the netCDF and HDF parsing is a great one. The
> richness of a full ncdump of netCDF metadata and a full ncdump of HDF-EOS
> metadata would be an excellent addition to the 1.0 release of Tika. I have
> discussed Tika with several science data users and they usually ask about
> netCDF and HDF-EOS metadata capabilities. A GDAL parser is also a great
> idea.
>
> Thanks,
> Steve
>
> On Fri, May 20, 2011 at 12:22 PM, Mattmann, Chris A (388J) <
> [email protected]> wrote:
>
>> Hey Jukka et al.,
>>
>>> It's a few months since 0.9 and our Tika in Action book is soon ready
>>> for print, so I think it's a good time to start planning for the 1.0
>>> release.
>>
>> Looking forward to not writing anything for a while :-) I doubt it'll
>> happen knowing how things go, but also really really happy with where the
>> book is (and banging on those last revisions! :-) ).
>>
>>>
>>> There are a few odds and ends that I'd still like to sort out in the
>>> trunk, but overall I think we're pretty much ready for the switch
>>> from 0.x to 1.x.
>>
>> +1.
>>
>>>
>>> One major issue to be decided is whether we want to follow up with the
>>> earlier intention of dropping deprecated functionality (like the
>>> three-argument parse() method) before the 1.0 release.
>>
>> +1, I'd be fine with this. I'm a fan of following through on things that we
>> say we're going to do if for no other good reason than we said we're going
>> to do it.
>>
>> +1 to dropping the 3 arg parse method.
>>
>>> I think we
>>> should do that and also make some other backwards-incompatible
>>> cleanups while we're at it. That way we'll have less old baggage to
>>> carry as we evolve through the 1.x release cycle.
>>
>> +1, my biggest thing to work on is improving the NetCDF and HDF parsing,
>> adding an ODL parser (I'll create an issue for this), adding some spatial
>> parsers (working on the GDAL one right now), and maybe some documentation on
>> how to use the science data file formats. I should have time over the next
>> month or so to complete these.
>>
>>>
>>> Another thing to think about is whether we want to do a formal Apache
>>> press release about Tika reaching 1.0 status.
>>
>> +1. I'd be happy to work with Jukka, as Nick suggested, to draft this, and
>> then from there to work with Sally to make it happen.
>>
>> Thanks!
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
