Hello all

ApacheCon last week gave me an opportunity to learn more about Apache
Tika. If I understand correctly, Tika and SIS are similar in one
respect: both are designed to read various file formats, extract
metadata from them, and map that format-specific metadata to a common
model. Some differences are:

  * Tika's scope is wider than that of SIS (e.g. it includes PDF
    documents).
  * Tika maps the metadata to Dublin Core, while SIS maps metadata to
    ISO 19115.
  * Reading the metadata is only the first step for SIS; the remaining
    steps are to read geometries and raster data (maybe Tika reads them
    too, I do not know).


If we want to leverage each other's work, one question is whether
SIS should use Tika, or Tika should use SIS. It seems to me that SIS
sits at a lower level than Tika, and consequently the relationship
could be "Tika uses SIS (possibly in an optional module)". The main
reason is that some metadata read by SIS are quite specialized. For
example, there is no single Coordinate Reference System (CRS) attribute
in a NetCDF file; the CRS is instead inferred from the analysis of
numerous variables and attributes. The same holds for the GeoTIFF
format, which has another set of attributes totally different from the
NetCDF ones. If we read a "World File", the reader needs Well Known
Text (WKT) parsing capability. If we read a "JPEG 2000" file, the
reader needs Geography Markup Language (GML) parsing capability.
Consequently, if SIS asked Tika "please give me the geospatial metadata
of this file", how could Tika provide the CRS? I see no easy way for
software to answer this question (even with a plain String) for various
file formats without the help of a geospatial library: SIS, GDAL or
another.

Similar reasoning applies to other aspects, for example searching for
metadata that intersect a geographic bounding box. This task becomes
more complicated if we take into account map projections and bounding
boxes that span the anti-meridian.
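
To illustrate the anti-meridian point only, here is a minimal sketch in
Python. It is not SIS or Tika code: the GeoBox class and the intersects
function are hypothetical names, and the sketch assumes boxes expressed
in degrees of longitude and latitude, with longitudes in [-180, 180] and
the convention that west > east means the box crosses the anti-meridian.
Map projections are not handled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoBox:
    """Hypothetical geographic bounding box, in degrees.

    Assumption: west > east means the box spans the anti-meridian.
    """
    west: float
    east: float
    south: float
    north: float

    def lon_ranges(self):
        """Return the longitude extent as one or two ordinary ranges."""
        if self.west <= self.east:
            return [(self.west, self.east)]
        # Box crosses the anti-meridian: split it at +/-180 degrees.
        return [(self.west, 180.0), (-180.0, self.east)]


def intersects(a: GeoBox, b: GeoBox) -> bool:
    """Tell whether two boxes overlap, anti-meridian crossings included."""
    # Latitudes never wrap around, so a plain interval test is enough.
    if a.north < b.south or b.north < a.south:
        return False
    # Longitudes: compare every piece of one box with every piece of the other.
    return any(
        w1 <= e2 and w2 <= e1
        for (w1, e1) in a.lon_ranges()
        for (w2, e2) in b.lon_ranges()
    )


# A box from 170°E to 170°W (crossing the anti-meridian) and a box lying
# entirely in the western hemisphere, close to the anti-meridian.
crossing = GeoBox(west=170.0, east=-170.0, south=-25.0, north=-10.0)
western = GeoBox(west=-175.0, east=-171.0, south=-15.0, north=-12.0)
print(intersects(crossing, western))
```

A naive test such as "a.west <= b.east and b.west <= a.east" would
report no intersection for these two boxes, since 170 > -171. This is
the kind of detail I would expect a geospatial library to handle.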

    Martin
