On 2011-08-22 20:37, Tom Grant wrote:
Here's the use case that I'm attempting to solve.  I have a customer with
many legacy systems, some of which are completely custom.  These systems
have data files that will never be seen outside of their environment.  For
example, some are XML files with their own schemas.  Some are similar to the
new office documents and are zip files containing XML and other goodies.
Others are serialized objects dumped to disk.  Some are similar to EDI with
a header and data body with prescribed offsets. The choices of the past
can't be undone and I'm stuck with about 30 or 40 different file types.  I
want to use Tika as the standard API to exploit those old formats.  The
customer's developers know the internals of the formats, I just need to give
them an API to map them to instead of developing stovepipes to load each
format.  The quantity of file types means that it's going to take a few
months to complete and will happen a few at a time.  So I'd like to
co-locate the mimetype definition with the parser code for maintainability.

FWIW, my use case is exactly the same: old XML formats, internal to a given organization, with custom parsers for them. The plain, generic XML parser is insufficient (it produces too much garbage and no metadata). We use a sort of DSL to define the XML->RDF mapping. A single mapping file describes both the transformation (for our transformer) and the detection rules (for Tika's MimeTypes).
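
To make the "one definition, two uses" idea concrete, here is a minimal sketch in plain Java. It is not the DSL mentioned above and it does not use the Tika API: the class LegacyFormats, the FormatMapping descriptor, the MIME type string and the property names are all hypothetical, and detection is assumed to be based only on the XML root element. In a real setup the detection half would presumably be handed to Tika's MimeTypes and the extraction half would live in a Parser implementation.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustrative only: keeps detection rule and field mapping in one descriptor. */
final class LegacyFormats {

    /** Hypothetical descriptor: one object per legacy XML format. */
    static final class FormatMapping {
        final String mimeType;                        // type to report on a match
        final String rootElement;                     // detection rule (assumed: root element name)
        final Map<String, String> elementToProperty;  // transformation: element name -> property name

        FormatMapping(String mimeType, String rootElement, Map<String, String> elementToProperty) {
            this.mimeType = mimeType;
            this.rootElement = rootElement;
            this.elementToProperty = elementToProperty;
        }
    }

    private static XMLStreamReader open(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // do not process DTDs
        return factory.createXMLStreamReader(new StringReader(xml));
    }

    /** Returns the MIME type of the first mapping whose root element matches, or null. */
    static String detect(String xml, List<FormatMapping> mappings) {
        try {
            XMLStreamReader reader = open(xml);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String root = reader.getLocalName();
                    for (FormatMapping m : mappings) {
                        if (m.rootElement.equals(root)) {
                            return m.mimeType;
                        }
                    }
                    return null; // only the root element is inspected
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Collects the text of mapped elements; assumes mapped elements contain text only. */
    static Map<String, String> extract(String xml, FormatMapping mapping) {
        try {
            Map<String, String> out = new LinkedHashMap<>();
            XMLStreamReader reader = open(xml);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String property = mapping.elementToProperty.get(reader.getLocalName());
                    if (property != null) {
                        out.put(property, reader.getElementText().trim());
                    }
                }
            }
            return out;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

Because both methods read from the same FormatMapping, adding a new legacy format means adding one descriptor rather than touching a detector and a parser separately, which is the maintainability point made in the quoted message. Unmapped elements are simply skipped, so the "garbage" never reaches the output.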

Antoni Myłka
antoni.my...@gmail.com
