Hi, I'm a newb here, but I've browsed the source, Jira and the list archives, so I think I'm getting a feel for what Tika is. (I'll also be at ApacheCon next week and look forward to the Tika session.)
I have some general questions about metadata keys in Tika. I see some common metadata key declarations, e.g. DublinCore.java and CreativeCommons.java. I also see some basic introspection in the Metadata class to get the list of keys it contains and whether an entry is multi-valued. For the most part, however, the actual list of defined keys seems to be compiled knowledge in the concrete parser's source code, which the client code is presumably expected to be closely aligned with. I don't see any pattern of public declaration of key sets apart from the couple of string-constant files mentioned, nor any runtime introspection (or configuration) of the keys or sets of keys that a particular parser extracts. There also doesn't appear to be any simple textual approach to avoiding key-name collisions, e.g. package-style names or some other namespace convention.

Has there been discussion of anything like well-defined metadata "schemas", where collections of keys supported for particular MIME types are declared and possibly extended in a predictable way?

I have a few use cases in mind. One involves a configuration/deployment for a particular vertical (e.g. a records management repository for a specific industry, or a DAM repository for a specialized broadcast media producer). Here the end user wants to aggregate a lot of "deep" extraction capability for some formats, some more or less standardized and some very domain- or organization-specific, and possibly configure which metadata is extracted from each file type based on workflow details that may change with business context. Having some high(er)-level way of determining what keys different parsers generate from each MIME type, and which "standard" keys are implemented, may be useful.
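To make the "schema" idea a bit more concrete, here is a rough sketch of the kind of thing I have in mind. None of this is existing Tika API: KeySchema, SchemaRegistry, the "acme" prefix and the prefix:name convention are all hypothetical names made up for illustration, and the code uses only plain java.util classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical: a named set of metadata keys sharing one namespace prefix. */
final class KeySchema {
    private final String prefix;
    private final Set<String> localNames;

    KeySchema(String prefix, String... localNames) {
        this.prefix = prefix;
        this.localNames = new LinkedHashSet<>(Arrays.asList(localNames));
    }

    /** Returns the qualified key names, e.g. "dc:title", in declaration order. */
    Set<String> keys() {
        Set<String> out = new LinkedHashSet<>();
        for (String name : localNames) {
            out.add(prefix + ":" + name);
        }
        return out;
    }
}

/** Hypothetical: records which schemas are declared for which MIME type. */
final class SchemaRegistry {
    private final Map<String, List<KeySchema>> byMimeType = new HashMap<>();

    /** Declares that documents of this MIME type may carry keys from the schema. */
    void declare(String mimeType, KeySchema schema) {
        byMimeType.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(schema);
    }

    /** Lets client code ask, at runtime, which keys to expect for a MIME type. */
    Set<String> keysFor(String mimeType) {
        Set<String> out = new LinkedHashSet<>();
        for (KeySchema schema : byMimeType.getOrDefault(mimeType, Collections.emptyList())) {
            out.addAll(schema.keys());
        }
        return out;
    }
}

class SchemaDemo {
    public static void main(String[] args) {
        // A "standard" schema plus an organization-specific extension.
        KeySchema dublinCore = new KeySchema("dc", "title", "creator");
        KeySchema acmeRecords = new KeySchema("acme", "caseNumber");

        SchemaRegistry registry = new SchemaRegistry();
        registry.declare("application/pdf", dublinCore);
        registry.declare("application/pdf", acmeRecords);

        // prints [dc:title, dc:creator, acme:caseNumber]
        System.out.println(registry.keysFor("application/pdf"));
    }
}
```

The point is only that key sets become declared data that can be queried per MIME type, and that a prefix keeps an organization-specific "title" from colliding with a standard one. Whether something like this belongs in the parsers, in configuration, or somewhere else is exactly what I'd like to hear opinions on.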
Another use case is just for the development community adding new parsers, possibly including subclasses or aggregations of existing parsers: how do we keep track of all implemented keys for each MIME type, avoid collisions or unintentional overwrites, etc.?

Thanks for any discussion; pointers to other relevant threads are welcome too.

- Chris