[
https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653428#comment-14653428
]
Nick Burch commented on TIKA-1691:
----------------------------------
We generally apply a higher bar to things going into Tika Core than
sub-modules, in part because they have an immediately higher impact, and the
rules on changing/deprecating/compatibility are stronger. That often means we
need to ask more questions first!
One of the contracts of the Tika metadata system is that it should provide a
format-agnostic view of the metadata as best as it can. You, as the end user of
Tika, shouldn't need to know if one format calls it Author, one Creator, one
Created By; Tika's parsers handle that mapping for you internally. If there are
cases where Tika isn't doing that properly, we want to know! We should be
adding more property definitions, and setting those mappings in all the
parsers. That normalisation between formats is something Tika should do for
everyone, not something that individual users should need to worry themselves
with. If gaps exist, please raise tickets.
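To make the idea of format-agnostic keys concrete, here is a minimal sketch of the kind of alias table involved. It is only an illustration: the class name CreatorNormalizer is hypothetical, the choice of dc:creator as the common key is an assumption for the example, and Tika's parsers each do this mapping in their own way rather than through one shared table.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: hypothetical class, not part of Tika.
class CreatorNormalizer {
    // Format-specific names (lower-cased) mapped to one common key.
    // "dc:creator" as the target is an assumption made for this example.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("author", "dc:creator");
        ALIASES.put("creator", "dc:creator");
        ALIASES.put("created by", "dc:creator");
    }

    /** Returns the common key for a format-specific name, or the name itself if unknown. */
    static String normalize(String nativeName) {
        String key = nativeName.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, nativeName);
    }
}
```

A name with no alias is returned unchanged, which is roughly where a gap worth a ticket would show up.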
There are a number of downstream users of Tika Metadata who transform/translate
the output. The Tika XMP module is one such, Alfresco's metadata extractor
mapping another, JackRabbit has one too, SOLR has one, etc. We have JSON
serialisation as well. At least some of us would find it a bit odd to see
Alfresco metadata properties, or SOLR field definitions inside what's held in
the Tika Metadata object! All those projects seem to find it fine to read out
metadata keys+values, and map it into their own model on their side. If there's
another common downstream format, we should look to add a module / set of
serialisation classes for that too.
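As a rough sketch of what "read out the keys+values and map on your own side" can look like, the snippet below renames keys in a plain map. The assumptions are loud ones: DownstreamMapper is a hypothetical class, a Map of String to List of String stands in for the Tika Metadata object, and the field name author_s is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical consumer-side mapper; not part of Tika or any downstream project.
class DownstreamMapper {

    /**
     * Renames extracted metadata keys to a downstream model's field names.
     * Keys with no entry in fieldMap pass through unchanged.
     */
    static Map<String, List<String>> toDownstream(
            Map<String, List<String>> extracted,
            Map<String, String> fieldMap) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : extracted.entrySet()) {
            String target = fieldMap.getOrDefault(e.getKey(), e.getKey());
            // Merge values when two source keys map to the same target field.
            out.computeIfAbsent(target, k -> new ArrayList<>()).addAll(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for metadata read out of Tika after parsing.
        Map<String, List<String>> extracted = new LinkedHashMap<>();
        extracted.put("dc:creator", Arrays.asList("Jane Doe"));
        extracted.put("dc:title", Arrays.asList("Annual report"));

        // The downstream project's own mapping table (invented field name).
        Map<String, String> fieldMap = new HashMap<>();
        fieldMap.put("dc:creator", "author_s");

        System.out.println(toDownstream(extracted, fieldMap));
    }
}
```

The point of the design is that the mapping table lives with the consumer, so the metadata coming out of Tika keeps only its own key names.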
Is there a use-case for runtime-specific downstream mappings, such that when
you run it on one machine and/or dataset you want dc:subject to map to
custom:Long_Title, but on another it's custom:Short_Title? If that's it, I
could probably see the case for a runtime-configurable wrapper/serializer, but
some more details on the use-case would be helpful, so we can make it easy to
use / extend / integrate with / etc.
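For discussion, a runtime-configurable wrapper of that sort might look something like the sketch below. Everything here is hypothetical: the class RuntimeMappedSerializer, the "source => target" rule syntax, and the use of a single-valued Map in place of the Tika Metadata object are all assumptions made to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical wrapper; neither this class nor the rule syntax exists in Tika.
class RuntimeMappedSerializer {
    private final Map<String, String> rules = new LinkedHashMap<>();

    /** Parses one rule per line, e.g. "dc:subject => custom:Long_Title". */
    RuntimeMappedSerializer(String ruleText) {
        for (String line : ruleText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = trimmed.split("=>", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.put(parts[0].trim(), parts[1].trim());
        }
    }

    /** Returns a copy of the metadata with keys renamed by the loaded rules. */
    Map<String, String> serialize(Map<String, String> metadata) {
        Map<String, String> out = new LinkedHashMap<>();
        metadata.forEach((k, v) -> out.put(rules.getOrDefault(k, k), v));
        return out;
    }
}
```

One machine could then load the rule "dc:subject => custom:Long_Title" and another "dc:subject => custom:Short_Title", while the extracted metadata itself stays the same in both cases.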
If there's something else you're trying to do, please could you explain the
use-case some more? Possibly on a wiki page, if it gets too hard to do here,
with some examples. We're not all doing exactly the same things, so what's
obvious for one person might not be for another! We're not all rocket
scientists here... ;-) If we can get it explained, then we can all help refine
the design as needed, ensure it's as supported and widely usable as possible,
and documented in a way that new community members can understand too!
(The mapping example given in the PDF looks to be something that Tika ought to
be doing already, so if there are cases when it isn't then those are bugs!)
> Apache Tika for enabling metadata interoperability
> --------------------------------------------------
>
> Key: TIKA-1691
> URL: https://issues.apache.org/jira/browse/TIKA-1691
> Project: Tika
> Issue Type: New Feature
> Reporter: Giuseppe Totaro
> Assignee: Giuseppe Totaro
> Labels: mapping, metadata
> Attachments: mapping_example.pdf
>
>
> If I am not wrong, enabling consistent metadata across file formats is already
> (partially) provided in Tika by relying on {{TikaCoreProperties}} and,
> within the context of Solr, {{ExtractingRequestHandler}} (by defining how to
> map metadata fields in {{solrconfig.xml}}). However, I am working on a new
> component for both schema mapping (to operate on the name of metadata
> properties) and instance transformation (to operate on the value of metadata)
> that consists, essentially, of the following changes:
> * A wrapper around the {{Metadata}} object ({{MappedMetadata.java}}) that decorates
> the {{set}} method (currently, line number 367 of {{Metadata.java}}) by
> applying the given mapping functions (via configuration) before setting
> metadata properties.
> * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility
> methods to map a set of metadata to the target schema.
> * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be
> configured via an XML file (organized as shown in the following snippet) and
> allows fine-grained metadata mapping by using Java reflection.
> {code:xml|title=tika-metadata.xml|borderStyle=solid}
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <mappings>
>     <mapping type="type/sub-type">
>       <relation name="SOURCE_FIELD">
>         <target>TARGET_FIELD</target>
>         <expression>exclude|include|equivalent|overlap</expression>
>         <function name="FUNCTION_NAME">
>           <argument>ARGUMENT_VALUE</argument>
>         </function>
>         <cardinality>
>           <source>SOURCE_CARDINALITY</source>
>           <target>TARGET_CARDINALITY</target>
>           <order>ORDER_NUMBER</order>
>           <dependencies>
>             <field>FIELD_NAME</field>
>           </dependencies>
>         </cardinality>
>       </relation>
>     </mapping>
>     ...
>     <mapping> <!-- This contains the fallback strategy for unknown metadata -->
>       <relation>
>         ...
>       </relation>
>     </mapping>
>   </mappings>
> </properties>
> {code}
> The theoretical definition of metadata mapping is available in "[A survey of
> techniques for achieving metadata
> interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
> This paper also shows some basic examples of metadata mappings.
> Currently, I am still working on some core functionalities, but I have
> already performed some experiments by using a small prototype.
> By the way, I think that we should modify the method {{add}} in order to use
> {{set}} instead of {{metadata.put}} (currently, line number 316 of
> {{Metadata.java}}). This is a trivial change (I could create a new Jira issue
> about that), but it would make it consistent with the other implementation
> of the {{add}} method and, moreover, the methods of {{Metadata}} could be
> extended more easily.
> I would really appreciate your feedback about this proposal. If you believe
> that it is a good idea, I could provide the code in a few days.
> Thanks a lot,
> Giuseppe