[
https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637930#comment-14637930
]
Giuseppe Totaro commented on TIKA-1691:
---------------------------------------
Hello [~gagravarr],
your feedback is very much appreciated. I believe that providing metadata
mapping on the getter side is a great idea. However, I will try to clarify my
proposal below by reporting two (high-level) use cases.
As a first use case, we can consider the following:
We want to index both textual content and metadata from a heterogeneous set of
digital documents, providing uniform access to the metadata properties
extracted from files. Therefore, we want to allow users to submit search
queries by using an end-user specific mediated schema.
We can summarize the use case above as follows:
# collect a huge amount of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT,
etc.);
# extract both text and metadata from files by using Tika;
# map all metadata properties to a mediated schema that will be used for
searching purposes;
# create an inverted index from the extracted contents;
# use the index in order to perform search queries based on metadata values.
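The steps above can be sketched in a few lines of Java. This is only a minimal illustration, not the proposed implementation: plain maps stand in for the metadata that Tika would extract, and the class name {{MediatedIndexSketch}}, the property names ({{Author}}, {{dc:creator}}, {{Artist}}) and the mediated name {{creator}} are hypothetical placeholders.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: map source property names to a mediated schema, then index the values. */
final class MediatedIndexSketch {
    // Schema mapping (assumption): source property name -> mediated property name.
    private final Map<String, String> schemaMapping = new HashMap<>();
    // Inverted index: "mediatedName=value" -> identifiers of the documents carrying that value.
    private final Map<String, Set<String>> index = new HashMap<>();

    void addMapping(String sourceName, String mediatedName) {
        schemaMapping.put(sourceName, mediatedName);
    }

    /** Indexes one document; 'extracted' stands in for the metadata returned by the extraction step. */
    void indexDocument(String docId, Map<String, String> extracted) {
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String mediated = schemaMapping.get(e.getKey());
            if (mediated == null) {
                continue; // properties without a mapping are skipped in this sketch
            }
            index.computeIfAbsent(key(mediated, e.getValue()), k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the documents whose mediated property has the given value (case-insensitive). */
    Set<String> search(String mediatedName, String value) {
        Set<String> hits = index.get(key(mediatedName, value));
        return hits == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(hits);
    }

    private static String key(String name, String value) {
        return name + "=" + value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        MediatedIndexSketch idx = new MediatedIndexSketch();
        idx.addMapping("Author", "creator");
        idx.addMapping("dc:creator", "creator");
        idx.indexDocument("a.pdf", Collections.singletonMap("Author", "Jane Doe"));
        idx.indexDocument("b.doc", Collections.singletonMap("dc:creator", "jane doe"));
        System.out.println(idx.search("creator", "Jane Doe"));
    }
}
```

A query on the mediated property {{creator}} then reaches both documents, although each one exposed the value under a different source property.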
Another use case is the following:
We want to compute some similarity metrics based on metadata features. To
compute similarity, we need to provide the semantic correspondences among
different metadata schemes.
We can summarize the use case above as follows:
# collect a huge amount of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT,
etc.);
# extract both text and metadata from files by using Tika;
# map all metadata properties to a mediated schema that will be used for
computing similarity across different schemes;
# use the metadata mapping to compute the given similarity metric among
metadata from different schemes.
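As a rough illustration of this second use case, the sketch below renames the properties of two documents to a shared schema and then compares them. The use case does not prescribe a metric, so Jaccard similarity is used here purely as an assumption; the class name {{MetadataSimilaritySketch}} and all property names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: compare documents after mapping their metadata to a shared schema. */
final class MetadataSimilaritySketch {

    /** Renames properties according to the mapping; properties without a mapping are dropped. */
    static Set<String> toMediatedFeatures(Map<String, String> extracted, Map<String, String> mapping) {
        Set<String> features = new HashSet<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String mediated = mapping.get(e.getKey());
            if (mediated != null) {
                features.add(mediated + "=" + e.getValue());
            }
        }
        return features;
    }

    /** Jaccard similarity: size of the intersection over size of the union (0 if both sets are empty). */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Assumed correspondences between two source schemes and the mediated schema.
        Map<String, String> mapping = new HashMap<>();
        mapping.put("Author", "creator");
        mapping.put("Title", "title");
        mapping.put("dc:creator", "creator");
        mapping.put("dc:title", "title");

        Map<String, String> pdf = new HashMap<>();
        pdf.put("Author", "Jane Doe");
        pdf.put("Title", "Report");
        Map<String, String> doc = new HashMap<>();
        doc.put("dc:creator", "Jane Doe");
        doc.put("dc:title", "Summary");

        System.out.println(jaccard(toMediatedFeatures(pdf, mapping), toMediatedFeatures(doc, mapping)));
    }
}
```

Without the mapping step the two feature sets would be disjoint, because the property names differ even where the values agree.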
Currently, Tika enables consistent metadata across file formats by relying on
[TikaCoreProperties|http://tika.apache.org/1.9/api/org/apache/tika/metadata/TikaCoreProperties.html],
which are defined in terms of other standard namespaces. However, this core set
of metadata could limit the interoperability among many metadata schemes, since
Tika developers are continually adding support for new file types (and
metadata schemes).
Furthermore, I have identified two more functionalities for better metadata
interoperability:
* a fine-grained mapping technique to potentially define metadata mappings for
each mimetype. This makes it possible, for example, either to exclude the
mapping of metadata for some types or to provide different mappings of the
same schema on different types.
* a metadata mapping technique that subsumes schema mapping (property names)
and instance transformation (property values).
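Both functionalities can be illustrated together with a small sketch. It is an assumption-laden illustration rather than the actual component: the class {{TypedMappingSketch}}, its {{Rule}} type and the example property names are all hypothetical. Rules are registered per mimetype, and each rule carries a target name (schema mapping) plus a function applied to the value (instance transformation).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical sketch: per-mimetype rules that rename a property and transform its value. */
final class TypedMappingSketch {

    /** One rule: target property name (schema mapping) and value function (instance transformation). */
    static final class Rule {
        final String target;
        final Function<String, String> transform;

        Rule(String target, Function<String, String> transform) {
            this.target = target;
            this.transform = transform;
        }
    }

    // mimetype -> (source property name -> rule)
    private final Map<String, Map<String, Rule>> rulesByType = new HashMap<>();

    void addRule(String mimeType, String source, String target, Function<String, String> transform) {
        rulesByType.computeIfAbsent(mimeType, k -> new HashMap<>()).put(source, new Rule(target, transform));
    }

    /** Applies the rules of the given mimetype; properties without a rule pass through unchanged. */
    Map<String, String> apply(String mimeType, Map<String, String> extracted) {
        Map<String, Rule> rules = rulesByType.getOrDefault(mimeType, Collections.<String, Rule>emptyMap());
        Map<String, String> mapped = new TreeMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            Rule rule = rules.get(e.getKey());
            if (rule == null) {
                mapped.put(e.getKey(), e.getValue());
            } else {
                mapped.put(rule.target, rule.transform.apply(e.getValue()));
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        TypedMappingSketch mappings = new TypedMappingSketch();
        // The same source property is mapped differently depending on the mimetype.
        mappings.addRule("application/pdf", "Author", "creator", String::trim);
        mappings.addRule("image/jpeg", "Author", "photographer", s -> s.toUpperCase(Locale.ROOT));

        Map<String, String> extracted = Collections.singletonMap("Author", " Jane Doe ");
        System.out.println(mappings.apply("application/pdf", extracted));
        System.out.println(mappings.apply("image/jpeg", extracted));
        System.out.println(mappings.apply("text/plain", extracted)); // no rules: mapping excluded
    }
}
```

A mimetype with no registered rules is left untouched, which corresponds to excluding the mapping for that type.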
I am working on providing a default mediated schema (via XML-based
configuration) based on a core set of utility (Java) methods for metadata
mapping.
The attachment (_mapping_example_) contains an extremely simple diagram that
shows an example of metadata mapping by defining source property, target
property (which together essentially provide the schema mapping), mapping
expression (which describes the semantics of each mapping relationship), and
function (which provides the instance transformation).
By the way, I am also working on a [D3|http://d3js.org/]-based utility that
visualizes the new metadata mappings provided in Tika, starting from
the XML configuration file (i.e., {{tika-metadata.xml}}). The output is based
on the [hierarchical edge bundling
algorithm|https://github.com/mbostock/d3/wiki/Bundle-Layout].
Regarding the possibility of providing mappings on the getter side, I think that
is a great idea. I believe that we should let users select programmatically
(or via configuration) whether mappings are applied on the setter side or on
the getter side. For instance, with mappings on the setter side the actual
mapping is performed only once, during extraction, whereas on the getter side
the mappings would be performed for each {{metadata.get()}}.
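The trade-off between the two sides can be made concrete with a small sketch. Everything in it is hypothetical ({{MappingSideSketch}}, {{SetterSide}}, {{GetterSide}} and {{CountingMapper}} are illustrative names, not Tika classes): a counter records how often the mapping function runs under each strategy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch contrasting setter-side and getter-side metadata mapping. */
final class MappingSideSketch {

    /** Example name mapping that counts its invocations, to make the cost of each strategy visible. */
    static final class CountingMapper implements UnaryOperator<String> {
        int calls = 0;

        @Override
        public String apply(String name) {
            calls++;
            return "Author".equals(name) ? "creator" : name;
        }
    }

    /** Setter side: the name is mapped once, when the value is stored. */
    static final class SetterSide {
        private final Map<String, String> store = new HashMap<>();
        private final UnaryOperator<String> mapper;

        SetterSide(UnaryOperator<String> mapper) {
            this.mapper = mapper;
        }

        void set(String name, String value) {
            store.put(mapper.apply(name), value);
        }

        String get(String mappedName) {
            return store.get(mappedName);
        }
    }

    /** Getter side: raw names are stored and the mapping is evaluated on every lookup. */
    static final class GetterSide {
        private final Map<String, String> store = new HashMap<>();
        private final UnaryOperator<String> mapper;

        GetterSide(UnaryOperator<String> mapper) {
            this.mapper = mapper;
        }

        void set(String name, String value) {
            store.put(name, value);
        }

        String get(String mappedName) {
            for (Map.Entry<String, String> e : store.entrySet()) {
                if (mapper.apply(e.getKey()).equals(mappedName)) {
                    return e.getValue();
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        CountingMapper setterMapper = new CountingMapper();
        SetterSide setterSide = new SetterSide(setterMapper);
        setterSide.set("Author", "Jane Doe");
        setterSide.get("creator");
        setterSide.get("creator");

        CountingMapper getterMapper = new CountingMapper();
        GetterSide getterSide = new GetterSide(getterMapper);
        getterSide.set("Author", "Jane Doe");
        getterSide.get("creator");
        getterSide.get("creator");

        System.out.println("setter-side mapper calls: " + setterMapper.calls);
        System.out.println("getter-side mapper calls: " + getterMapper.calls);
    }
}
```

The getter-side variant keeps the original property names available, at the price of re-evaluating the mapping on each lookup; the setter-side variant pays once but discards the source names.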
Thanks again Nick for your feedback. I hope that you will give more
comments on this work. I would really appreciate it.
I take this opportunity to thank [~chrismattmann] for supporting me on this
work.
Cheers,
Giuseppe
> Apache Tika for enabling metadata interoperability
> --------------------------------------------------
>
> Key: TIKA-1691
> URL: https://issues.apache.org/jira/browse/TIKA-1691
> Project: Tika
> Issue Type: New Feature
> Reporter: Giuseppe Totaro
> Assignee: Giuseppe Totaro
> Labels: mapping, metadata
> Attachments: mapping_example.pdf
>
>
> If I am not wrong, enabling consistent metadata across file formats is already
> (partially) provided in Tika by relying on {{TikaCoreProperties}} and,
> within the context of Solr, {{ExtractingRequestHandler}} (by defining how to
> map metadata fields in {{solrconfig.xml}}). However, I am working on a new
> component for both schema mapping (to operate on the name of metadata
> properties) and instance transformation (to operate on the value of metadata)
> that consists, essentially, of the following changes:
> * A wrapper around the {{Metadata}} object ({{MappedMetadata.java}}) that decorates
> the {{set}} method (currently, line number 367 of {{Metadata.java}}) by
> applying the given mapping functions (via configuration) before setting
> metadata properties.
> * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility
> methods to map a set of metadata to the target schema.
> * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be
> configured via an XML file (organized as shown in the following snippet) and
> allows fine-grained metadata mapping by using Java reflection.
> {code:xml|title=tika-metadata.xml|borderStyle=solid}
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <mappings>
>     <mapping type="type/sub-type">
>       <relation name="SOURCE_FIELD">
>         <target>TARGET_FIELD</target>
>         <expression>exclude|include|equivalent|overlap</expression>
>         <function name="FUNCTION_NAME">
>           <argument>ARGUMENT_VALUE</argument>
>         </function>
>         <cardinality>
>           <source>SOURCE_CARDINALITY</source>
>           <target>TARGET_CARDINALITY</target>
>           <order>ORDER_NUMBER</order>
>           <dependencies>
>             <field>FIELD_NAME</field>
>           </dependencies>
>         </cardinality>
>       </relation>
>     </mapping>
>     ...
>     <mapping> <!-- This contains the fallback strategy for unknown metadata -->
>       <relation>
>         ...
>       </relation>
>     </mapping>
>   </mappings>
> </properties>
> {code}
> The theoretical definition of metadata mapping is available in "[A survey of
> techniques for achieving metadata
> interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
> This paper also shows some basic examples of metadata mappings.
> Currently, I am still working on some core functionalities, but I have
> already performed some experiments by using a small prototype.
> By the way, I think that we should modify the method {{add}} in order to use
> {{set}} instead of {{metadata.put}} (currently, line number 316 of
> {{Metadata.java}}). This is a trivial change (I could create a new Jira issue
> about that), but it would make the method consistent with the other
> implementation of the {{add}} method and, moreover, the methods of
> {{Metadata}} could be extended more easily.
> I would really appreciate your feedback about this proposal. If you believe
> that it is a good idea, I could provide the code in a few days.
> Thanks a lot,
> Giuseppe