[jira] [Commented] (TIKA-1691) Apache Tika for enabling metadata interoperability

Giuseppe Totaro (JIRA) Wed, 22 Jul 2015 16:43:29 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637930#comment-14637930
 ]


Giuseppe Totaro commented on TIKA-1691:
---------------------------------------

Hello [~gagravarr],

your feedback is very much appreciated. I believe that providing metadata 
mapping on the getter side is a great idea. However, I will try to clarify my 
proposal below by reporting two (high-level) use cases.

As a first use case, consider the following:

We want to index both textual content and metadata from a heterogeneous set of 
digital documents, providing uniform access to the metadata properties 
extracted from files. Therefore, we want to allow users to submit search 
queries by using an end-user specific mediated schema.

We can summarize the use case above as follows:
# collect a large number of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT, 
etc.);
# extract both text and metadata from files by using Tika;
# map all metadata properties to a mediated schema that will be used for 
searching purposes;
# create an inverted index from the extracted contents;
# use the index in order to perform search queries based on metadata values.
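
Steps 3 to 5 above can be sketched roughly as follows. This is only an illustrative sketch, not the proposed implementation: plain Java maps stand in for the metadata extracted by Tika, and the class name, the mediated property names ({{creator}}, {{modified}}) and the source property names are assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class MediatedIndexSketch {

    // Hypothetical schema mapping: source property name -> mediated property name.
    static final Map<String, String> SCHEMA_MAPPING = Map.of(
            "Author", "creator",
            "dc:creator", "creator",
            "meta:author", "creator",
            "Last-Modified", "modified",
            "dcterms:modified", "modified");

    // Step 3: rename the properties of one document to the mediated schema.
    // Properties without a mapping are dropped in this sketch; if two source
    // properties map to the same target, one of them overwrites the other.
    static Map<String, String> toMediated(Map<String, String> extracted) {
        Map<String, String> mediated = new TreeMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String target = SCHEMA_MAPPING.get(e.getKey());
            if (target != null) {
                mediated.put(target, e.getValue());
            }
        }
        return mediated;
    }

    // Step 4: inverted index from "property=value" to the ids of matching documents.
    static Map<String, Set<String>> buildIndex(Map<String, Map<String, String>> docs) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> doc : docs.entrySet()) {
            for (Map.Entry<String, String> p : toMediated(doc.getValue()).entrySet()) {
                String key = p.getKey() + "=" + p.getValue().toLowerCase(Locale.ROOT);
                index.computeIfAbsent(key, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    // Step 5: query the index by mediated property name and value.
    static Set<String> search(Map<String, Set<String>> index, String property, String value) {
        return index.getOrDefault(property + "=" + value.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> docs = Map.of(
                "report.pdf", Map.of("dc:creator", "Alice"),
                "notes.doc", Map.of("Author", "alice", "Last-Modified", "2015-07-22"),
                "photo.jpg", Map.of("meta:author", "Bob"));
        Map<String, Set<String>> index = buildIndex(docs);
        // Both documents are found, although they use different source properties.
        System.out.println(search(index, "creator", "Alice")); // prints [notes.doc, report.pdf]
    }
}
```

The point of the sketch is that the user queries only the mediated name {{creator}}, regardless of which source property each file format uses.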

Another use case is the following:

We want to compute some similarity metrics based on metadata features. To 
compute similarity, we need to provide the semantic correspondences among 
different metadata schemes.

We can summarize the use case above as follows:
# collect a large number of heterogeneous files (e.g., PDF, DOC, JPG, PPT, TXT, 
etc.);
# extract both text and metadata from files by using Tika;
# map all metadata properties to a mediated schema that will be used for 
computing similarity across different schemes;
# use the metadata mapping to compute the given similarity metric among 
metadata from different schemes.
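
As a rough illustration of the last two steps, the sketch below compares two metadata sets after mapping them to a mediated schema. The Jaccard coefficient is used only as one possible metric (the proposal does not name a specific one), and the class name, property names and mapping table are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class MetadataSimilaritySketch {

    // Hypothetical schema mapping: source property name -> mediated property name.
    static final Map<String, String> SCHEMA_MAPPING = Map.of(
            "Author", "creator",
            "dc:creator", "creator",
            "Last-Modified", "modified",
            "dcterms:modified", "modified");

    // Turn extracted metadata into comparable features of the form
    // "mediatedName=normalizedValue"; unmapped properties are ignored.
    static Set<String> features(Map<String, String> extracted) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            String target = SCHEMA_MAPPING.get(e.getKey());
            if (target != null) {
                result.add(target + "=" + e.getValue().trim().toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    // Jaccard coefficient: |A intersect B| / |A union B|.
    // Two empty feature sets are treated as having similarity 0 in this sketch.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Map<String, String> pdf = Map.of("dc:creator", "Alice", "dcterms:modified", "2015-07-22");
        Map<String, String> doc = Map.of("Author", "alice", "Last-Modified", "2015-07-21");
        // One shared feature (creator=alice) out of three distinct features.
        System.out.println(jaccard(features(pdf), features(doc)));
    }
}
```

Without the mapping step the two files would share no property names at all, so any name-based metric would report them as unrelated.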

Currently, Tika enables consistent metadata across file formats by relying on 
[TikaCoreProperties|http://tika.apache.org/1.9/api/org/apache/tika/metadata/TikaCoreProperties.html],
 which are defined in terms of other standard namespaces. However, this core set 
of metadata could limit interoperability among the many metadata schemes, since 
Tika developers are continually adding support for new file types (and 
metadata schemes). 

Furthermore, I have identified two more functionalities for better metadata 
interoperability:
* a fine-grained mapping technique to potentially define metadata mappings for 
each mimetype. This makes it possible, for example, either to exclude the 
mapping of metadata for some types or to provide different mappings of the same 
schema for different types. 
* a metadata mapping technique that subsumes schema mapping (property names) 
and instance transformation (property values).
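
The two points above can be sketched together as a small registry keyed by media type, where each relation carries both a target name (schema mapping) and a function applied to the value (instance transformation). All names here ({{TypedMappingSketch}}, {{Relation}}, {{register}}, {{apply}}) are hypothetical and only illustrate the idea; they are not part of Tika.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

class TypedMappingSketch {

    // One mapping relation: target property name (schema mapping)
    // plus a function applied to the value (instance transformation).
    static final class Relation {
        final String target;
        final UnaryOperator<String> function;

        Relation(String target, UnaryOperator<String> function) {
            this.target = target;
            this.function = function;
        }
    }

    // media type -> (source property name -> relation)
    private final Map<String, Map<String, Relation>> byType = new HashMap<>();

    void register(String mediaType, String source, String target, UnaryOperator<String> function) {
        byType.computeIfAbsent(mediaType, t -> new HashMap<>())
              .put(source, new Relation(target, function));
    }

    // Apply the relations registered for the given media type.
    // A type with no registered relations yields an empty result,
    // which is how this sketch "excludes" mapping for some types.
    Map<String, String> apply(String mediaType, Map<String, String> extracted) {
        Map<String, Relation> relations = byType.getOrDefault(mediaType, Map.of());
        Map<String, String> mapped = new HashMap<>();
        for (Map.Entry<String, String> e : extracted.entrySet()) {
            Relation r = relations.get(e.getKey());
            if (r != null) {
                mapped.put(r.target, r.function.apply(e.getValue()));
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        TypedMappingSketch mappings = new TypedMappingSketch();
        // The same source property is mapped differently depending on the media type.
        mappings.register("application/pdf", "Author", "creator", String::trim);
        mappings.register("image/jpeg", "Author", "photographer",
                v -> v.trim().toUpperCase(Locale.ROOT));

        Map<String, String> extracted = Map.of("Author", " Alice ");
        System.out.println(mappings.apply("application/pdf", extracted)); // prints {creator=Alice}
        System.out.println(mappings.apply("image/jpeg", extracted));      // prints {photographer=ALICE}
        System.out.println(mappings.apply("text/plain", extracted));      // prints {}
    }
}
```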

I am working on providing a default mediated schema (via XML-based 
configuration) based on a core set of utility (Java) methods for metadata 
mapping.

The attachment (_mapping_example_) contains a very simple diagram that shows 
an example of metadata mapping, defined by a source property, a target 
property (which together essentially provide the schema mapping), a mapping 
expression (which describes the semantics of each mapping relationship), and a 
function (which provides the instance transformation).

By the way, I am also working on a [D3|http://d3js.org/]-based utility for 
visualizing the new metadata mappings provided in Tika, starting from 
the XML configuration file (i.e., {{tika-metadata.xml}}). The output is based 
on the [hierarchical edge bundling 
algorithm|https://github.com/mbostock/d3/wiki/Bundle-Layout].

Regarding the possibility of providing mappings on the getter side, I think that 
is a great idea. I believe that we should let users choose, programmatically 
(or via configuration), whether or not to apply mappings on the setter side. 
For instance, with mappings on the setter side the actual mapping is performed 
only during extraction, whereas on the getter side the mappings 
would be performed for each {{metadata.get()}}.
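
The trade-off can be illustrated with a minimal sketch in which plain maps stand in for the {{Metadata}} object. The classes {{SetterSide}} and {{GetterSide}}, the {{mappingCalls}} counter and the single-entry mapping table are all hypothetical and exist only to count how often the mapping is applied.

```java
import java.util.HashMap;
import java.util.Map;

class MappingSideSketch {

    // Hypothetical schema mapping: source property name -> mediated property name.
    static final Map<String, String> SCHEMA_MAPPING = Map.of("Author", "creator");

    static String mapName(String source) {
        return SCHEMA_MAPPING.getOrDefault(source, source);
    }

    // Setter side: the mapping is applied once, when the value is stored.
    static final class SetterSide {
        private final Map<String, String> store = new HashMap<>();
        int mappingCalls = 0;

        void set(String name, String value) {
            mappingCalls++;
            store.put(mapName(name), value);
        }

        String get(String mediatedName) {
            return store.get(mediatedName);
        }
    }

    // Getter side: original names are stored untouched and the mapping
    // is applied again on every lookup.
    static final class GetterSide {
        private final Map<String, String> store = new HashMap<>();
        int mappingCalls = 0;

        void set(String name, String value) {
            store.put(name, value);
        }

        String get(String mediatedName) {
            for (Map.Entry<String, String> e : store.entrySet()) {
                mappingCalls++;
                if (mapName(e.getKey()).equals(mediatedName)) {
                    return e.getValue();
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        SetterSide setter = new SetterSide();
        GetterSide getter = new GetterSide();
        setter.set("Author", "Alice");
        getter.set("Author", "Alice");
        for (int i = 0; i < 3; i++) {
            setter.get("creator");
            getter.get("creator");
        }
        System.out.println(setter.mappingCalls); // prints 1
        System.out.println(getter.mappingCalls); // prints 3
    }
}
```

In this sketch the getter-side variant keeps the original property names available, at the cost of repeating the mapping on every read, while the setter-side variant pays the cost once but discards the original names.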

Thanks again Nick for your feedback. I hope that you will share more 
comments on this work. I would really appreciate it.
I take this opportunity to thank [~chrismattmann] for supporting me on this 
work.

Cheers,
Giuseppe

> Apache Tika for enabling metadata interoperability
> --------------------------------------------------
>
>                 Key: TIKA-1691
>                 URL: https://issues.apache.org/jira/browse/TIKA-1691
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: mapping, metadata
>         Attachments: mapping_example.pdf
>
>
> If I am not wrong, enabling consistent metadata across file formats is already 
> (partially) provided in Tika by relying on {{TikaCoreProperties}} and, 
> within the context of Solr, {{ExtractingRequestHandler}} (by defining how to 
> map metadata fields in {{solrconfig.xml}}). However, I am working on a new 
> component for both schema mapping (to operate on the name of metadata 
> properties) and instance transformation (to operate on the value of metadata) 
> that consists, essentially, of the following changes:
> * A wrapper of {{Metadata}} object ({{MappedMetadata.java}}) that decorates 
> the {{set}} method (currently, line number 367 of {{Metadata.java}}) by 
> applying the given mapping functions (via configuration) before setting 
> metadata properties.
> * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility 
> methods to map a set of metadata to the target schema.
> * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be 
> configured via an XML file (organized as shown in the following snippet) and 
> allows fine-grained metadata mapping by using Java reflection.
> {code:xml|title=tika-metadata.xml|borderStyle=solid}
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <mappings>
>     <mapping type="type/sub-type">
>       <relation name="SOURCE_FIELD">
>         <target>TARGET_FIELD</target>
>         <expression>exclude|include|equivalent|overlap</expression>
>         <function name="FUNCTION_NAME">
>           <argument>ARGUMENT_VALUE</argument>
>         </function>
>         <cardinality>
>           <source>SOURCE_CARDINALITY</source>
>           <target>TARGET_CARDINALITY</target>
>           <order>ORDER_NUMBER</order>
>           <dependencies>
>             <field>FIELD_NAME</field>
>           </dependencies>
>         </cardinality>
>       </relation>
>     </mapping>
>     ...
>     <mapping> <!-- This contains the fallback strategy for unknown metadata -->
>       <relation>
>         ...
>       </relation>
>     </mapping>
>   </mappings>
> </properties>
> {code}
> The theoretical definition of metadata mapping is available in "[A survey of 
> techniques for achieving metadata 
> interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
>  This paper also shows some basic examples of metadata mappings.
> Currently, I am still working on some core functionalities, but I have 
> already performed some experiments by using a small prototype.
> By the way, I think that we should modify the {{add}} method to use 
> {{set}} instead of {{metadata.put}} (currently, line number 316 of 
> {{Metadata.java}}). This is a trivial change (I could create a new Jira issue 
> for it), but it would make the method consistent with the other implementation 
> of {{add}} and, moreover, the methods of {{Metadata}} could be 
> extended more easily.
> I would really appreciate your feedback about this proposal. If you believe 
> that it is a good idea, I could provide the code in a few days.
> Thanks a lot,
> Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
