Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "MetadataDiscussion" page has been changed by domtheo:
http://wiki.apache.org/tika/MetadataDiscussion?action=diff&rev1=2&rev2=3

  This page has been created to host a discussion on how Tika returns metadata 
for different kinds of documents. The goal is to make sure that Tika users have 
a chance to get to all of the metadata created and/or extracted by Tika.
  
  == Original Problem ==
- The original inspiration for this page was a Tika user who wanted to get 
access to the metadata for every document in an archive (e.g. zip, tar.gz, 
etc.). A way to get recursive metadata is described in the RecursiveMetadata 
article.
+ The original inspiration for this page was a Tika user who wanted to get 
access to the metadata for every document in an archive (e.g. zip, tar.gz, 
etc.). A way to get [[http://www.propertykita.com/rumah.html|Rumah Dijual]] 
recursive metadata is described in the RecursiveMetadata article 
[[http://vamostech.com/gps-tracking|GPS Tracker]] and 
[[http://www.pedatimotor.com|Aksesoris Sparepart Motor]].
  
  == Goals for this Page ==
  The goals for this page are bigger than the original problem. This page 
should hold a discussion about how to better meet different metadata needs for 
the different kinds of documents supported by Tika, and for the different kinds 
of users supported by Tika.
@@ -69, +69 @@

  When I first started using Tika, I had the naive dream that I could point the 
AutoDetectParser at anything and it would automatically find the document 
boundaries that matter to me and make everything I consider a single document 
look like the following:
  
  {{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns
    <head>
      <title>...</title>
      <thismeta>...</thismeta>
@@ -91, +91 @@

  == A Slightly Less Naive Non-Solution ==
  This solution is like the first naive solution, except that it uses legal XHTML:
  {{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns=
    <head>
      <title>...</title>
      <meta name="description" content="Example XHTML" />
@@ -112, +112 @@

  == Div Sections: No Place for Metadata ==
  The first two non-solutions ignored that decisions have already been made 
about how Tika will represent structured documents and simple containers in 
XHTML. Tika represents a simple container document something like the following:
  {{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns
    <head>
      <title>...</title>
    </head>
@@ -136, +136 @@

  
  The problem is that there is no place to put the metadata that is legal 
XHTML. The {{{<meta>}}} tags can only appear in the {{{<head>}}} section. Even 
if we wanted to put all metadata in the {{{<head>}}} section, doing so would 
mean that Tika could not stream the XHTML events and would instead have to parse 
entire containers in two passes: once to gather the metadata, and a second time 
to output all of the text.
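
The streaming trade-off described above can be sketched outside of Tika. The following Python is only an illustration under stated assumptions, not Tika's API: `stream_container`, `two_pass_container`, the `Entry` type, and the `entryN.name` naming of the `<meta>` tags are all hypothetical. It shows that a one-pass emitter has already closed `<head>` before it sees any entry metadata, while putting that metadata in `<head>` requires reading every entry first.

```python
from html import escape
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for one document inside a container:
# (metadata name-value pairs, extracted text).
Entry = Tuple[Dict[str, str], str]


def stream_container(entries: Iterable[Entry]) -> Iterator[str]:
    """One pass: emit XHTML as each entry is read."""
    # <head> must be written before the first entry has been parsed.
    yield "<html><head><title>...</title></head><body>"
    for _metadata, text in entries:
        # The entry's metadata is known only here, after </head> was emitted,
        # and a <div> has no legal place to hold it.
        yield f"<div><p>{escape(text)}</p></div>"
    yield "</body></html>"


def two_pass_container(entries: List[Entry]) -> Iterator[str]:
    """Two passes: gather all metadata first, then output all of the text."""
    yield "<html><head><title>...</title>"
    # Pass 1: every entry must already be available to fill <head>.
    for index, (metadata, _text) in enumerate(entries):
        for name, value in metadata.items():
            yield f'<meta name="entry{index}.{escape(name)}" content="{escape(value)}" />'
    yield "</head><body>"
    # Pass 2: walk the same entries again for their text.
    for _metadata, text in entries:
        yield f"<div><p>{escape(text)}</p></div>"
    yield "</body></html>"
```

Note that `stream_container` accepts any iterable, including a one-shot iterator, whereas `two_pass_container` needs a list (or some other re-readable source) because it walks the entries twice.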
  
- If XHTML had a way to specify arbitrary name-value pairs somewhere in the 
{{{<div>}}} section, that could be used as a place to associate metadata with a 
{{{<div>}}} section. As far as I can tell from the specification 
[http://www.w3schools.com/tags/tag_div.asp] there isn't a place for arbitrary 
name-value pairs.
+ If XHTML had a way to specify arbitrary name-value pairs somewhere in the 
{{{<div>}}} section, that could be used as a place to associate metadata with a 
{{{<div>}}} section. As far as I can tell from the specification there isn't a 
place for arbitrary name-value pairs.
  
  = Potential Solutions That Could Work =
  Hopefully we can find some solutions that actually work, and work for many 
kinds of users. It doesn't look like there is a way to represent metadata for 
nested sections or nested documents in XHTML, but there may be other ways to 
make nested metadata available to some users.
