Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "MetadataDiscussion" page has been changed by domtheo:
http://wiki.apache.org/tika/MetadataDiscussion?action=diff&rev1=2&rev2=3

This page has been created to host a discussion on how Tika returns metadata for different kinds of documents. The goal is to make sure that Tika users have a chance to get to all of the metadata created and/or extracted by Tika.

== Original Problem ==
- The original inspiration for this page was a Tika user who wanted to get access to the metadata for every document in an archive (e.g. zip, tar.gz, etc.). A way to get recursive metadata is described in the RecursiveMetadata article.
+ The original inspiration for this page was a Tika user who wanted to get access to the metadata for every document in an archive (e.g. zip, tar.gz, etc.). A way to get [[http://www.propertykita.com/rumah.html|Rumah Dijual]] recursive metadata is described in the RecursiveMetadata article [[http://vamostech.com/gps-tracking|GPS Tracker]] and [[http://www.pedatimotor.com|Aksesoris Sparepart Motor]].

== Goals for this Page ==
The goals for this page are bigger than the original problem. This page should hold a discussion about how to better meet different metadata needs for the different kinds of documents supported by Tika, and for the different kinds of users supported by Tika.
@@ -69, +69 @@
When I first started using Tika, I had the naive dream that I could point the AutoDetectParser at anything and it would automatically find the document boundaries that matter to me and make everything I consider a single document look like the following:
{{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns
<head>
<title>...</title>
<thismeta>...</thismeta>
@@ -91, +91 @@
== A Slightly Less Naive Non-Solution ==
This solution is like the first naive solution, except it uses legal XHTML
{{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns=
<head>
<title>...</title>
<meta name="description" content="Example XHTML" />
@@ -112, +112 @@
== Div Sections: No Place for Metadata ==
The first two non-solutions ignored that decisions have already been made about how Tika will represent structured documents and simple containers in XHTML. Tika represents a simple container document something like the following:
{{{
- <html xmlns="http://www.w3.org/1999/xhtml">
+ <html xmlns
<head>
<title>...</title>
</head>
@@ -136, +136 @@
The problem is that there is no place to put the metadata that is legal XHTML. The {{{<meta>}}} tags can only appear in the {{{<head>}}} section. Even if we wanted to put all metadata in the {{{<head>}}} section, doing so would mean that Tika could not stream the XHTML events, and would instead have to parse entire containers in two passes: once to gather the metadata, and a second time to output all of the text.
- If XHTML had a way to specify arbitrary name-value pairs somewhere in the {{{<div>}}} section, that could be used as a place to associate metadata with a {{{<div>}}} section. As far as I can tell from the specification [http://www.w3schools.com/tags/tag_div.asp] there isn't a place for arbitrary name-value pairs.
+ If XHTML had a way to specify arbitrary name-value pairs somewhere in the {{{<div>}}} section, that could be used as a place to associate metadata with a {{{<div>}}} section. As far as I can tell from the specification there isn't a place for arbitrary name-value pairs.

= Potential Solutions That Could Work =
Hopefully we can find some solutions that actually work, and work for many kinds of users. It doesn't look like there is a way to represent metadata for nested sections or nested documents in XHTML, but there may be other ways to make nested metadata available to some users.
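
One way to make nested metadata available outside the XHTML stream might be to return it as a separate tree that mirrors the container structure. The following is only a minimal sketch of that idea and is not Tika's API: it uses the Python standard library alone, handles only zip containers, and the names `collect_metadata` and `build_sample`, as well as the `size` and `compressed_size` keys, are hypothetical and chosen purely for illustration.

```python
import io
import zipfile


def collect_metadata(data, name="(root)"):
    """Return a nested dict describing `data` and, if it is a zip, each entry in it.

    Hypothetical structure: every node carries its own metadata plus a list
    of child nodes, so metadata stays attached to the document it describes.
    """
    node = {"name": name, "metadata": {"size": len(data)}, "children": []}
    buf = io.BytesIO(data)
    if zipfile.is_zipfile(buf):
        buf.seek(0)
        with zipfile.ZipFile(buf) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directories carry no document content
                # Recurse so archives inside archives get their own subtree.
                child = collect_metadata(archive.read(info), info.filename)
                child["metadata"]["compressed_size"] = info.compress_size
                node["children"].append(child)
    return node


def build_sample():
    """Build an in-memory zip that contains a text file and a nested zip."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as z:
        z.writestr("note.txt", "hello")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as z:
        z.writestr("readme.txt", "top level")
        z.writestr("inner.zip", inner.getvalue())
    return outer.getvalue()


tree = collect_metadata(build_sample())
```

Because each node owns its metadata, a caller can walk `tree["children"]` to reach the metadata of any embedded document without the text output needing a second pass. Whether a structure like this would suit Tika's streaming model is an open question for this discussion.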
