On Sun, Dec 7, 2008 at 11:22 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Wed, Dec 3, 2008 at 1:05 PM, Robert Burrell Donkin
> <[EMAIL PROTECTED]> wrote:
>> should be simple enough to support minimal subclassing eg
>> tika:content-type -> dc:format
>
> We could do that, but what's the use case?

i'm thinking mainly of automated or computer-assisted cases

> The primary use case I'm thinking of is having a clear set of metadata
> fields that I can easily map to specific fields in a search index. For
> this use case it doesn't really matter what metadata schema we use as
> long as it's clear enough and we are consistent in using it

yes

> (e.g. all
> dc:format values produced by Tika would be MIME types, all dates of a
> specific format, etc.).

let's assume that - when used with tika - dc:format is implicitly
subclassed as media-type (a MIME attribute of content-type). dc:format
is a well-used vocabulary. the difficult class of documents is those
that have both a MIME type and dc:format meta-data where the two
values disagree.

> A secondary use case is being able to easily use those fields when
> integrating with external metadata-aware applications. Here I think
> Dublin Core is the best alternative as I believe it's the most widely
> used and best understood (relatively speaking) metadata schema there
> is.

DC is imprecise and so difficult to work with. the semantic web crowd
now seem to prefer more precise schemas which are more suitable for
automated reasoners.

> Currently I don't see where using subclasses or alternative schemas
> would bring enough value to counter the added complexity, but I'd be
> happy to be proven wrong.

DC is not a rich vocabulary. taking a look at
http://lucene.apache.org/tika/formats.html for Microsoft's OLE 2
Compound Document format we have (i've tried to figure out some
mappings to dc in square brackets)

    * TITLE  Title [dc:title]
    * SUBJECT Subject  [dc:subject? dc:description? dc:abstract?]
    * AUTHOR Author  [dc:creator? -> dc:contributor?]
    * KEYWORDS Keywords -> [dc:subject]
    * COMMENTS Comments [?]
    * TEMPLATE Template [?]
    * LAST_SAVED_BY Last Saved By [?]
    * REVISION_NUMBER Revision Number [?]
    * LAST_PRINTED Last Printed [?]
    * LAST_SAVED Last Saved Time/Date [dc:date?]
    * PAGE_COUNT Number of Pages [?]
    * WORD_COUNT Number of Words [?]
    * CHARACTER_COUNT Number of Characters [?]
    * APPLICATION_NAME Name of Creating Application [dc:creator?]

it's probably possible to find reasonable DC mappings for some of the
rest. some look like concepts which aren't really covered. note also
that there are a number of exact meta-data attributes which may
reasonably be mapped into more general dc attributes, eg. LAST_SAVED
and LAST_PRINTED are both subclasses of dc:date.
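
to make the subclass idea concrete, here is a minimal python sketch
(not tika code). the `ole2:` property names and the `SUBCLASS_OF`
table are assumptions made up for illustration; the parent links just
follow the tentative mappings listed above.

```python
# hypothetical subclass table: specific property -> more general property.
# the "ole2:" names are invented for this sketch; they are not tika identifiers.
SUBCLASS_OF = {
    "ole2:last-saved": "dc:date",
    "ole2:last-printed": "dc:date",
    "ole2:keywords": "dc:subject",
    "tika:content-type": "dc:format",
}


def ancestors(name):
    """Return the chain of more general properties for name, nearest first."""
    chain = []
    seen = {name}
    while name in SUBCLASS_OF:
        name = SUBCLASS_OF[name]
        if name in seen:  # guard against accidental cycles in the table
            break
        seen.add(name)
        chain.append(name)
    return chain
```

a flat child-to-parent table is probably enough for "minimal
subclassing"; anything richer (multiple parents, say) would need a
different structure.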

the main use case i have in mind for synonyms is indexed searching, in
particular being able to drill down into facets. consider searches on a
large heterogeneous body of documents. tika has been used to extract
the meta-data, which is then stored and indexed. my use case is
searching on date, so map to dc:date. this is a good search that
should pick up all manner of documents which had all manner of
changes on that date. when there are too many results for me to
browse, i want to be able to drill down into the subclasses which have
members in the documents retrieved. this allows both general searches
(on top level synonyms) and more precise ones (lower level classes).
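
the drill down described above might look something like this rough
python sketch. it assumes documents are plain dicts of property name
to list of values, and that the general dc:date values have already
been stored alongside the specific ones; the `ole2:` names,
`search` and `facets` are all hypothetical, not part of tika or any
index library.

```python
from collections import Counter

# hypothetical subclasses of dc:date (names invented for this sketch)
DATE_SUBCLASSES = ("ole2:last-saved", "ole2:last-printed")


def search(docs, field, value):
    """General search: every document carrying value in the given field."""
    return [doc for doc in docs if value in doc.get(field, [])]


def facets(results, subclasses, value):
    """Count, per subclass, how many retrieved documents carry value there."""
    counts = Counter()
    for doc in results:
        for name in subclasses:
            if value in doc.get(name, []):
                counts[name] += 1
    return counts


docs = [
    {"dc:date": ["2008-12-07"], "ole2:last-saved": ["2008-12-07"]},
    {"dc:date": ["2008-12-07"], "ole2:last-printed": ["2008-12-07"]},
    {"dc:date": ["2008-12-01"], "ole2:last-saved": ["2008-12-01"]},
]

hits = search(docs, "dc:date", "2008-12-07")
drill_down = facets(hits, DATE_SUBCLASSES, "2008-12-07")
```

only subclasses with at least one member in the results show up in
`drill_down`, which is what keeps the facet list short.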

for indexing, every time the index is run the synonyms would need to
be used to generate derived meta-data from the original set.
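
a sketch of that derivation step in python, under the same
assumptions as before: metadata is a dict of property name to list of
values, and the subclass table is a hypothetical child-to-parent
mapping rather than anything tika provides.

```python
def derive(metadata, subclass_of):
    """Return a copy of metadata with values copied up to each more general property."""
    derived = {key: list(values) for key, values in metadata.items()}
    for key, values in metadata.items():
        parent = subclass_of.get(key)
        seen = {key}
        # walk up the chain, stopping if the table contains a cycle
        while parent is not None and parent not in seen:
            seen.add(parent)
            bucket = derived.setdefault(parent, [])
            for value in values:
                if value not in bucket:  # avoid duplicate derived values
                    bucket.append(value)
            parent = subclass_of.get(parent)
    return derived


# hypothetical property names, for illustration only
subclass_of = {"ole2:last-saved": "dc:date", "ole2:last-printed": "dc:date"}
original = {
    "ole2:last-saved": ["2008-12-07"],
    "ole2:last-printed": ["2008-12-01"],
}
indexed = derive(original, subclass_of)
```

the original set is left untouched, so the derived fields can be
regenerated whenever the synonyms change.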

- robert
