Re: Extract thumbnail from openxml office files

Mattmann, Chris A (398J) Thu, 09 Jan 2014 06:06:23 -0800

Hi Hong-Thai,

+1 to using cardinality to help denote more complex metadata relationships
at least until we get past prior discussions on Metadata and name spacing.


See the wiki here for some earlier thoughts:
http://wiki.apache.org/tika/MetadataDiscussion


I know our met structure is simple. It was purposefully designed that way:
even though very complex and hierarchical metadata structures existed at the
time and could have been leveraged, we chose a simple approach instead,
e.g., key "multi-"value (note the distinction from key "value").
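
As a rough illustration of the key "multi-"value idea, here is a minimal Java
sketch in which cardinality (the position of a value under its key) ties
related values together. It is not Tika's Metadata class; the class name
MultiValueMetadataSketch and the key names "Author:FirstName" and
"Author:LastName" are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a flat key -> multi-value store (not Tika's API).
public class MultiValueMetadataSketch {
    // Each key maps to an ordered list of values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append a value under the key, keeping insertion order.
    public void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Return all values for the key, or an empty list if the key is absent.
    public List<String> getValues(String key) {
        return values.getOrDefault(key, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        MultiValueMetadataSketch met = new MultiValueMetadataSketch();
        // Assumed convention: the i-th first name belongs with the i-th last name.
        met.add("Author:FirstName", "Ada");
        met.add("Author:LastName", "Lovelace");
        met.add("Author:FirstName", "Alan");
        met.add("Author:LastName", "Turing");

        List<String> first = met.getValues("Author:FirstName");
        List<String> last = met.getValues("Author:LastName");
        for (int i = 0; i < first.size(); i++) {
            System.out.println(first.get(i) + " " + last.get(i));
        }
    }
}
```

The trade-off is that the pairing is only a convention: nothing in the map
itself stops the two lists from getting out of step.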

Thanks!

Cheers,
Chris



-----Original Message-----
From: Hong-Thai Nguyen <hong-thai.ngu...@polyspot.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Thursday, January 9, 2014 8:36 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: RE: Extract thumbnail from openxml office files

>Hi Nick,
>You're beginning a very interesting topic about the foundation of our
>metadata concept :)
>I agree with you that metadata is not the best place to store the thumbnail
>result. Until now, our metadata has been a simple map of key:values. This
>structure is not really flexible in some cases. For example, we might want
>to store authors' information, where each author has a first name and a
>last name. Ideally, we could have something like a struct:
>Person:
>       FirstName
>       LastName
>
>Another example is our future thumbnail. We could have a 'thumbnail'
>metadata entry with a hierarchical structure like:
>Thumbnail:
>       Dimension
>               Width
>               Length
>       MimeType
>       Extension
>       Pages
>       Description
>
>That would need a huge refactoring of our core model. Another solution is
>to keep the thumbnail result as a list, List<byte[]>, instead of a single
>value, where each element is the thumbnail of one page. If the list has
>only 1 element, that means there is only a thumbnail of the first page.
>
>Hong-Thai
>
>-----Original Message-----
>From: Nick Burch [mailto:apa...@gagravarr.org]
>Sent: Thursday, January 9, 2014 12:11
>To: dev@tika.apache.org
>Subject: RE: Extract thumbnail from openxml office files
>
>On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
>> By searching on issues, I found the issue already created:
>> https://issues.apache.org/jira/browse/TIKA-90
>
>I'm not sure if the metadata is the right place to return this. Some
>formats offer a small thumbnail, others can offer a small thumbnail for
>every page, and at least one can include a full-size image of the first
>page.
>
>Would we not be better off exposing these embedded renderings via the
>existing embedded resources handling, with some sort of handy way to
>identify what something is (e.g. this is a full-size PNG of page 1, this is
>a jpg thumbnail of page 3)?
>
>Nick
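
To make the "handy way to identify what something is" idea from the quoted
message concrete, here is a small Java sketch of a descriptor that travels
with each embedded rendering. It is only an illustration under assumptions:
the names EmbeddedRenderingSketch, Rendering and Kind are hypothetical and
are not part of Tika's embedded resource handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each embedded rendering carries its own description.
public class EmbeddedRenderingSketch {
    enum Kind { THUMBNAIL, FULL_SIZE }

    static final class Rendering {
        final Kind kind;        // thumbnail or full-size image
        final int page;         // 1-based page number the image depicts
        final String mimeType;  // e.g. "image/png"
        final byte[] data;      // raw image bytes

        Rendering(Kind kind, int page, String mimeType, byte[] data) {
            this.kind = kind;
            this.page = page;
            this.mimeType = mimeType;
            this.data = data;
        }

        // Human-readable summary of what this rendering is.
        String describe() {
            String size = (kind == Kind.FULL_SIZE) ? "full-size" : "thumbnail";
            return size + " " + mimeType + " of page " + page;
        }
    }

    public static void main(String[] args) {
        List<Rendering> renderings = new ArrayList<>();
        // Empty byte arrays stand in for real image data.
        renderings.add(new Rendering(Kind.FULL_SIZE, 1, "image/png", new byte[0]));
        renderings.add(new Rendering(Kind.THUMBNAIL, 3, "image/jpeg", new byte[0]));
        for (Rendering r : renderings) {
            System.out.println(r.describe());
        }
    }
}
```

Compared with a bare List<byte[]>, a descriptor like this also covers formats
that offer one thumbnail per page or a full-size image of the first page.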
