[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983434#comment-13983434
 ] 

Hong-Thai Nguyen commented on TIKA-1283:
----------------------------------------

+1 from me for adding a thumbnail field to the metadata set.
- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). 
PowerPoint files always have an embedded JPEG thumbnail, but it is optional for 
docx & xlsx (present only when the user checks the 'save preview' option when 
saving the document).
- For OLE documents, see http://poi.apache.org/hpsf/thumbnails.html. You can 
get the thumbnail content from the POI API:
{code}
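import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

// Returns the thumbnail of an OLE Word document as WMF bytes.
static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":" + summaryInformation.getTitle());
    // Wrap the raw thumbnail bytes stored in the summary information stream.
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    // Throws an HPSFException if the thumbnail is not stored as WMF.
    return thumbnail.getThumbnailAsWMF();
}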
{code}
Unfortunately, there is an open POI bug that prevents getting the thumbnail 
content properly: 
https://issues.apache.org/bugzilla/show_bug.cgi?id=56194
For docx, xlsx & OLE formats, the thumbnails are in WMF or EMF format, and these 
kinds of images are quite difficult to handle. But that is out of our scope.
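
For the OOXML case, here is a rough sketch of pulling the thumbnail item out of the archive with only the JDK zip classes. It assumes the thumbnail sits at the conventional "docProps/thumbnail.*" part name; a more robust version would probably resolve the thumbnail relationship in _rels/.rels instead. The class and method names (OoxmlThumbnailFinder, findThumbnail) are made up for illustration and are not Tika or POI API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Locale;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class OoxmlThumbnailFinder {

    // Assumption: thumbnails use the conventional "docProps/thumbnail.<ext>" part name
    // (jpeg for PowerPoint, often emf or wmf for docx and xlsx).
    private static final String THUMBNAIL_PREFIX = "docprops/thumbnail.";

    private OoxmlThumbnailFinder() {
    }

    /** Returns the raw bytes of the first thumbnail part found, or empty if there is none. */
    static Optional<byte[]> findThumbnail(Path ooxmlFile) throws IOException {
        try (ZipFile zip = new ZipFile(ooxmlFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Compare case-insensitively, since part names may differ in case.
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (name.startsWith(THUMBNAIL_PREFIX)) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        return Optional.of(in.readAllBytes());
                    }
                }
            }
        }
        // No preview was saved with the document (the usual case for docx & xlsx).
        return Optional.empty();
    }
}
```

Calling OoxmlThumbnailFinder.findThumbnail(path) returns an empty Optional when no preview was saved, so the caller can skip the thumbnail field without special-casing docx & xlsx.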


> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---------------------------------------------------------------
>
>                 Key: TIKA-1283
>                 URL: https://issues.apache.org/jira/browse/TIKA-1283
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: Tim Allison
>            Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded 
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each thumbnail image.  
> The consumer can then determine what to do with the embedded resource based 
> on the metadata.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
