[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983434#comment-13983434 ]
Hong-Thai Nguyen commented on TIKA-1283:
----------------------------------------

+1 from me for creating a thumbnail field in the metadata set.

- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). PowerPoint always embeds a thumbnail as JPEG, but it is optional for docx and xlsx (available only when the user checks the 'save preview' option when saving the document).
- For OLE documents, see: http://poi.apache.org/hpsf/thumbnails.html. You can get the thumbnail content from the POI API:

{code}
import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":" + summaryInformation.getTitle());
    // Wrap the raw thumbnail property so its clipboard format can be inspected
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    return thumbnail.getThumbnailAsWMF();
}
{code}

Unfortunately, there is an open bug in POI that prevents getting the thumbnail content properly: https://issues.apache.org/bugzilla/show_bug.cgi?id=56194

For docx, xlsx and OLE formats, the thumbnails are in WMF and EMF formats. These kinds of images are quite difficult to handle, but this is out of our scope.

> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---------------------------------------------------------------
>
>                 Key: TIKA-1283
>                 URL: https://issues.apache.org/jira/browse/TIKA-1283
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: Tim Allison
>            Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each thumbnail image.
> The consumer can then determine what to do with the embedded resource based
> on the metadata.

--
This message was sent by Atlassian JIRA (v6.2#6252)
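
The comment above notes that an OOXML thumbnail is simply an item inside the archive. A minimal sketch of that idea, using only `java.util.zip`, might look like the following. It assumes the thumbnail sits under the conventional `docProps/thumbnail` prefix (for example `docProps/thumbnail.jpeg`); a more careful implementation would resolve the package's thumbnail relationship instead of matching on the name. The class `OoxmlThumbnailSketch` and the helper `fakePackage` are hypothetical names used only for illustration, and this is not how Tika or POI implement it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class OoxmlThumbnailSketch {

    // Assumption: the thumbnail part is stored under this conventional prefix.
    static final String THUMBNAIL_PREFIX = "docProps/thumbnail";

    /** Returns the bytes of the first archive entry under the prefix, or null if none exists. */
    static byte[] findThumbnail(InputStream ooxml) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().startsWith(THUMBNAIL_PREFIX)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = zip.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return out.toByteArray();
                }
            }
        }
        return null;
    }

    /** Builds a tiny in-memory zip that mimics the archive layout (demo data, not a real document). */
    static byte[] fakePackage(boolean withThumbnail) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            if (withThumbnail) {
                zip.putNextEntry(new ZipEntry("docProps/thumbnail.jpeg"));
                zip.write(new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF});
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] thumb = findThumbnail(new ByteArrayInputStream(fakePackage(true)));
        System.out.println(thumb == null ? "no thumbnail" : "thumbnail bytes: " + thumb.length);
    }
}
```

Returning `null` when no entry matches mirrors the point made in the comment that the thumbnail is optional for docx and xlsx, so a caller has to handle its absence.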