[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983813#comment-13983813 ]

Tim Allison commented on TIKA-1283:
---

[~thaichat04], thank you, as always.

By "thumbnail," I'd also want to include images/icons of documents that are included only for display purposes. For example, the icon image (image1.emf) in test-documents/EmbeddedPDF.docx doesn't have a "relationship"=thumbnail, but I'd want to include that as a thumbnail because it appears as an within a .

The point you make about the differences in handling of these by application is right on. Each application links thumbnail images to the underlying data in different ways, and we'll have to go application by application to do this correctly (whether we go with this or TIKA-90).

I'm not held to the original proposal in this issue, and I like the clarity of TIKA-90 quite a bit.

Some other thoughts...the signature I proposed above won't work because a given image can have more than one thumbnail (at least for RTFs) and it misses metadata around the thumbnail image (such as mediaType of the thumbnail).

> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---
>
> Key: TIKA-1283
> URL: https://issues.apache.org/jira/browse/TIKA-1283
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Reporter: Tim Allison
> Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each thumbnail image.
> The consumer can then determine what to do with the embedded resource based
> on the metadata.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
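The two objections in the comment above (a given image can have more than one thumbnail, and a bare byte[] carries no metadata such as the media type) could be addressed by a richer return type. The following is only a minimal sketch under that assumption: ThumbnailInfo and ThumbnailRegistry are hypothetical names invented for illustration and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value type: the thumbnail bytes plus metadata about them.
final class ThumbnailInfo {
    private final String mediaType;
    private final byte[] data;

    ThumbnailInfo(String mediaType, byte[] data) {
        this.mediaType = mediaType;
        this.data = data.clone(); // defensive copy
    }

    String getMediaType() { return mediaType; }

    byte[] getData() { return data.clone(); }
}

// Hypothetical holder: one resource name maps to zero or more thumbnails.
final class ThumbnailRegistry {
    private final Map<String, List<ThumbnailInfo>> byResource = new LinkedHashMap<>();

    void add(String resourceName, ThumbnailInfo thumbnail) {
        byResource.computeIfAbsent(resourceName, k -> new ArrayList<>()).add(thumbnail);
    }

    // Returns an unmodifiable view; empty if the resource has no thumbnails.
    List<ThumbnailInfo> get(String resourceName) {
        List<ThumbnailInfo> found = byResource.get(resourceName);
        return found == null
                ? Collections.<ThumbnailInfo>emptyList()
                : Collections.unmodifiableList(found);
    }
}
```

Returning a list rather than a single array keeps the RTF case (several thumbnails for one image) representable, and the media type travels with each thumbnail instead of being guessed by the consumer.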
[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983434#comment-13983434 ]

Hong-Thai Nguyen commented on TIKA-1283:

+1 from me for creating a thumbnail field in the metadata set.

- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). PowerPoint always embeds a thumbnail in JPEG, but it is optional for docx and xlsx (available only when the user checks the 'save preview' option when saving the document).
- For OLE documents, see http://poi.apache.org/hpsf/thumbnails.html. You can get the thumbnail content from the POI API:

{code}
import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

public class ThumbnailExample {
    static byte[] process(File docFile) throws Exception {
        // Load the OLE2 Word document and read its summary property set.
        final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
        SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
        System.out.println(summaryInformation.getAuthor());
        System.out.println(summaryInformation.getApplicationName() + ":" + summaryInformation.getTitle());
        // Wrap the raw thumbnail bytes to inspect their clipboard format.
        Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
        System.out.println(thumbnail.getClipboardFormat());
        System.out.println(thumbnail.getClipboardFormatTag());
        // Returns the thumbnail bytes, assuming the thumbnail is stored as WMF.
        return thumbnail.getThumbnailAsWMF();
    }
}
{code}

Unfortunately, there is an open POI bug that prevents getting the thumbnail content properly: https://issues.apache.org/bugzilla/show_bug.cgi?id=56194

For docx, xlsx and OLE formats, the thumbnails are in WMF and EMF formats. These kinds of images are quite difficult to handle, but that is out of our scope.
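Because an OOXML file is a ZIP archive, the "item inside archive" mentioned above can be located with the JDK alone. The sketch below rests on one loudly stated assumption: that the thumbnail part is stored under the "docProps/thumbnail" prefix. A robust reader would resolve the part through the package relationships rather than by name. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class OoxmlThumbnailSketch {
    // Assumption: the thumbnail part lives under this name prefix.
    static final String ASSUMED_PREFIX = "docProps/thumbnail";

    /** Returns the bytes of the first matching entry, or null if none is found. */
    static byte[] findThumbnail(InputStream ooxml) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().startsWith(ASSUMED_PREFIX)) {
                    // Copy the current entry's content into memory.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                    return out.toByteArray();
                }
            }
        }
        return null;
    }
}
```

A null result is expected for docx and xlsx files saved without the preview option, so callers should treat it as "no thumbnail" rather than as an error.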
[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983191#comment-13983191 ]

Tim Allison commented on TIKA-1283:
---

Yes, I absolutely agree with the distinction. Is there a clean way of implementing that which wouldn't break too much? Perhaps treat thumbnails as very different from the regular .get(String/Property...) in Metadata:

{noformat}
byte[] tn = metadata.getThumbnailData()
{noformat}

One argument against this is that clients would then have to add the step of extracting thumbnails from the metadata, and EmbeddedResourceHandler would no longer pull everything as elegantly as it does now (if the user wants all attachments and thumbnails).

Let me look into how hard it will be to associate a thumbnail with an embedded resource. RTF is easy, but the microsoft/ooxml might be a bit messy.
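For comparison, the approach in the issue description (flagging each embedded thumbnail with "tika:thumbnail" and letting the consumer decide) keeps a single extraction path. The sketch below illustrates that idea with plain JDK types. EmbeddedItem and ThumbnailFilter are hypothetical stand-ins, not Tika classes, and the flag value "true" is an assumption, since the issue names only the key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an embedded resource and its metadata.
final class EmbeddedItem {
    final String name;
    final Map<String, String> metadata = new HashMap<>();

    EmbeddedItem(String name) {
        this.name = name;
    }
}

final class ThumbnailFilter {
    // Key proposed in the issue description; the "true" value is an assumption.
    static final String THUMBNAIL_KEY = "tika:thumbnail";

    static boolean isThumbnail(EmbeddedItem item) {
        return "true".equalsIgnoreCase(item.metadata.get(THUMBNAIL_KEY));
    }

    /** Returns only the items that are not flagged as thumbnails. */
    static List<EmbeddedItem> attachmentsOnly(List<EmbeddedItem> items) {
        List<EmbeddedItem> result = new ArrayList<>();
        for (EmbeddedItem item : items) {
            if (!isThumbnail(item)) {
                result.add(item);
            }
        }
        return result;
    }
}
```

With this design a client that wants everything simply skips the filter, while a client that wants only real attachments adds one call. The cost is that the thumbnail is not tied to the resource it depicts, which is the association problem raised in the comment above.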
[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983167#comment-13983167 ]

Jukka Zitting commented on TIKA-1283:
---

I'm not sure it's a good idea to extract thumbnail images as regular embedded resources. A thumbnail is not "part of" the document in the same way as an embedded image or an attached file. Instead, a thumbnail is used to "describe" or "visualize" a document, and thus would IMHO be better expressed as part of the document metadata, as suggested in TIKA-90.
[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983163#comment-13983163 ]

Tim Allison commented on TIKA-1283:
---

I look forward to feedback on this issue. I think there is a fairly clear distinction between thumbnail and attached image, but this might get murky.

On specific document types, there are some issues:
* RTF is easy
* ooxml now has a literal "thumbnail", but there are also the emf and wmf files that do not have a literal thumbnail "relationship"...how do we handle these?
* pre-ooxml office...haven't dug deeply yet, but thumbnails there are emf and wmf...no?
* PDF...I'd also like to be able to distinguish between attached image files and embedded image files (TIKA-1268), but this is better handled as a separate issue?
* other formats??

> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---
>
> Key: TIKA-1283
> URL: https://issues.apache.org/jira/browse/TIKA-1283
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Reporter: Tim Allison
> Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each embedded document.
> The consumer can then determine what to do with the embedded resource.

--
This message was sent by Atlassian JIRA
(v6.2#6252)