[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983813#comment-13983813
 ] 

Tim Allison commented on TIKA-1283:
---

[~thaichat04], thank you, as always.  By "thumbnail," I'd also want to include 
images/icons of documents that are included only for display purposes.  For 
example, the icon image (image1.emf) in test-documents/EmbeddedPDF.docx doesn't 
have a "relationship"=thumbnail, but I'd want to include that as a thumbnail 
because it appears as an  within a .  

The point you make about the differences in handling of these by application is 
right on.  Each application links thumbnail images to the underlying data in 
different ways, and we'll have to go application by application to do this 
correctly (whether we go with this or TIKA-90).

I'm not wedded to the original proposal in this issue, and I like the clarity 
of TIKA-90 quite a bit.  Some other thoughts: the signature I proposed above 
won't work because a given image can have more than one thumbnail (at least for 
RTFs), and it omits metadata about the thumbnail image (such as the mediaType 
of the thumbnail). 
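
The two shortcomings noted above (several thumbnails per resource, and 
per-thumbnail metadata such as the media type) suggest a richer shape than a 
single byte[].  A rough sketch follows; ThumbnailEntry and ThumbnailSet are 
hypothetical names, not existing Tika classes, and nothing here reflects a 
decided API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one thumbnail plus the metadata that describes it. */
final class ThumbnailEntry {
    final String mediaType; // e.g. "image/jpeg"; assumed to be supplied by the parser
    private final byte[] data;

    ThumbnailEntry(String mediaType, byte[] data) {
        this.mediaType = mediaType;
        this.data = data.clone(); // defensive copy so callers cannot mutate it
    }

    byte[] getData() {
        return data.clone();
    }
}

/** Hypothetical collection: one resource may carry zero or more thumbnails. */
final class ThumbnailSet {
    private final List<ThumbnailEntry> entries = new ArrayList<>();

    void add(ThumbnailEntry entry) {
        entries.add(entry);
    }

    /** Read-only view of the thumbnails in insertion order. */
    List<ThumbnailEntry> all() {
        return Collections.unmodifiableList(entries);
    }
}
```

Keeping the media type next to the bytes lets a consumer pick the rendition it 
can display without having to sniff each image.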

> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---
>
> Key: TIKA-1283
> URL: https://issues.apache.org/jira/browse/TIKA-1283
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Reporter: Tim Allison
> Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded 
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each thumbnail image.  
> The consumer can then determine what to do with the embedded resource based 
> on the metadata.
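
A minimal sketch of how a consumer might act on such a flag.  It does not use 
Tika's real Metadata class; a plain Map<String, String> stands in for it, and 
the "true" value convention, the ThumbnailRouter class, and its method names 
are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical consumer-side routing; a Map stands in for Tika's Metadata. */
class ThumbnailRouter {
    // Key proposed in this issue; the "true" value convention is an assumption.
    static final String THUMBNAIL_KEY = "tika:thumbnail";

    final List<String> thumbnails = new ArrayList<>();
    final List<String> attachments = new ArrayList<>();

    /** Returns true when the embedded resource's metadata marks it as a thumbnail. */
    static boolean isThumbnail(Map<String, String> metadata) {
        return "true".equalsIgnoreCase(metadata.get(THUMBNAIL_KEY));
    }

    /** Files the named resource under thumbnails or attachments based on its metadata. */
    void accept(String resourceName, Map<String, String> metadata) {
        if (isThumbnail(metadata)) {
            thumbnails.add(resourceName);
        } else {
            attachments.add(resourceName);
        }
    }
}
```

A consumer that wants every embedded resource can ignore the flag, while one 
that only wants real attachments can skip anything routed to thumbnails.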



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties

2014-04-28 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983434#comment-13983434
 ] 

Hong-Thai Nguyen commented on TIKA-1283:


+1 from me for creating a thumbnail field in the metadata set.
- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). 
PowerPoint files always have an embedded JPEG thumbnail, but it is optional 
for docx and xlsx (available only when the user checks the 'save preview' 
option when saving the document).
- For OLE documents, see http://poi.apache.org/hpsf/thumbnails.html. You can 
get the thumbnail content from the POI API:
{code}
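import java.io.File;

import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

// Reads the thumbnail stored in an OLE2 Word document's summary information.
static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":"
            + summaryInformation.getTitle());
    // getThumbnail() returns the raw thumbnail property bytes.
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    // Throws an HPSFException if the thumbnail is not stored as WMF.
    return thumbnail.getThumbnailAsWMF();
}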
{code}
Unfortunately, there's an open POI bug that prevents getting the thumbnail 
content properly: 
https://issues.apache.org/bugzilla/show_bug.cgi?id=56194
For docx, xlsx and OLE formats, the thumbnails are in WMF or EMF format, and 
these kinds of images are quite difficult to handle. But that is out of our 
scope.
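
Since thumbnails may arrive as JPEG, WMF or EMF depending on the container, a 
consumer may first want to know which one it has.  The sketch below guesses 
the format from well-known leading bytes.  ThumbnailFormatSniffer is a 
hypothetical helper, not part of Tika or POI; it assumes it is given the raw 
image bytes (without any clipboard header), and it recognizes only placeable 
WMF files, not WMF files that lack the placeable header.

```java
/** Hypothetical helper: guesses a thumbnail's image format from its leading bytes. */
final class ThumbnailFormatSniffer {
    enum Format { JPEG, EMF, WMF, UNKNOWN }

    static Format sniff(byte[] b) {
        // JPEG: starts with the SOI marker FF D8.
        if (matches(b, 0, 0xFF, 0xD8)) return Format.JPEG;
        // Placeable WMF: magic number 0x9AC6CDD7, stored little-endian.
        if (matches(b, 0, 0xD7, 0xCD, 0xC6, 0x9A)) return Format.WMF;
        // EMF: first record type is 1, and the " EMF" signature sits at offset 40.
        if (matches(b, 0, 0x01, 0x00, 0x00, 0x00)
                && matches(b, 40, 0x20, 0x45, 0x4D, 0x46)) return Format.EMF;
        return Format.UNKNOWN;
    }

    /** True when the bytes at the given offset equal the expected unsigned values. */
    private static boolean matches(byte[] b, int offset, int... expected) {
        if (b == null || b.length < offset + expected.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if ((b[offset + i] & 0xFF) != expected[i]) return false;
        }
        return true;
    }
}
```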




[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983191#comment-13983191
 ] 

Tim Allison commented on TIKA-1283:
---

Yes, I absolutely agree with the distinction.  Is there a clean way of 
implementing it that wouldn't break too much?

Perhaps we could treat thumbnails very differently from the regular 
.get(String/Property...) calls in Metadata:
{noformat} 
byte[] tn = metadata.getThumbnailData()
{noformat}

One argument against this is that clients would then have to add the step of 
extracting thumbnails from the metadata and EmbeddedResourceHandler would no 
longer pull everything as elegantly as it does now (if the user wants all 
attachments and thumbnails).

Let me look into how hard it will be to associate a thumbnail with an embedded 
resource.  RTF is easy, but the microsoft/ooxml might be a bit messy.





[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties

2014-04-28 Thread Jukka Zitting (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983167#comment-13983167
 ] 

Jukka Zitting commented on TIKA-1283:
---

I'm not sure it's a good idea to extract thumbnail images as regular 
embedded resources. A thumbnail is not a "part of" the document in the way 
that an embedded image or an attached file is. Instead, a thumbnail is used to 
"describe" or "visualize" a document, and thus would IMHO be better expressed 
as a part of document metadata as suggested in TIKA-90.



[jira] [Commented] (TIKA-1283) Add "thumbnail" as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983163#comment-13983163
 ] 

Tim Allison commented on TIKA-1283:
---

I look forward to feedback on this issue.  I think there is a fairly clear 
distinction between thumbnail and attached image, but this might get murky.

On specific document types, there are some issues:
* RTF is easy
* ooxml now has a literal "thumbnail", but there are also the emf and wmf files 
that do not have a literal thumbnail "relationship"...how do we handle these?
* pre-ooxml office...haven't dug deeply yet, but thumbnails there are emf and 
wmf...no?
* PDF...I'd also like to be able to distinguish between attached image files 
and embedded image files (TIKA-1268), but this is better handled as a separate 
issue?
* other formats??

> Add "thumbnail" as possible metadata item to TikaCoreProperties
> ---
>
> Key: TIKA-1283
> URL: https://issues.apache.org/jira/browse/TIKA-1283
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Tim Allison
>Priority: Minor
>
> TIKA-90 originally requested to add thumbnails to a document's metadata.
> I'd like to have a unified way of determining whether an embedded 
> document/resource is a thumbnail or a regular attachment.
> With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
> out more thumbnails than before.
> I propose adding "tika:thumbnail" to the metadata of each embedded document.  
> The consumer can then determine what to do with the embedded resource.


