[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler updated TIKA-1276:
    Attachment: TIKA-1276_20140428_rwesten.diff

Hi all,

I added a 2nd patch that
* (re-)enables the unit tests for the Tika OSGi bundle
* updates the Bundle Activator in tika-parsers to register a Detector and a Parser similar to those returned by Tika#getDetector() and Tika#getParser()
* makes the tests check both (1) usage of the Tika class AND (2) usage of the Detector and Parser registered as OSGi services by the Bundle Activator
* updates the tests to use the latest versions of Pax Exam (3.5) and Felix (4.4)
* does not add tests for different media types. Those should be added to BundleIT#testParser

This patch does not include the first one. It is based on a trunk version with the first patch already applied.

Missing embedded dependencies in tika-bundle

Key: TIKA-1276
URL: https://issues.apache.org/jira/browse/TIKA-1276
Project: Tika
Issue Type: Bug
Components: packaging
Affects Versions: 1.5
Environment: OSGI, Apache Felix via Apache Sling Launcher
Reporter: Rupert Westenthaler
Fix For: 1.6
Attachments: TIKA-1276_20140423_rwesten.diff, TIKA-1276_20140428_rwesten.diff

While updating from tika 1.2 to 1.5 I noticed that the `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.

1. `com.uwyn:jhighlight:1.0` is not embedded

Because of that, installing the bundle results in the following exception
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer)
    at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
    at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
    at java.lang.Thread.run(Thread.java:744)
{code}

2. `org.ow2.asm:asm:4.1` is not embedded because `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and therefore the `Embed-Dependency` directive `asm` does not match any dependency.

Because of that, one gets the following exception (after fixing (1))
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
    at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
    at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
    at java.lang.Thread.run(Thread.java:744)
{code}
There are two possibilities to fix this: (a) change the `Embed-Dependency` to `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the tika-bundle pom file.

3. `edu.ucar:netcdf:4.2-min` is not embedded

Because of that, one gets the following exception (after fixing (1) and (2))
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
    at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
    at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
    at java.lang.Thread.run(Thread.java:744)
{code}

4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime

After fixing the above
Calling OSGi experts - TIKA-1276 patch review
Hi All I know enough OSGi to be dangerous, but not enough to be sure of exactly what I should and shouldn't do... On TIKA-1276 we've got some suggested patches from Rupert Westenthaler which hopefully fix some Tika OSGi problems, as well as adding some more unit tests for the OSGi support. Any chance that someone who knows OSGi very well could review the patch, and either apply it (committer) or add a comment to the bug saying it's good to go (non-committer)? Thanks Nick
[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982874#comment-13982874 ]
Sergey Beryozkin commented on TIKA-1276:
Hi Rupert, I wonder whether we should take a completely different approach and avoid embedding altogether, since it is not very OSGi-friendly? Maybe not for 1.6, but for some major release like Tika 2.0... Sergey
Shared MIME info update
Hi all, I ran a diff of tika-mimetypes.xml against the latest Freedesktop shared MIME info DB release (http://cgit.freedesktop.org/xdg/shared-mime-info/). It seems they have diverged quite a lot. Do you see benefit in bringing them closer together again? Or is licensing in the way (I think they dual license with LGPL and AFL 2.0)? Thanks Matthias
Re: Shared MIME info update
On Mon, 28 Apr 2014, Matthias Krueger wrote:
> I ran a diff on tika-mimetypes.xml and the latest Freedesktop shared MIME info DB release (http://cgit.freedesktop.org/xdg/shared-mime-info/). It seems they have diverged quite a lot.

I don't think they've ever been the same. We use their XML format, but not their data. Our data comes from a mixture of places: initially the httpd mimetypes file, along with lots of bug reports, fixes etc. since then. We also support one or two types that they don't.

> Or is licensing in the way (I think they dual license with LGPL and AFL 2.0)?

They can take our nice work, but we can't take theirs. The Apache License v2 is largely a universal donor license. LGPL and friends are largely not - see http://www.apache.org/legal/resolved.html#category-x . (It's largely the same thing with OpenOffice - LibreOffice are welcome to take fixes from Apache OpenOffice, and they do, but AOO can only take LO fixes where the contributor explicitly allows their changes to be Apache licensed.)

Nick
[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982973#comment-13982973 ]
Oleg Tikhonov commented on TIKA-1276:
Environment: Win 7 x64; OSGi engine: Apache Felix
Without the patch I got: org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [7]: Unable to resolve 7.0: missing requirement [7.0] osgi.wiring.package; (osgi.wiring.package=javax.servlet)
Note: 7 here is a tika-bundle-1.6-SNAPSHOT.jar
With the patch: org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [8]: Unable to resolve 8.0: missing requirement [8.0] osgi.wiring.package; (osgi.wiring.package=javax.servlet)
Note: 8 here is a patched tika-bundle-1.6-SNAPSHOT.jar.
I.e. in both cases the bundle cannot start. Seems to be the same.
[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982995#comment-13982995 ]
Rupert Westenthaler commented on TIKA-1276:
Embedding dependencies is bad if those are also used by other bundles. The biggest dependencies of Tika are all dependencies of parsers (e.g. poi, pdf box ...). Most Tika users will not need those in other bundles. So having them embedded in tika-bundle is not an overhead.
Tika does embed some dependencies that are OSGi bundles
* commons-compress, and also its dependency xz is a bundle
* commons-codec
* apache-mime4j-core and apache-mime4j-dom
* xmlbeans-2.3.0: There are bundle versions available by org.apache.servicemix.bundles:org.apache.servicemix.bundles.xmlbeans - starting from version 2.4.
Those could be easily removed from the bundle.
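For readers following the embedding discussion: the `Embed-Dependency` instruction of the Felix maven-bundle-plugin selects dependencies by artifactId, which is why the clause `asm` never matched `asm-debug-all`. The fragment below is only a minimal sketch of fix (a) from the issue description, not the content of the attached patches. The artifactId list is abbreviated to the four artifacts named in the issue, and the use of `Embed-Transitive` is an assumption about how the tika-bundle pom is set up.

```xml
<!-- Sketch only: abbreviated maven-bundle-plugin instructions for tika-bundle/pom.xml. -->
<!-- All other entries already present in the real Embed-Dependency list are omitted.  -->
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- Clauses match on artifactId: "asm-debug-all" replaces the non-matching "asm" -->
      <Embed-Dependency>
        asm-debug-all, jhighlight, netcdf, xmpcore
      </Embed-Dependency>
      <!-- Assumption: dependencies pulled in transitively are embedded as well -->
      <Embed-Transitive>true</Embed-Transitive>
    </instructions>
  </configuration>
</plugin>
```

Fix (b) would instead leave the `asm` clause in place and add `org.ow2.asm:asm:4.1` as a direct dependency of tika-bundle, so that the existing clause has something to match.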
[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983011#comment-13983011 ]
Rupert Westenthaler commented on TIKA-1276:
[~olegt] I was getting the same error, so I added a configuration to import this package from the environment (see the SYS_PKG constant in BundleIT). That you get the error indicates that your environment cannot provide these packages.
Thinking again about it: There is no good reason why Tika should depend on those packages. Adding `javax.servlet;resolution:=optional, javax.servlet.http;resolution:=optional` instructions to the Import-Package also fixes this issue and is much more elegant, as it allows the tika bundle to be used in environments without a servlet engine. I will provide an updated patch.
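The optional imports proposed in this comment would look roughly like the fragment below inside the maven-bundle-plugin `<instructions>` element. This is a sketch rather than the text of the revised patch; the trailing `*`, which keeps all other computed imports, is an assumption about the rest of the instruction.

```xml
<!-- Sketch only: goes inside <instructions> of the maven-bundle-plugin configuration. -->
<Import-Package>
  javax.servlet;resolution:=optional,
  javax.servlet.http;resolution:=optional,
  *
</Import-Package>
```

With `resolution:=optional` the framework can resolve the bundle even when no other bundle exports these packages; code paths that actually touch the servlet API would then fail only when they are used.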
[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler updated TIKA-1276:
    Attachment: TIKA-1276_20140428_2_rwesten.diff

Attached a revised patch (TIKA-1276_20140428_2_rwesten.diff) that makes the `javax.servlet` API an optional dependency.

Missing embedded dependencies in tika-bundle

Key: TIKA-1276
URL: https://issues.apache.org/jira/browse/TIKA-1276
Project: Tika
Issue Type: Bug
Components: packaging
Affects Versions: 1.5
Environment: OSGI, Apache Felix via Apache Sling Launcher
Reporter: Rupert Westenthaler
Fix For: 1.6
Attachments: TIKA-1276_20140423_rwesten.diff, TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_rwesten.diff

While updating from tika 1.2 to 1.5 I noticed that the `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.

1. `com.uwyn:jhighlight:1.0` is not embedded

Because of that, installing the bundle results in the following exception
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer)
    at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
    at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
    at java.lang.Thread.run(Thread.java:744)
{code}

2. `org.ow2.asm:asm:4.1` is not embedded because `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and therefore the `Embed-Dependency` directive `asm` does not match any dependency.

Because of that, one gets the following exception (after fixing (1))
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
    at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
    at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
    at java.lang.Thread.run(Thread.java:744)
{code}
There are two possibilities to fix this: (a) change the `Embed-Dependency` to `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the tika-bundle pom file.

3. `edu.ucar:netcdf:4.2-min` is not embedded

Because of that, one gets the following exception (after fixing (1) and (2))
{code}
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
    at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
    at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
    at java.lang.Thread.run(Thread.java:744)
{code}

4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime

After fixing the above issues the tika-bundle started successfully. However, when extracting EXIF metadata from a JPEG image I got the following exception.
{code}
java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at
Re: [jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle
Hi Rupert, agree about `javax.servlet;resolution:=optional, javax.servlet.http;resolution:=optional`. Will check it out tomorrow. Thanks!
On Mon, Apr 28, 2014 at 4:44 PM, Rupert Westenthaler (JIRA) j...@apache.org wrote:
> Rupert Westenthaler updated TIKA-1276: Attachment: TIKA-1276_20140428_2_rwesten.diff
> Attached a revised patch (TIKA-1276_20140428_2_rwesten.diff) that makes the `javax.servlet` API an optional dependency
[jira] [Created] (TIKA-1281) Additional XML type: application/x-xml
Avi created TIKA-1281: - Summary: Additional XML type: application/x-xml Key: TIKA-1281 URL: https://issues.apache.org/jira/browse/TIKA-1281 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.5 Reporter: Avi Priority: Minor Fix For: 1.6 The following MediaType is not yet supported by Tika (not as a Media Type or an Alias): application/x-xml I am no Media-Type expert, but if someone here at Tika is, then I suggest looking into it and if he sees fit then add it to the Tika Registry. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1282) Additional Gzip types:
Avi created TIKA-1282: - Summary: Additional Gzip types: Key: TIKA-1282 URL: https://issues.apache.org/jira/browse/TIKA-1282 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.5 Reporter: Avi Priority: Minor Fix For: 1.6 I found several GZip mime types (which were supported by our group till we began using Tika) which aren't listed in the Tika registry. Now, I am not sure if they are legit or not, and I think that a Tika member will be able to investigate and decide if they should enter as mime types or aliases to gzip. These are the types: application/x-gunzip application/gzipped application/gzip-compressed gzip/document They can be found listed here: http://mimeapplication.net/x-gunzip -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1282) Additional Gzip types:
[ https://issues.apache.org/jira/browse/TIKA-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983074#comment-13983074 ] Nick Burch commented on TIKA-1282: -- Are these all aliases of the main gzip type, or do they actually refer to different kinds of things? Additional Gzip types: --- Key: TIKA-1282 URL: https://issues.apache.org/jira/browse/TIKA-1282 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.5 Reporter: Avi Priority: Minor Labels: gzip, mediaType, mime Fix For: 1.6
[jira] [Commented] (TIKA-1282) Additional Gzip types:
[ https://issues.apache.org/jira/browse/TIKA-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983079#comment-13983079 ] Avi commented on TIKA-1282: --- We used to handle them as aliases of the main GZ media type (where "we" is the crawler-commons group). But I am no expert, and we might have judged them wrong.
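To make the "alias of the main gzip type" option concrete, here is a minimal sketch of alias normalization in plain Java. It is not Tika's registry code: the class `GzipAliases`, its `normalize` method, and the choice of `application/gzip` as the canonical name are assumptions made for illustration only. The four alias strings are the ones listed in the issue.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical alias table for illustration; this is not Tika's registry. */
final class GzipAliases {
    // Assumed canonical name; the thread does not say which name Tika treats as the main gzip type.
    static final String CANONICAL = "application/gzip";

    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // The four candidate types reported in TIKA-1282.
        for (String alias : new String[] {
                "application/x-gunzip",
                "application/gzipped",
                "application/gzip-compressed",
                "gzip/document"}) {
            ALIASES.put(alias, CANONICAL);
        }
    }

    /** Returns the canonical type for a known alias, otherwise the trimmed, lower-cased input. */
    static String normalize(String mediaType) {
        String key = mediaType.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(normalize("application/x-gunzip"));
        System.out.println(normalize("text/plain"));
    }
}
```

If any of the four names turned out to describe a different kind of content, it would simply be left out of the alias table and registered as its own type instead.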
[jira] [Resolved] (TIKA-1281) Additional XML type: application/x-xml
[ https://issues.apache.org/jira/browse/TIKA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1281. -- Resolution: Fixed Added as an alias of application/xml in r1590667. Additional XML type: application/x-xml -- Key: TIKA-1281 URL: https://issues.apache.org/jira/browse/TIKA-1281 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.5 Reporter: Avi Priority: Minor Labels: mediaType, xml Fix For: 1.6
[jira] [Created] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
Tim Allison created TIKA-1283: - Summary: Add thumbnail as possible metadata item to TikaCoreProperties Key: TIKA-1283 URL: https://issues.apache.org/jira/browse/TIKA-1283 Project: Tika Issue Type: Improvement Components: metadata Reporter: Tim Allison Priority: Minor TIKA-90 originally requested to add thumbnails to a document's metadata. I'd like to have a unified way of determining whether an embedded document/resource is a thumbnail or a regular attachment. With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling out more thumbnails than before. I propose adding tika:thumbnail to the metadata of each embedded document. The consumer can then determine what to do with the embedded resource. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983163#comment-13983163 ]

Tim Allison commented on TIKA-1283:
---
I look forward to feedback on this issue. I think there is a fairly clear distinction between thumbnail and attached image, but this might get murky. On specific document types, there are some issues:
* RTF is easy
* ooxml now has a literal thumbnail, but there are also the emf and wmf files that do not have a literal thumbnail relationship...how do we handle these?
* pre-ooxml office...haven't dug deeply yet, but thumbnails there are emf and wmf...no?
* PDF...I'd also like to be able to distinguish between attached image files and embedded image files (TIKA-1268), but this is better handled as a separate issue?
* other formats??
[jira] [Updated] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1283: -- Description: TIKA-90 originally requested to add thumbnails to a document's metadata. I'd like to have a unified way of determining whether an embedded document/resource is a thumbnail or a regular attachment. With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling out more thumbnails than before. I propose adding tika:thumbnail to the metadata of each thumbnail image. The consumer can then determine what to do with the embedded resource based on the metadata. was: TIKA-90 originally requested to add thumbnails to a document's metadata. I'd like to have a unified way of determining whether an embedded document/resource is a thumbnail or a regular attachment. With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling out more thumbnails than before. I propose adding tika:thumbnail to the metadata of each embedded document. The consumer can then determine what to do with the embedded resource.
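The consumer side of this proposal could look roughly like the sketch below. It uses plain `Map<String, String>` objects in place of Tika's `Metadata` class so that it stays self-contained. The key name `tika:thumbnail` comes from the proposal; the value `"true"`, the class `ThumbnailFilter`, and both method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical consumer-side filter; a plain Map stands in for Tika's Metadata. */
final class ThumbnailFilter {
    // Key name from the proposal; the "true" value convention is an assumption.
    static final String THUMBNAIL_KEY = "tika:thumbnail";

    /** True only when the flag is present and set to "true" (case-insensitive). */
    static boolean isThumbnail(Map<String, String> metadata) {
        return Boolean.parseBoolean(metadata.get(THUMBNAIL_KEY));
    }

    /** Keeps only the embedded resources that are not flagged as thumbnails. */
    static List<Map<String, String>> attachmentsOnly(List<Map<String, String>> embedded) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> md : embedded) {
            if (!isThumbnail(md)) {
                result.add(md);
            }
        }
        return result;
    }
}
```

With a flag like this, a consumer that wants everything ignores the key, while one that wants only real attachments filters on it, which is the "consumer decides" behaviour the description asks for.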
[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983191#comment-13983191 ]

Tim Allison commented on TIKA-1283:
---
Yes, I absolutely agree with the distinction. Is there a clean way of implementing it that wouldn't break too much? Perhaps treat them as very different from the regular .get(String/Property...) in Metadata:

{noformat}
byte[] tn = metadata.getThumbnailData()
{noformat}

One argument against this is that clients would then have to add the step of extracting thumbnails from the metadata, and EmbeddedResourceHandler would no longer pull everything as elegantly as it does now (if the user wants all attachments and thumbnails). Let me look into how hard it will be to associate a thumbnail with an embedded resource. RTF is easy, but the microsoft/ooxml might be a bit messy.
[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983434#comment-13983434 ]

Hong-Thai Nguyen commented on TIKA-1283:

+1 from me for creating a thumbnail field in the metadata set.
- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). PowerPoint has always embedded a thumbnail as JPEG, but it is optional for docx and xlsx (available only when the user checks the 'save preview' option when saving the document).
- For OLE documents, see: http://poi.apache.org/hpsf/thumbnails.html. You can get the thumbnail content from the POI API:

{code}
// Imports needed by the snippet (POI HPSF and HWPF):
import java.io.File;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ": " + summaryInformation.getTitle());
    // Wrap the raw thumbnail bytes stored in the summary information stream.
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    return thumbnail.getThumbnailAsWMF();
}
{code}

Unfortunately, there's an open bug in POI that prevents getting the thumbnail content properly: https://issues.apache.org/bugzilla/show_bug.cgi?id=56194

For docx, xlsx and the OLE formats, the thumbnails are WMF or EMF images. These kinds of images are quite difficult to handle, but that is out of our scope.
[jira] [Issue Comment Deleted] (TIKA-1274) ENVI header parser
[ https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ann Burgess updated TIKA-1274:
--
Comment: was deleted

(was: Hey Chris, How is your week looking? Want to set a time to do a chat? I'm actually home sick today, out with a nasty cold that started yesterday. Later in the week might work best, so I'm lucid. AB On Mon, Apr 21, 2014 at 1:39 PM, Chris A. Mattmann (JIRA) -- -- Ann Bryant Burgess, PhD Postdoctoral Fellow Computer Science Department University of Southern California Viterbi School of Engineering Los Angeles, CA Alaska Science Center/USGS Anchorage, AK Cell: (585) 738-7549 Office: (907) 786-7059 Fax: (907) 786-7150 E-mail: anniebryant.burg...@gmail.com Office Address: 4210 University Dr., Anchorage, AK 99508-4626 --- )

ENVI header parser
--
Key: TIKA-1274
URL: https://issues.apache.org/jira/browse/TIKA-1274
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.5
Reporter: Ann Burgess
Assignee: Chris A. Mattmann
Labels: mime, newbie, parser, patch

I have written a parser that extracts text and metadata from ENVI header files, currently called at the command line as:

abryant:tika abryant$ java -classpath annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --metadata MOD09GA_test_header.hdr
Content-Encoding: ISO-8859-1
Content-Length: 818
Content-Type: application/envi.hdr
resourceName: MOD09GA_test_header.hdr

abryant:tika abryant$ java -classpath annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --text MOD09GA_test_header.hdr
ENVI
description = { GEO-TIFF File Imported into ENVI [Fri May 25 14:06:23 2012]}
samples = 2400
lines = 2400
bands = 7
header offset = 0
file type = ENVI Standard
data type = 2
interleave = bip
sensor type = Unknown
byte order = 0
map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 4.6331271653e+02, 4.6331271653e+02, , units=Meters}
projection info = {16, 6371007.2, 0.00, 0.0, 0.0, Sinusoidal, units=Meters}
coordinate system string = {PROJCS[Sinusoidal,GEOGCS[GCS_ELLIPSE_BASED_1,DATUM[D_ELLIPSE_BASED_1,SPHEROID[S_ELLIPSE_BASED_1,6371007.181,0.0]],PRIMEM[Greenwich,0.0],UNIT[Degree,0.0174532925199433]],PROJECTION[Sinusoidal],PARAMETER[False_Easting,0.0],PARAMETER[False_Northing,0.0],PARAMETER[Central_Meridian,0.0],UNIT[Meter,1.0]]}
wavelength units = Unknown
__

As a current non-certified committer, could someone enlighten me to the steps needed to submit this new parser for review. The parser is located in my directory structure as:
/users/annbryant/tika/tika/anniedev/src/main/java/edu/usc/sunset/abburgess/tika/EnviFileReader.class
My custom mimetypes.xml file is located at:
/Users/annbryant/TIKA/tika/anniedev/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1274) ENVI header parser
[ https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983500#comment-13983500 ] Ann Burgess commented on TIKA-1274: --- I've got the EnviHeaderParser and EnviHeaderParserTest (unit test) files now on github: https://github.com/abburgess/ENVIJava I've run the unit test successfully in maven. If this looks good, I will create a patch for review.
[jira] [Commented] (TIKA-1274) ENVI header parser
[ https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983523#comment-13983523 ]

Nick Burch commented on TIKA-1274:
--
Few quick bits:
* There are a few files in that git repo that wouldn't normally be there - e.g. .class files and a /target/ directory
* You seem to have some inconsistent indenting going on - IIRC Tika uses 4 spaces, no tabs

Secondly, you seem to be outputting the raw contents of the file as the textual part, but not parsing any parts of it into the metadata. At first glance (and I'm not an ENVI file format expert here!), I would've expected things like "samples = 2400" to get mapped onto some sort of suitable metadata key/value pair.

Are you able to dig out any documentation on the format of the ENVI header file? If so, we may be able to help suggest which bits of it may be best placed into the metadata object, which of those can use standard metadata keys, and which ones will need new metadata keys to be defined.
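As a rough illustration of the key/value mapping suggested in the comment above, the sketch below splits ENVI header text into fields. It is not the parser under review: the class `EnviHeaderSketch` and its `parse` method are hypothetical, and it assumes `key = value` lines where a value opened with `{` continues until a line that ends with `}` (nested braces are not handled). A real parser would then copy selected fields into Tika's metadata object.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits ENVI header text into key/value fields. */
final class EnviHeaderSketch {

    static Map<String, String> parse(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        String key = null;                  // key of the field currently being read
        StringBuilder value = new StringBuilder();
        for (String line : header.split("\\R")) {
            if (key == null) {
                int eq = line.indexOf('=');
                if (eq < 0) {
                    continue;               // skips the leading "ENVI" line and blank lines
                }
                key = line.substring(0, eq).trim();
                value.setLength(0);
                value.append(line.substring(eq + 1).trim());
            } else {
                // Continuation of a brace-delimited value that spans several lines.
                value.append(' ').append(line.trim());
            }
            String v = value.toString();
            // A value is complete unless it opened a brace that has not been closed yet.
            if (!v.startsWith("{") || v.endsWith("}")) {
                fields.put(key, v);
                key = null;
            }
        }
        // A brace that is never closed leaves its field out of the result.
        return fields;
    }
}
```

For the header quoted in the issue, this would yield entries such as `samples` mapped to `2400` and `interleave` mapped to `bip`, which are the kind of pairs that could be placed under metadata keys.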
[jira] [Commented] (TIKA-1274) ENVI header parser
[ https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983597#comment-13983597 ]

Ann Burgess commented on TIKA-1274:
---
Hi Nick,

Thank you for the git repo tips. I added the 'target' directory because I was mimicking the directory structure of the tika build - consider it removed. On that note, I'd appreciate any documentation on the dos and don'ts of building a git repo for Tika or other Apache projects... if such documentation exists.

As for the file contents, ENVI header files (http://www.exelisvis.com/docs/ENVIHeaderFiles.html) are plain text documents. The contents of an ENVI header file are, in fact, metadata for a corresponding data file, i.e. reading a file named some_file.img requires the corresponding file some_file.img.hdr. In other words, because the entire contents of a some_file.img.hdr file is metadata for some_file.img, the actual contents of the some_file.img.hdr file do NOT describe the .hdr file itself; rather, they describe the .img file. That is why I didn't think it appropriate to move parts of the 'raw content' into metadata. Does that make sense? I'm also very open to hearing how this sort of thing is normally treated, or to opening a conversation about how to treat one file type describing another file type.

Thanks for the input and any further suggestions.
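The pairing described in the comment above (some_file.img together with some_file.img.hdr) can be sketched as a small path helper. This only follows the naming convention stated in the comment; the class `EnviPairing` and its method names are hypothetical, and header files that do not follow the `<data file name>.hdr` convention are not covered.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for the some_file.img / some_file.img.hdr naming convention. */
final class EnviPairing {
    static final String HEADER_SUFFIX = ".hdr";

    /** Path of the header that describes the given data file. */
    static Path headerFor(Path dataFile) {
        return dataFile.resolveSibling(dataFile.getFileName() + HEADER_SUFFIX);
    }

    /** Path of the data file that the given header describes. */
    static Path dataFileFor(Path headerFile) {
        String name = headerFile.getFileName().toString();
        if (!name.endsWith(HEADER_SUFFIX)) {
            throw new IllegalArgumentException("Not a header file name: " + name);
        }
        return headerFile.resolveSibling(
                name.substring(0, name.length() - HEADER_SUFFIX.length()));
    }

    public static void main(String[] args) {
        System.out.println(headerFor(Paths.get("data", "some_file.img")));
    }
}
```

A mapping like this is one way a caller could attach the parsed header fields to the data file they describe rather than to the header file itself.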
[jira] [Comment Edited] (TIKA-1274) ENVI header parser
[ https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983597#comment-13983597 ]

Ann Burgess edited comment on TIKA-1274 at 4/28/14 11:10 PM:
-
Hi Nick,

Thank you for the git repo tips. I added the 'target' directory because I was mimicking the directory structure of the tika build - consider it removed. On that note, I'd appreciate any documentation on the dos and don'ts of building a git repo for Tika or other Apache projects... if such documentation exists.

As for the file contents, ENVI header files (http://www.exelisvis.com/docs/ENVIHeaderFiles.html) are plain text documents. The contents of an ENVI header file are, in fact, metadata for a corresponding data file, i.e. reading a file named some_file.img requires the corresponding file some_file.img.hdr. In other words, because the entire contents of a some_file.img.hdr file is metadata for some_file.img, the actual contents of the some_file.img.hdr file do NOT describe the .hdr file itself; rather, they describe the .img file. That is why I didn't think it appropriate to move parts of the 'raw content' into metadata. Does that make sense? I'm also very open to hearing how this sort of thing is normally treated, or to opening a conversation about how to treat one file type describing another file type.

Thanks for the input and any further suggestions.
[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983813#comment-13983813 ]

Tim Allison commented on TIKA-1283:
---
[~thaichat04], thank you, as always. By thumbnail, I'd also want to include images/icons of documents that are included only for display purposes. For example, the icon image (image1.emf) in test-documents/EmbeddedPDF.docx doesn't have a relationship=thumbnail, but I'd want to include that as a thumbnail because it appears as a v:shape within a w:object.

The point you make about the differences in handling of these by application is right on. Each application links thumbnail images to the underlying data in different ways, and we'll have to go application by application to do this correctly (whether we go with this or TIKA-90).

I'm not held to the original proposal in this issue, and I like the clarity of TIKA-90 quite a bit.

Some other thoughts...the signature I proposed above won't work because a given image can have more than one thumbnail (at least for RTFs) and it misses metadata around the thumbnail image (such as mediaType of the thumbnail).
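The two objections in the last paragraph (more than one thumbnail per resource, and no room for per-thumbnail metadata such as the media type) suggest a shape along the lines of the sketch below. Nothing here is an existing or proposed Tika API: `ThumbnailSketch`, `Thumbnail`, and `EmbeddedResource` are hypothetical names used only to show how a list of typed thumbnails differs from a single `byte[]` accessor.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical data model; not part of Tika. */
final class ThumbnailSketch {

    /** One thumbnail: the image bytes plus the media type that a bare byte[] cannot carry. */
    static final class Thumbnail {
        final String mediaType;
        private final byte[] data;

        Thumbnail(String mediaType, byte[] data) {
            this.mediaType = mediaType;
            this.data = data.clone();       // defensive copy
        }

        byte[] getData() {
            return data.clone();
        }
    }

    /** An embedded resource that may carry zero, one, or several thumbnails. */
    static final class EmbeddedResource {
        final String name;
        private final List<Thumbnail> thumbnails = new ArrayList<>();

        EmbeddedResource(String name) {
            this.name = name;
        }

        void addThumbnail(Thumbnail thumbnail) {
            thumbnails.add(thumbnail);
        }

        List<Thumbnail> getThumbnails() {
            return Collections.unmodifiableList(thumbnails);
        }
    }
}
```

Returning a list keeps the RTF case (several thumbnails for one resource) representable, and carrying the media type next to the bytes lets a consumer pick, for example, a JPEG over an EMF rendition.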
[jira] [Comment Edited] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983813#comment-13983813 ]

Tim Allison edited comment on TIKA-1283 at 4/29/14 12:47 AM:
---

[~thaichat04], thank you, as always.

By thumbnail, I'd also want to include images/icons of documents that are included only for display purposes. For example, the icon image (image1.emf) in test-documents/EmbeddedPDF.docx doesn't have a relationship=thumbnail, but I'd want to include that as a thumbnail because it appears as a v:shape within a w:object.

The point you make about the differences in handling of these by application is right on. Each application links thumbnail images to the underlying data in different ways, and we'll have to go application by application to do this correctly (whether we go with this or TIKA-90).

I'm not held to the original proposal in this issue, and I like the clarity of TIKA-90 quite a bit.

Some other thoughts...the signature I proposed above won't work because a given embedded resource can have more than one thumbnail (at least for RTFs) and it misses metadata around the thumbnail image (such as mediaType of the thumbnail).
[jira] [Comment Edited] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983813#comment-13983813 ]

Tim Allison edited comment on TIKA-1283 at 4/29/14 1:15 AM:
---

[~thaichat04], thank you, as always.

By thumbnail, I was thinking of container files that hold an image of an attachment for display purposes as well as the attachment data. For example, the icon image (image1.emf) in test-documents/EmbeddedPDF.docx...I'd want to include that as a thumbnail because it appears as a v:shape within a w:object.

I think what you're describing is a file that contains a thumbnail of itself. Both types of thumbnails (self- vs other-) could be handled equivalently in either TIKA-90 or this proposal...for now they are both treated as attachments along with the other traditional attachments.

The point you make about the differences in handling of these by application is right on. Each application links thumbnail images to the underlying data in different ways, and we'll have to go application by application to do this correctly (whether we go with this or TIKA-90).

I'm not held to the original proposal in this issue, and I like the clarity of TIKA-90 quite a bit.

Some other thoughts...the signature I proposed above won't work because a given embedded resource can have more than one thumbnail (at least for RTFs) and it misses metadata around the thumbnail image (such as mediaType of the thumbnail).
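The objection raised in the comment, that one embedded resource can carry several thumbnails and that each thumbnail needs its own metadata such as a media type, might be modeled roughly as follows. This is only a sketch: the class names EmbeddedResource and Thumbnail, the field names, and the sample media types are all hypothetical and do not come from Tika or from either proposal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model: one embedded resource holding zero or more thumbnails.
class EmbeddedResource {
    // Each thumbnail carries its own media type alongside its bytes.
    static class Thumbnail {
        final String mediaType;
        final byte[] data;

        Thumbnail(String mediaType, byte[] data) {
            this.mediaType = mediaType;
            this.data = data;
        }
    }

    final String name;
    private final List<Thumbnail> thumbnails = new ArrayList<>();

    EmbeddedResource(String name) {
        this.name = name;
    }

    void addThumbnail(Thumbnail thumbnail) {
        thumbnails.add(thumbnail);
    }

    // Read-only view, in insertion order.
    List<Thumbnail> getThumbnails() {
        return Collections.unmodifiableList(thumbnails);
    }

    public static void main(String[] args) {
        EmbeddedResource resource = new EmbeddedResource("embedded.pdf"); // made-up name
        resource.addThumbnail(new Thumbnail("image/png", new byte[0]));   // made-up media types
        resource.addThumbnail(new Thumbnail("image/jpeg", new byte[0]));
        for (Thumbnail t : resource.getThumbnails()) {
            System.out.println(resource.name + " has thumbnail of type " + t.mediaType);
        }
    }
}
```

A list-valued relationship like this is what a single-valued thumbnail signature cannot express, which is the limitation the comment points out.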