[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Rupert Westenthaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated TIKA-1276:
--

Attachment: TIKA-1276_20140428_rwesten.diff

Hi all

Added a 2nd patch that

* (re-)enables the unit tests for the Tika OSGi bundle
* updates the Bundle Activator in tika-parsers to register a Detector and 
Parser similar to those returned by Tika#getDetector() and 
Tika#getParser()
* makes the tests check both (1) usage of the Tika class AND (2) usage of the 
Detector and Parser registered as OSGi services by the Bundle Activator
* updates the tests to use the latest versions of Pax Exam (3.5) and Felix (4.4)

The patch does not add tests for additional media types. Those should 
be added to BundleIT#testParser.

This patch does not include the first one. It is based on a trunk version with 
the first patch already applied.



 Missing embedded dependencies in tika-bundle
 

 Key: TIKA-1276
 URL: https://issues.apache.org/jira/browse/TIKA-1276
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
 Environment: OSGI, Apache Felix via Apache Sling Launcher
Reporter: Rupert Westenthaler
 Fix For: 1.6

 Attachments: TIKA-1276_20140423_rwesten.diff, 
 TIKA-1276_20140428_rwesten.diff


 While updating from Tika 1.2 to 1.5 I noticed that the 
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
 1. `com.uwyn:jhighlight:1.0` is not embedded
 Because of that, installing the bundle results in the following exception:
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 2. `org.ow2.asm:asm:4.1` is not embedded because 
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1`, so 
 the `asm` entry in the `Embed-Dependency` directive does not match any 
 dependency. 
 Because of that, one gets the following exception (after fixing (1)):
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 There are two possibilities to fix this: (a) change the `Embed-Dependency` 
 entry to `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` 
 to the tika-bundle pom file.
 3. `edu.ucar:netcdf:4.2-min` is not embedded
 Because of that, one gets the following exception (after fixing (1) and 
 (2)):
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
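 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
 After fixing the above issues the tika-bundle started successfully. 
 However, when extracting EXIF metadata from a JPEG image I got a 
 `java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException`, thrown in 
 `com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader` 
 (called via `org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg`).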
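
Item 2 in the issue description comes down to a name mismatch: the `Embed-Dependency` directive names `asm`, while the artifact actually on the dependency list is called `asm-debug-all`. A minimal Java sketch of that check follows. It assumes clauses are compared to artifactIds by exact name only (the real maven-bundle-plugin accepts richer patterns, which are not modelled here), and `EmbedClauseCheck`, `unmatchedClauses` and the sample values are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika or the maven-bundle-plugin.
final class EmbedClauseCheck {

    // Returns the Embed-Dependency clauses that match no declared artifactId.
    // Simplification: exact name comparison only, no wildcards or negation.
    static List<String> unmatchedClauses(List<String> clauses, Set<String> artifactIds) {
        List<String> unmatched = new ArrayList<>();
        for (String clause : clauses) {
            if (!artifactIds.contains(clause)) {
                unmatched.add(clause);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Illustrative values based on the issue description.
        Set<String> declared = new HashSet<>(Arrays.asList("asm-debug-all", "tika-parsers"));
        List<String> clauses = Arrays.asList("asm", "tika-parsers");
        System.out.println("Clauses matching nothing: " + unmatchedClauses(clauses, declared));
    }
}
```

A clause reported here is silently skipped at build time, which is why the problem only shows up later as an unresolved `Import-Package` when the bundle is started.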

Calling OSGi experts - TIKA-1276 patch review

2014-04-28 Thread Nick Burch

Hi All

I know enough OSGi to be dangerous, but not enough to be sure of exactly 
what I should and shouldn't do...


On TIKA-1276 we've got some suggested patches from Rupert Westenthaler 
which hopefully fix some Tika OSGi problems, as well as adding some more 
unit tests for the OSGi support.


Any chance that someone who knows OSGi very well could review the patch, 
and either apply it (committer) or add a comment to the bug saying it's 
good to go (non-committer)?


Thanks
Nick



[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982874#comment-13982874
 ] 

Sergey Beryozkin commented on TIKA-1276:


Hi Rupert

I wonder whether we should take a completely different approach and avoid 
embedding altogether, since it is not very OSGi friendly?
Maybe not for 1.6 but for some major release like Tika 2.0...

Sergey


Shared MIME info update

2014-04-28 Thread Matthias Krueger


Hi all,

I ran a diff on tika-mimetypes.xml and the latest Freedesktop shared MIME 
info DB release (http://cgit.freedesktop.org/xdg/shared-mime-info/). It 
seems they have diverged quite a lot. Do you see benefit in bringing 
them closer together again? Or is licensing in the way (I think they 
dual license with LGPL and AFL 2.0)?


Thanks
Matthias




Re: Shared MIME info update

2014-04-28 Thread Nick Burch

On Mon, 28 Apr 2014, Matthias Krueger wrote:
I ran a diff on tika-mimetypes.xml and the latest Freedesktop shared MIME 
info DB release (http://cgit.freedesktop.org/xdg/shared-mime-info/). It 
seems they have diverged quite a lot.


I don't think they've ever been the same. We use their XML format, but not 
their data. Our data comes from a mixture of places, initially the httpd 
mimetypes file, along with lots of bug reports, fixes etc. since then. We 
also support one or two types that they don't.


Or is licensing in the way (I think they dual license with LGPL and AFL 
2.0)?


They can take our nice work, but we can't take theirs. The Apache License v2 
is largely a universal donor license. LGPL and friends are largely not - see 
http://www.apache.org/legal/resolved.html#category-x . (It's largely the 
same thing with OpenOffice - LibreOffice are welcome to take fixes from 
Apache OpenOffice, and they do, but AOO can only take LO fixes where the 
contributor explicitly allows their changes to be Apache licensed.)


Nick


[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Oleg Tikhonov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982973#comment-13982973
 ] 

Oleg Tikhonov commented on TIKA-1276:
-

Environment: Win 7 x64; OSGi engine: Apache Felix

Without the patch I got:
org.osgi.framework.BundleException: Unresolved constraint in bundle 
org.apache.tika.bundle [7]: Unable to resolve 7.0: missing requirement 
[7.0] osgi.wiring.package; (osgi.wiring.package=javax.servlet)
Note: bundle 7 here is tika-bundle-1.6-SNAPSHOT.jar.

With the patch I got:
org.osgi.framework.BundleException: Unresolved constraint in bundle 
org.apache.tika.bundle [8]: Unable to resolve 8.0: missing requirement [8.0] 
osgi.wiring.package; (osgi.wiring.package=javax.servlet)
Note: bundle 8 here is the patched tika-bundle-1.6-SNAPSHOT.jar.

I.e. in both cases the bundle cannot start, and the error seems to be the 
same.



[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982995#comment-13982995
 ] 

Rupert Westenthaler commented on TIKA-1276:
---

Embedding dependencies is bad if those are also used by other bundles. The 
biggest dependencies of Tika are all dependencies of parsers (e.g. POI, PDFBox 
...). Most Tika users will not need those in other bundles, so having them 
embedded in tika-bundle is not an overhead.

Tika does embed some dependencies that are already OSGi bundles:

* commons-compress and its dependency xz
* commons-codec
* apache-mime4j-core and apache-mime4j-dom
* xmlbeans-2.3.0: bundle versions are available as 
org.apache.servicemix.bundles:org.apache.servicemix.bundles.xmlbeans, starting 
from version 2.4.

Those could easily be removed from the bundle.
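
Whether a jar is already an OSGi bundle can be read from its manifest, since a bundle declares a `Bundle-SymbolicName` header. A minimal sketch of that check using only `java.util.jar` follows; the class name `BundleCheck` is hypothetical, and the jar paths are whatever is passed on the command line.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper, not part of Tika.
final class BundleCheck {

    // A manifest describes an OSGi bundle if it carries Bundle-SymbolicName.
    static boolean isOsgiBundle(Manifest manifest) {
        return manifest != null
                && manifest.getMainAttributes().getValue("Bundle-SymbolicName") != null;
    }

    // Opens the jar and inspects its META-INF/MANIFEST.MF (may be absent).
    static boolean isOsgiBundle(File jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar)) {
            return isOsgiBundle(jarFile.getManifest());
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + " -> OSGi bundle: " + isOsgiBundle(new File(path)));
        }
    }
}
```

Jars that pass this check are candidates for being imported as separate bundles instead of being embedded.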




[jira] [Commented] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Rupert Westenthaler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983011#comment-13983011
 ] 

Rupert Westenthaler commented on TIKA-1276:
---

[~olegt] I was getting the same error, so I added a configuration to import this 
package from the environment (see the SYS_PKG constant in BundleIT). That you 
get the error indicates that your environment cannot provide these packages.

Thinking about it again: there is no good reason why Tika should depend on 
those packages. Adding the

  javax.servlet;resolution:=optional,
  javax.servlet.http;resolution:=optional,

instructions to the Import-Package also fixes this issue and is much more 
elegant, as it allows the Tika bundle to be used in environments without a 
servlet engine.

I will provide an updated patch.
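
Marking an import optional only lets the bundle resolve without the package. Any code path that touches `javax.servlet` classes still has to tolerate their absence at runtime. A minimal sketch of such a guard in plain Java follows. `OptionalPackageProbe` is a hypothetical helper (not part of Tika), and `javax.servlet.Servlet` is used only as an example class name.

```java
// Hypothetical helper, not part of Tika.
final class OptionalPackageProbe {

    // Returns true if the named class can be loaded by the given class loader.
    // The class is not initialized, so no static initializers run.
    static boolean isPresent(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Package not wired (optional import unresolved) or class unusable.
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = OptionalPackageProbe.class.getClassLoader();
        boolean servletApi = isPresent("javax.servlet.Servlet", loader);
        System.out.println("Servlet API available: " + servletApi);
    }
}
```

Inside an OSGi framework the loader would be the bundle's own class loader, so the probe reflects what the framework actually wired for that bundle.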


[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Rupert Westenthaler (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated TIKA-1276:
--

Attachment: TIKA-1276_20140428_2_rwesten.diff

Attached a revised patch (TIKA-1276_20140428_2_rwesten.diff) that makes the 
`javax.servlet` API an optional dependency


Re: [jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-04-28 Thread Oleg Tikhonov
Hi Rupert,
agree about
javax.servlet;resolution:=optional,
javax.servlet.http;resolution:=optional,

Will check it out tomorrow.

Thanks !!!


On Mon, Apr 28, 2014 at 4:44 PM, Rupert Westenthaler (JIRA) j...@apache.org
 wrote:


  [
 https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

 Rupert Westenthaler updated TIKA-1276:
 --

 Attachment: TIKA-1276_20140428_2_rwesten.diff

 Attached a revised patch (TIKA-1276_20140428_2_rwesten.diff) that makes
 the `javax.servlet` API an optional dependency

  Missing embedded dependencies in tika-bundle
  
 
  Key: TIKA-1276
  URL: https://issues.apache.org/jira/browse/TIKA-1276
  Project: Tika
   Issue Type: Bug
   Components: packaging
 Affects Versions: 1.5
  Environment: OSGI, Apache Felix via Apache Sling Launcher
 Reporter: Rupert Westenthaler
  Fix For: 1.6
 
  Attachments: TIKA-1276_20140423_rwesten.diff,
 TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_rwesten.diff
 
 
  While updating from tika 1.2 to 1.5 I noticed that the
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
  1. `com.uwyn:jhighlight:1.0` is not embedded
  Because of that installing the bundle results in the following exception
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement
 [103.0] osgi.wiring.package;
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement
 [103.0] osgi.wiring.package;
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
at
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at
 org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
  {code}
  2. `org.ow2.asm:asm:4.1` is not embedded because
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and
 therefore the `Embed-Dependency` directive `asm` does not match any
 dependency.
  Because of that one does get the following exception (after fixing (1))
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package;
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package;
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
at
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at
 org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
  {code}
  There are two possibilities to fix this: (a) change the
 `Embed-Dependency` to `asm-debug-all`, or (b) add a dependency on
 `org.ow2.asm:asm:4.1` to the tika-bundle pom file.
  3. `edu.ucar:netcdf:4.2-min` is not embedded
  Because of that one does get the following exception (after fixing (1)
 and (2))
  {code}
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
  org.osgi.framework.BundleException: Unresolved constraint in bundle
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
at
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
at
 org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
at
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
at java.lang.Thread.run(Thread.java:744)
  {code}
  4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
  After fixing the above issues, the tika-bundle started successfully.
 However, when extracting EXIF metadata from a JPEG image I got the following
 exception.
  {code}
  java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
at
 

[jira] [Created] (TIKA-1281) Additional XML type: application/x-xml

2014-04-28 Thread Avi (JIRA)
Avi created TIKA-1281:
-

 Summary: Additional XML type: application/x-xml
 Key: TIKA-1281
 URL: https://issues.apache.org/jira/browse/TIKA-1281
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.5
Reporter: Avi
Priority: Minor
 Fix For: 1.6


The following MediaType is not yet supported by Tika (not as a Media Type or an 
Alias): application/x-xml


I am no Media-Type expert, but if someone here at Tika is, then I suggest 
looking into it and if he sees fit then add it to the Tika Registry.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (TIKA-1282) Additional Gzip types:

2014-04-28 Thread Avi (JIRA)
Avi created TIKA-1282:
-

 Summary: Additional Gzip types: 
 Key: TIKA-1282
 URL: https://issues.apache.org/jira/browse/TIKA-1282
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.5
Reporter: Avi
Priority: Minor
 Fix For: 1.6


I found several GZip mime types (which were supported by our group till we 
began using Tika) which aren't listed in the Tika registry.


Now, I am not sure if they are legit or not, and I think that a Tika member 
will be able to investigate and decide if they should enter as mime types or 
aliases to gzip.

These are the types:
application/x-gunzip
application/gzipped
application/gzip-compressed
gzip/document



They can be found listed here:
http://mimeapplication.net/x-gunzip





[jira] [Commented] (TIKA-1282) Additional Gzip types:

2014-04-28 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983074#comment-13983074
 ] 

Nick Burch commented on TIKA-1282:
--

Are these all aliases of the main gzip type, or do they actually refer to 
different kinds of things?

 Additional Gzip types: 
 ---

 Key: TIKA-1282
 URL: https://issues.apache.org/jira/browse/TIKA-1282
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.5
Reporter: Avi
Priority: Minor
  Labels: gzip, mediaType, mime
 Fix For: 1.6


 I found several GZip mime types (which were supported by our group till we 
 began using Tika) which aren't listed in the Tika registry.
 Now, I am not sure if they are legit or not, and I think that a Tika member 
 will be able to investigate and decide if they should enter as mime types or 
 aliases to gzip.
 These are the types:
 application/x-gunzip
 application/gzipped
 application/gzip-compressed
 gzip/document
 They can be found listed here:
 http://mimeapplication.net/x-gunzip





[jira] [Commented] (TIKA-1282) Additional Gzip types:

2014-04-28 Thread Avi (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983079#comment-13983079
 ] 

Avi commented on TIKA-1282:
---

We used to handle them as aliases to the main GZ media type (Where we is the 
crawler-commons group).

But I am no expert, and we might have judged them wrong.
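
The alias handling described above could be sketched roughly as follows, in plain Java with no Tika dependency. The class name `GzipAliasSketch`, the method `normalize`, and the choice of `application/gzip` as the canonical name are assumptions made for illustration; this is not Tika's registry API, and whether the four types really are aliases is exactly the open question in this issue.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class GzipAliasSketch {
    // Hypothetical canonical name; the thread does not say which name is primary.
    static final String CANONICAL_GZIP = "application/gzip";

    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // The four candidate types listed in TIKA-1282.
        ALIASES.put("application/x-gunzip", CANONICAL_GZIP);
        ALIASES.put("application/gzipped", CANONICAL_GZIP);
        ALIASES.put("application/gzip-compressed", CANONICAL_GZIP);
        ALIASES.put("gzip/document", CANONICAL_GZIP);
    }

    /** Returns the canonical type for a known alias, otherwise the trimmed, lower-cased input. */
    static String normalize(String mediaType) {
        String key = mediaType.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(normalize("application/x-gunzip"));
        System.out.println(normalize("text/plain"));
    }
}
```

Types that are not in the table pass through unchanged, so a lookup like this would only affect the four names under discussion.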



[jira] [Resolved] (TIKA-1281) Additional XML type: application/x-xml

2014-04-28 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1281.
--

Resolution: Fixed

Added as an alias of application/xml in r1590667.

 Additional XML type: application/x-xml
 --

 Key: TIKA-1281
 URL: https://issues.apache.org/jira/browse/TIKA-1281
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.5
Reporter: Avi
Priority: Minor
  Labels: mediaType, xml
 Fix For: 1.6


 The following MediaType is not yet supported by Tika (not as a Media Type or 
 an Alias): application/x-xml
 I am no Media-Type expert, but if someone here at Tika is, then I suggest 
 looking into it and if he sees fit then add it to the Tika Registry.





[jira] [Created] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1283:
-

 Summary: Add thumbnail as possible metadata item to 
TikaCoreProperties
 Key: TIKA-1283
 URL: https://issues.apache.org/jira/browse/TIKA-1283
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Tim Allison
Priority: Minor


TIKA-90 originally requested to add thumbnails to a document's metadata.

I'd like to have a unified way of determining whether an embedded 
document/resource is a thumbnail or a regular attachment.

With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
out more thumbnails than before.

I propose adding tika:thumbnail to the metadata of each embedded document.  
The consumer can then determine what to do with the embedded resource.
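
A minimal sketch of how a consumer might act on such a flag, assuming the key is literally `tika:thumbnail` and that a value of `true` marks a thumbnail (the value convention is not specified in this issue). A plain `Map<String, String>` stands in for Tika's `Metadata` object, and `ThumbnailFlagSketch`, `isThumbnail` and `regularAttachments` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ThumbnailFlagSketch {
    // Key name taken from the proposal; the "true" value convention is an assumption.
    static final String THUMBNAIL_KEY = "tika:thumbnail";

    /** A resource counts as a thumbnail only if the flag is present and equals "true". */
    static boolean isThumbnail(Map<String, String> metadata) {
        return Boolean.parseBoolean(metadata.get(THUMBNAIL_KEY));
    }

    /** Returns the names of embedded resources that are not flagged as thumbnails. */
    static List<String> regularAttachments(Map<String, Map<String, String>> embedded) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : embedded.entrySet()) {
            if (!isThumbnail(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> embedded = new LinkedHashMap<>();
        Map<String, String> thumb = new LinkedHashMap<>();
        thumb.put(THUMBNAIL_KEY, "true");
        embedded.put("thumbnail.jpeg", thumb);
        embedded.put("report.xls", new LinkedHashMap<String, String>());
        System.out.println(regularAttachments(embedded));
    }
}
```

Resources without the key are treated as regular attachments, so existing consumers that ignore the flag would keep receiving everything.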





[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983163#comment-13983163
 ] 

Tim Allison commented on TIKA-1283:
---

I look forward to feedback on this issue.  I think there is a fairly clear 
distinction between thumbnail and attached image, but this might get murky.

On specific document types, there are some issues:
* RTF is easy
* ooxml now has a literal thumbnail, but there are also the emf and wmf files 
that do not have a literal thumbnail relationship...how do we handle these?
* pre-ooxml office...haven't dug deeply yet, but thumbnails there are emf and 
wmf...no?
* PDF...I'd also like to be able to distinguish between attached image files 
and embedded image files (TIKA-1268), but this is better handled as a separate 
issue?

* other formats??

 Add thumbnail as possible metadata item to TikaCoreProperties
 ---

 Key: TIKA-1283
 URL: https://issues.apache.org/jira/browse/TIKA-1283
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Tim Allison
Priority: Minor

 TIKA-90 originally requested to add thumbnails to a document's metadata.
 I'd like to have a unified way of determining whether an embedded 
 document/resource is a thumbnail or a regular attachment.
 With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
 out more thumbnails than before.
 I propose adding tika:thumbnail to the metadata of each embedded document.  
 The consumer can then determine what to do with the embedded resource.





[jira] [Updated] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1283:
--

Description: 
TIKA-90 originally requested to add thumbnails to a document's metadata.

I'd like to have a unified way of determining whether an embedded 
document/resource is a thumbnail or a regular attachment.

With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
out more thumbnails than before.

I propose adding tika:thumbnail to the metadata of each thumbnail image.  The 
consumer can then determine what to do with the embedded resource based on the 
metadata.

  was:
TIKA-90 originally requested to add thumbnails to a document's metadata.

I'd like to have a unified way of determining whether an embedded 
document/resource is a thumbnail or a regular attachment.

With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
out more thumbnails than before.

I propose adding tika:thumbnail to the metadata of each embedded document.  
The consumer can then determine what to do with the embedded resource.


 Add thumbnail as possible metadata item to TikaCoreProperties
 ---

 Key: TIKA-1283
 URL: https://issues.apache.org/jira/browse/TIKA-1283
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Tim Allison
Priority: Minor

 TIKA-90 originally requested to add thumbnails to a document's metadata.
 I'd like to have a unified way of determining whether an embedded 
 document/resource is a thumbnail or a regular attachment.
 With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
 out more thumbnails than before.
 I propose adding tika:thumbnail to the metadata of each thumbnail image.  
 The consumer can then determine what to do with the embedded resource based 
 on the metadata.





[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983191#comment-13983191
 ] 

Tim Allison commented on TIKA-1283:
---

Y, I absolutely agree with the distinction.  Is there a clean way of 
implementing that that wouldn't break too much?

Perhaps treat them as very different from the regular .get(String/Property...) 
in Metadata:
{noformat} 
byte[] tn = metadata.getThumbnailData()
{noformat}

One argument against this is that clients would then have to add the step of 
extracting thumbnails from the metadata and EmbeddedResourceHandler would no 
longer pull everything as elegantly as it does now (if the user wants all 
attachments and thumbnails).

Let me look into how hard it will be to associate a thumbnail with an embedded 
resource.  RTF is easy, but the microsoft/ooxml might be a bit messy.





[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983434#comment-13983434
 ] 

Hong-Thai Nguyen commented on TIKA-1283:


+1 for me to create a thumbnail field in metadata Set.
- For OOXML, the thumbnail is an item inside the archive (see TIKA-1223). 
PowerPoint always embeds a JPEG thumbnail, but it is optional for docx & xlsx 
(available only when the user checks the 'save preview' option when saving the 
document).
- For OLE documents, see: http://poi.apache.org/hpsf/thumbnails.html. You can 
get the thumbnail content from the POI API:
{code}
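import java.io.File;

import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hpsf.Thumbnail;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.AbstractWordUtils;

// Reads the thumbnail stored in the summary information of an OLE2 Word document.
static byte[] process(File docFile) throws Exception {
    final HWPFDocumentCore wordDocument = AbstractWordUtils.loadDoc(docFile);
    SummaryInformation summaryInformation = wordDocument.getSummaryInformation();
    System.out.println(summaryInformation.getAuthor());
    System.out.println(summaryInformation.getApplicationName() + ":" + summaryInformation.getTitle());
    // Wrap the raw thumbnail bytes so the clipboard format can be inspected.
    Thumbnail thumbnail = new Thumbnail(summaryInformation.getThumbnail());
    System.out.println(thumbnail.getClipboardFormat());
    System.out.println(thumbnail.getClipboardFormatTag());
    // Throws if the thumbnail is not stored as WMF.
    return thumbnail.getThumbnailAsWMF();
}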
{code}
Unfortunately, there's an open bug in POI that prevents getting the thumbnail 
content properly: 
https://issues.apache.org/bugzilla/show_bug.cgi?id=56194
For docx, xlsx & OLE formats, the thumbnails are in WMF & EMF formats. These 
kinds of images are quite difficult to handle, but that is out of our scope.




[jira] [Issue Comment Deleted] (TIKA-1274) ENVI header parser

2014-04-28 Thread Ann Burgess (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ann Burgess updated TIKA-1274:
--

Comment: was deleted

(was: Hey Chris,
How is your week looking? Want to set a time to do a chat?

I'm actually home sick today, out with a nasty cold that started yesterday.
 Later in the week might work best, so I'm lucid.
AB


On Mon, Apr 21, 2014 at 1:39 PM, Chris A. Mattmann (JIRA)




-- 
--
Ann Bryant Burgess, PhD

Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK

Cell:  (585) 738-7549
Office:  (907) 786-7059
Fax:  (907) 786-7150
E-mail: anniebryant.burg...@gmail.com
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
---
)

 ENVI header parser
 --

 Key: TIKA-1274
 URL: https://issues.apache.org/jira/browse/TIKA-1274
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.5
Reporter: Ann Burgess
Assignee: Chris A. Mattmann
  Labels: mime, newbie, parser, patch

 I have written a parser that extracts text and metadata from ENVI header 
 files, currently called at the command line as: 
 abryant:tika abryant$ java -classpath 
 annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
 org.apache.tika.cli.TikaCLI --metadata MOD09GA_test_header.hdr
Content-Encoding: ISO-8859-1
Content-Length: 818
Content-Type: application/envi.hdr
resourceName: MOD09GA_test_header.hdr
 abryant:tika abryant$ java -classpath 
 annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
 org.apache.tika.cli.TikaCLI --text MOD09GA_test_header.hdr
 ENVI
 description = {
   GEO-TIFF File Imported into ENVI [Fri May 25 14:06:23 2012]}
 samples = 2400
 lines   = 2400
 bands   = 7
 header offset = 0
 file type = ENVI Standard
 data type = 2
 interleave = bip
 sensor type = Unknown
 byte order = 0
 map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 
 4.6331271653e+02, 4.6331271653e+02, , units=Meters}
 projection info = {16, 6371007.2, 0.00, 0.0, 0.0, Sinusoidal, 
 units=Meters}
 coordinate system string = 
 {PROJCS[Sinusoidal,GEOGCS[GCS_ELLIPSE_BASED_1,DATUM[D_ELLIPSE_BASED_1,SPHEROID[S_ELLIPSE_BASED_1,6371007.181,0.0]],PRIMEM[Greenwich,0.0],UNIT[Degree,0.0174532925199433]],PROJECTION[Sinusoidal],PARAMETER[False_Easting,0.0],PARAMETER[False_Northing,0.0],PARAMETER[Central_Meridian,0.0],UNIT[Meter,1.0]]}
 wavelength units = Unknown
 __
 As a current non-certified committer, could someone enlighten me to the steps 
 needed to submit this new parser for review.  
 The parser is located in my directory structure as: 
 /users/annbryant/tika/tika/anniedev/src/main/java/edu/usc/sunset/abburgess/tika/EnviFileReader.class
 My custom mimetypes.xml file is located at: 
 /Users/annbryant/TIKA/tika/anniedev/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml





[jira] [Commented] (TIKA-1274) ENVI header parser

2014-04-28 Thread Ann Burgess (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983500#comment-13983500
 ] 

Ann Burgess commented on TIKA-1274:
---

I've got the EnviHeaderParser and EnviHeaderParserTest (unit test) files now on 
github: https://github.com/abburgess/ENVIJava

I've run the unit test successfully in maven. If this looks good, I will create 
a patch for review.



[jira] [Commented] (TIKA-1274) ENVI header parser

2014-04-28 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983523#comment-13983523
 ] 

Nick Burch commented on TIKA-1274:
--

Few quick bits:
 * There's a few files in that git repo that wouldn't normally be there - eg 
.class files and a /target/ directory
 * You seem to have some inconsistent indenting going on - IIRC Tika uses 4 
spaces no tabs

Secondly, you seem to be outputting the raw contents of the file as the textual 
part, but not doing any parsing of any parts into the metadata. At first glance 
(and I'm not an ENVI file format expert here!), I would've expected things like 
samples = 2400 to get mapped onto some sort of suitable metadata key/value 
pair

Are you able to dig out any documentation on the format of the ENVI header 
file? If so, we may be able to help suggest which bits of it may be best placed 
into the metadata object, and also what of that can use standard metadata keys 
+ which ones will need new metadata keys defining to be used
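
To illustrate the kind of mapping suggested above, here is a rough sketch that turns `key = value` lines of an ENVI header into key/value pairs, in plain Java with no Tika dependency. It assumes one field per line, with `{ ... }` values allowed to continue over following lines until the closing brace. `EnviHeaderSketch` and `parse` are hypothetical names, and the rules are inferred from the sample header in this issue rather than from the format documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EnviHeaderSketch {
    /** Parses "key = value" lines; brace values may span several lines. */
    static Map<String, String> parse(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        String openKey = null;                 // key of a brace value still being collected
        StringBuilder openValue = new StringBuilder();
        for (String raw : header.split("\\r?\\n")) {
            String line = raw.trim();
            if (openKey != null) {
                // Continuation of a multi-line { ... } value.
                openValue.append(' ').append(line);
                if (line.endsWith("}")) {
                    fields.put(openKey, openValue.toString());
                    openKey = null;
                }
                continue;
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue;                      // skips the "ENVI" magic line and blank lines
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            if (value.startsWith("{") && !value.endsWith("}")) {
                openKey = key;                 // value continues on the next line(s)
                openValue.setLength(0);
                openValue.append(value);
            } else {
                fields.put(key, value);
            }
        }
        return fields;                         // an unterminated brace value is dropped
    }

    public static void main(String[] args) {
        String header = "ENVI\nsamples = 2400\nlines   = 2400\nbands   = 7\n";
        System.out.println(parse(header));
    }
}
```

Each entry of the resulting map could then be copied into the metadata object under whichever key names are agreed on.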



[jira] [Commented] (TIKA-1274) ENVI header parser

2014-04-28 Thread Ann Burgess (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983597#comment-13983597
 ] 

Ann Burgess commented on TIKA-1274:
---

Hi Nick,

Thank you for the git repo tips.  I added the 'target' directory and I was
mimicking the directory structure of the tika build - consider it removed.
On that note, I'd appreciate any documentation on the dos and don'ts of
building a git repo for Tika or other Apache projects... if such
documentation exists.

As for the file contents, ENVI header
files <http://www.exelisvis.com/docs/ENVIHeaderFiles.html> are plain
text documents. The contents of the ENVI header files are, in
fact, metadata for a corresponding data file, i.e. to read a file named
some_file.img, it requires the corresponding file some_file.img.hdr.  In
other words, because the entire contents of a some_file.img.hdr file
is metadata for some_file.img, the actual contents of the some_file.img.hdr
file do NOT describe the .hdr file itself, rather they describe the .img
file.  That is why I didn't think it appropriate to move parts of the 'raw
content' into metadata.  Does that make sense?  I'm also very open to how
this sort of thing is normally treated or to open a conversation about the
topic of how to treat one file type describing another file type.

Thanks for the input and any further suggestions.
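
The pairing described above (some_file.img is read together with some_file.img.hdr) can be sketched as a simple name mapping. This assumes the header name is always the data file name plus `.hdr`, as in the example; `EnviPairSketch`, `headerFor` and `dataFileFor` are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class EnviPairSketch {
    /** Header path for a data file: some_file.img -> some_file.img.hdr (assumed convention). */
    static Path headerFor(Path dataFile) {
        return dataFile.resolveSibling(dataFile.getFileName() + ".hdr");
    }

    /** Data file described by a header: some_file.img.hdr -> some_file.img. */
    static Path dataFileFor(Path header) {
        String name = header.getFileName().toString();
        if (!name.endsWith(".hdr")) {
            throw new IllegalArgumentException("not an ENVI header name: " + name);
        }
        return header.resolveSibling(name.substring(0, name.length() - ".hdr".length()));
    }

    public static void main(String[] args) {
        System.out.println(headerFor(Paths.get("data", "some_file.img")));
        System.out.println(dataFileFor(Paths.get("data", "some_file.img.hdr")));
    }
}
```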








-- 
--
Ann Bryant Burgess, PhD

Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK

Cell:  (585) 738-7549
Office:  (907) 786-7059
Fax:  (907) 786-7150
E-mail: anniebryant.burg...@gmail.com
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
---




[jira] [Comment Edited] (TIKA-1274) ENVI header parser

2014-04-28 Thread Ann Burgess (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983597#comment-13983597
 ] 

Ann Burgess edited comment on TIKA-1274 at 4/28/14 11:10 PM:
-

Hi Nick,

Thank you for the git repo tips.  I added the 'target' directory and I was
mimicking the directory structure of the tika build - consider it removed.
On that note, I'd appreciate any documentation on the dos and don'ts of
building a git repo for Tika or other Apache projects... if such
documentation exists.

As for the file contents, ENVI header
files <http://www.exelisvis.com/docs/ENVIHeaderFiles.html> are plain
text documents. The contents of the ENVI header files are, in
fact, metadata for a corresponding data file, i.e. to read a file named
some_file.img, it requires the corresponding file some_file.img.hdr.  In
other words, because the entire contents of a some_file.img.hdr file
is metadata for some_file.img, the actual contents of the some_file.img.hdr
file do NOT describe the .hdr file itself, rather they describe the .img
file.  That is why I didn't think it appropriate to move parts of the 'raw
content' into metadata.  Does that make sense?  I'm also very open to how
this sort of thing is normally treated or to open a conversation about the
topic of how to treat one file type describing another file type.
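
To make the header/data relationship concrete, here is a minimal sketch,
assuming the "key = value" layout seen in the sample header quoted in the
issue description, with brace-delimited values that may span lines. The names
EnviHeaderSketch, dataFileFor and parseFields are hypothetical. This is not
the parser attached to this issue and it uses no Tika API, only the Java
standard library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not the parser submitted in TIKA-1274. */
final class EnviHeaderSketch {

    /** Assumes the naming convention from the comment: some_file.img.hdr describes some_file.img. */
    static String dataFileFor(String headerName) {
        return headerName.endsWith(".hdr")
                ? headerName.substring(0, headerName.length() - ".hdr".length())
                : headerName;
    }

    /** Collects "key = value" fields; a value opened with '{' runs until a line ending in '}'. */
    static Map<String, String> parseFields(String headerText) {
        Map<String, String> fields = new LinkedHashMap<>();
        String key = null;                       // key of a brace value still being read
        StringBuilder value = new StringBuilder();
        for (String line : headerText.split("\\r?\\n")) {
            if (key == null) {
                int eq = line.indexOf('=');
                if (eq < 0) {
                    continue;                    // skips the leading "ENVI" line and blank lines
                }
                key = line.substring(0, eq).trim();
                value.setLength(0);
                value.append(line.substring(eq + 1).trim());
            } else {
                value.append(' ').append(line.trim());
            }
            String v = value.toString();
            // The field is complete unless it opened a brace that has not been closed yet.
            if (!v.startsWith("{") || v.endsWith("}")) {
                fields.put(key, v);
                key = null;
            }
        }
        return fields;
    }
}
```

Under this sketch every field parsed from some_file.img.hdr would be attached
to some_file.img rather than to the header itself, which is the open design
question raised above.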

Thanks for the input and any further suggestions.













 ENVI header parser
 --

 Key: TIKA-1274
 URL: https://issues.apache.org/jira/browse/TIKA-1274
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.5
Reporter: Ann Burgess
Assignee: Chris A. Mattmann
  Labels: mime, newbie, parser, patch

 I have written a parser that extracts text and metadata from ENVI header 
 files, currently called at the command line as: 
 abryant:tika abryant$ java -classpath 
 annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
 org.apache.tika.cli.TikaCLI --metadata MOD09GA_test_header.hdr
Content-Encoding: ISO-8859-1
Content-Length: 818
Content-Type: application/envi.hdr
resourceName: MOD09GA_test_header.hdr
 abryant:tika abryant$ java -classpath 
 annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
 org.apache.tika.cli.TikaCLI --text MOD09GA_test_header.hdr
 ENVI
 description = {
   GEO-TIFF File Imported into ENVI [Fri May 25 14:06:23 2012]}
 samples = 2400
 lines   = 2400
 bands   = 7
 header offset = 0
 file type = ENVI Standard
 data type = 2
 interleave = bip
 sensor type = Unknown
 byte order = 0
 map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 
 4.6331271653e+02, 4.6331271653e+02, , units=Meters}
 projection info = {16, 6371007.2, 0.00, 0.0, 0.0, Sinusoidal, 
 units=Meters}
 coordinate system string = 
 

[jira] [Commented] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983813#comment-13983813
 ] 

Tim Allison commented on TIKA-1283:
---

[~thaichat04], thank you, as always.  By thumbnail, I'd also want to include 
images/icons of documents that are included only for display purposes.  For 
example, the icon image (image1.emf) in test-documents/EmbeddedPDF.docx doesn't 
have a relationship=thumbnail, but I'd want to include that as a thumbnail 
because it appears as a v:shape within a w:object.  

The point you make about the differences in handling of these by application is 
right on.  Each application links thumbnail images to the underlying data in 
different ways, and we'll have to go application by application to do this 
correctly (whether we go with this or TIKA-90).

I'm not held to the original proposal in this issue, and I like the clarity of 
TIKA-90 quite a bit.  Some other thoughts...the signature I proposed above 
won't work because a given image can have more than one thumbnail (at least for 
RTFs) and it misses metadata around the thumbnail image (such as mediaType of 
the thumbnail). 

 Add thumbnail as possible metadata item to TikaCoreProperties
 ---

 Key: TIKA-1283
 URL: https://issues.apache.org/jira/browse/TIKA-1283
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Tim Allison
Priority: Minor

 TIKA-90 originally requested to add thumbnails to a document's metadata.
 I'd like to have a unified way of determining whether an embedded 
 document/resource is a thumbnail or a regular attachment.
 With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
 out more thumbnails than before.
 I propose adding tika:thumbnail to the metadata of each thumbnail image.  
 The consumer can then determine what to do with the embedded resource based 
 on the metadata.
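
 A minimal sketch of the consumer side of this proposal. It assumes each
 embedded resource's metadata is available as a plain string map, that the
 key is literally "tika:thumbnail", and that its value is "true" for
 thumbnails; none of this is confirmed by the issue beyond the key name. The
 class ThumbnailFilterSketch and its methods are hypothetical and do not use
 the Tika Metadata API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical consumer-side illustration of the proposed flag. */
final class ThumbnailFilterSketch {

    /** Key name taken from the proposal text; the "true" value convention is an assumption. */
    static final String THUMBNAIL_KEY = "tika:thumbnail";

    /** A missing key counts as "not a thumbnail" (parseBoolean(null) is false). */
    static boolean isThumbnail(Map<String, String> metadata) {
        return Boolean.parseBoolean(metadata.get(THUMBNAIL_KEY));
    }

    /** Keeps only the embedded resources that are not flagged as thumbnails. */
    static List<Map<String, String>> regularAttachments(List<Map<String, String>> embedded) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> metadata : embedded) {
            if (!isThumbnail(metadata)) {
                result.add(metadata);
            }
        }
        return result;
    }
}
```

 A consumer that only wants real attachments would call regularAttachments
 on the collected metadata; one that wants previews would keep the entries
 for which isThumbnail returns true.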





[jira] [Comment Edited] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983813#comment-13983813
 ] 

Tim Allison edited comment on TIKA-1283 at 4/29/14 12:47 AM:
-

[~thaichat04], thank you, as always.  By thumbnail, I'd also want to include 
images/icons of documents that are included only for display purposes.  For 
example, the icon image (image1.emf) in test-documents/EmbeddedPDF.docx doesn't 
have a relationship=thumbnail, but I'd want to include that as a thumbnail 
because it appears as a v:shape within a w:object.  

The point you make about the differences in handling of these by application is 
right on.  Each application links thumbnail images to the underlying data in 
different ways, and we'll have to go application by application to do this 
correctly (whether we go with this or TIKA-90).

I'm not held to the original proposal in this issue, and I like the clarity of 
TIKA-90 quite a bit.  Some other thoughts...the signature I proposed above 
won't work because a given embedded resource can have more than one thumbnail 
(at least for RTFs) and it misses metadata around the thumbnail image (such as 
mediaType of the thumbnail). 








[jira] [Comment Edited] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-04-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983813#comment-13983813
 ] 

Tim Allison edited comment on TIKA-1283 at 4/29/14 1:15 AM:


[~thaichat04], thank you, as always.  By thumbnail, I was thinking of 
container files that hold an image of an attachment for display purposes as 
well as the attachment data.  For example, the icon image (image1.emf) in 
test-documents/EmbeddedPDF.docx...I'd want to include that as a thumbnail 
because it appears as a v:shape within a w:object.  I think what you're 
describing is a file that contains a thumbnail of itself.  Both types of 
thumbnails (self- vs other-) could be handled equivalently in either TIKA-90 or 
this proposal...for now they are both treated as attachments along with the 
other traditional attachments.

The point you make about the differences in handling of these by application is 
right on.  Each application links thumbnail images to the underlying data in 
different ways, and we'll have to go application by application to do this 
correctly (whether we go with this or TIKA-90).

I'm not held to the original proposal in this issue, and I like the clarity of 
TIKA-90 quite a bit.  Some other thoughts...the signature I proposed above 
won't work because a given embedded resource can have more than one thumbnail 
(at least for RTFs) and it misses metadata around the thumbnail image (such as 
mediaType of the thumbnail). 
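
The two limitations named above (more than one thumbnail per embedded
resource, and per-thumbnail metadata such as mediaType) can be illustrated
with a small data model. This is only a sketch under assumptions: the classes
ThumbnailModelSketch, Thumbnail and EmbeddedResource are hypothetical, are not
part of Tika, and are not the signature proposed in this issue or in TIKA-90.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model; shows why a single boolean flag loses information. */
final class ThumbnailModelSketch {

    /** One thumbnail image together with the metadata that describes it. */
    static final class Thumbnail {
        final String resourceName;
        final String mediaType;

        Thumbnail(String resourceName, String mediaType) {
            this.resourceName = resourceName;
            this.mediaType = mediaType;
        }
    }

    /** An embedded resource that may carry zero, one or several thumbnails. */
    static final class EmbeddedResource {
        final String resourceName;
        private final List<Thumbnail> thumbnails = new ArrayList<>();

        EmbeddedResource(String resourceName) {
            this.resourceName = resourceName;
        }

        /** Returns this resource so calls can be chained. */
        EmbeddedResource addThumbnail(Thumbnail thumbnail) {
            thumbnails.add(thumbnail);
            return this;
        }

        /** Read-only view, in insertion order. */
        List<Thumbnail> thumbnails() {
            return Collections.unmodifiableList(thumbnails);
        }
    }
}
```

A self-thumbnail and an other-thumbnail would both fit this shape; the only
difference is whether the EmbeddedResource is the container file itself or
one of its attachments.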





