buildbot failure in ASF Buildbot on tika-trunk

2012-07-01 Thread buildbot
The Buildbot has detected a new failure on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/887

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: portunus_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1355868
Blamelist: jukka

BUILD FAILED: failed compile

sincerely,
 -The Buildbot





JAX-RS overhead in tika-server

2012-07-01 Thread Jukka Zitting
Hi,

I looked at tika-server in a bit more detail, and I'm a bit concerned
about the dependency overhead it needs for the JAX-RS support:

  +- org.apache.cxf:cxf-rt-frontend-jaxrs:jar:2.5.2
  |  +- org.apache.cxf:cxf-common-utilities:jar:2.5.2
  |  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0.1
  |  |  \- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.1
  |  |     \- org.codehaus.woodstox:stax2-api:jar:3.1.1
  |  +- org.apache.cxf:cxf-api:jar:2.5.2
  |  |  +- org.apache.neethi:neethi:jar:3.0.1
  |  |  \- wsdl4j:wsdl4j:jar:1.6.2
  |  +- org.apache.cxf:cxf-rt-core:jar:2.5.2
  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.4-1
  |  |  \- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1
  |  +- org.springframework:spring-core:jar:3.0.6.RELEASE
  |  |  \- org.springframework:spring-asm:jar:3.0.6.RELEASE
  |  +- javax.ws.rs:jsr311-api:jar:1.1.1
  |  +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.5.2
  |  +- org.apache.cxf:cxf-rt-transports-http:jar:2.5.2
  |  |  +- org.apache.cxf:cxf-rt-transports-common:jar:2.5.2
  |  |  \- org.springframework:spring-web:jar:3.0.6.RELEASE
  |  |     +- aopalliance:aopalliance:jar:1.0
  |  |     +- org.springframework:spring-beans:jar:3.0.6.RELEASE
  |  |     \- org.springframework:spring-context:jar:3.0.6.RELEASE
  |  |        +- org.springframework:spring-aop:jar:3.0.6.RELEASE
  |  |        \- org.springframework:spring-expression:jar:3.0.6.RELEASE
  |  \- org.codehaus.jettison:jettison:jar:1.3.1
  +- org.apache.cxf:cxf-rt-transports-http-jetty:jar:2.5.2
  |  +- org.eclipse.jetty:jetty-server:jar:7.5.4.v20111024
  |  |  +- org.eclipse.jetty:jetty-continuation:jar:7.5.4.v20111024
  |  |  \- org.eclipse.jetty:jetty-http:jar:7.5.4.v20111024
  |  |     \- org.eclipse.jetty:jetty-io:jar:7.5.4.v20111024
  |  |        \- org.eclipse.jetty:jetty-util:jar:7.5.4.v20111024
  |  +- org.eclipse.jetty:jetty-security:jar:7.5.4.v20111024
  |  \- org.apache.geronimo.specs:geronimo-servlet_2.5_spec:jar:1.1.2

That's about 7MB of middleware code. Do we really need all this? If
yes, who's going to review the licensing of all these dependencies and
come up with appropriate LICENSE/NOTICE files to include in the
tika-server jar?

The services exposed by tika-server are pretty simple and
straightforward, so I'm wondering if we could just replace all of the
above with just an embedded Jetty server, or even just the HttpCore
library [1].

[1] http://hc.apache.org/httpcomponents-core-ga/

BR,

Jukka Zitting
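
To make the "do we need JAX-RS at all" question a bit more concrete, here is a
minimal sketch of a single plain-HTTP endpoint with no JAX-RS runtime. It uses
the JDK's built-in com.sun.net.httpserver.HttpServer purely as a stand-in for
the embedded Jetty or HttpCore options mentioned above. The class name, the
/length path and the byte-counting handler are hypothetical and not part of
tika-server; a real handler would hand the request body to Tika instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a tika-server endpoint; not the actual tika-server code.
class PlainHttpSketch {

    // Starts a server on the loopback interface; port 0 picks any free port.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/length", exchange -> {
            // Consume the whole request body; a real service would pass this stream to a parser.
            long count = 0;
            try (InputStream in = exchange.getRequestBody()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    count += n;
                }
            }
            // Reply with the number of bytes received as plain text.
            byte[] reply = (count + "\n").getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(reply);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8080);
        System.out.println("Listening on " + server.getAddress());
    }
}
```

The trade-off is that routing, content negotiation and parameter binding all
become hand-written code, which is only attractive while the set of exposed
services stays small.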


Re: JAX-RS overhead in tika-server

2012-07-01 Thread Mattmann, Chris A (388J)
Hey Jukka,

On Jul 1, 2012, at 5:09 AM, Jukka Zitting wrote:

> Hi,
> 
> I looked at tika-server in a bit more detail, and I'm a bit concerned
> about the dependency overhead it needs for the JAX-RS support:
> 
>   +- org.apache.cxf:cxf-rt-frontend-jaxrs:jar:2.5.2
>   |  +- org.apache.cxf:cxf-common-utilities:jar:2.5.2
>   |  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0.1
>   |  |  \- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.1
>   |  |     \- org.codehaus.woodstox:stax2-api:jar:3.1.1
>   |  +- org.apache.cxf:cxf-api:jar:2.5.2
>   |  |  +- org.apache.neethi:neethi:jar:3.0.1
>   |  |  \- wsdl4j:wsdl4j:jar:1.6.2
>   |  +- org.apache.cxf:cxf-rt-core:jar:2.5.2
>   |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.4-1
>   |  |  \- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1
>   |  +- org.springframework:spring-core:jar:3.0.6.RELEASE
>   |  |  \- org.springframework:spring-asm:jar:3.0.6.RELEASE
>   |  +- javax.ws.rs:jsr311-api:jar:1.1.1
>   |  +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.5.2
>   |  +- org.apache.cxf:cxf-rt-transports-http:jar:2.5.2
>   |  |  +- org.apache.cxf:cxf-rt-transports-common:jar:2.5.2
>   |  |  \- org.springframework:spring-web:jar:3.0.6.RELEASE
>   |  |     +- aopalliance:aopalliance:jar:1.0
>   |  |     +- org.springframework:spring-beans:jar:3.0.6.RELEASE
>   |  |     \- org.springframework:spring-context:jar:3.0.6.RELEASE
>   |  |        +- org.springframework:spring-aop:jar:3.0.6.RELEASE
>   |  |        \- org.springframework:spring-expression:jar:3.0.6.RELEASE
>   |  \- org.codehaus.jettison:jettison:jar:1.3.1
>   +- org.apache.cxf:cxf-rt-transports-http-jetty:jar:2.5.2
>   |  +- org.eclipse.jetty:jetty-server:jar:7.5.4.v20111024
>   |  |  +- org.eclipse.jetty:jetty-continuation:jar:7.5.4.v20111024
>   |  |  \- org.eclipse.jetty:jetty-http:jar:7.5.4.v20111024
>   |  |     \- org.eclipse.jetty:jetty-io:jar:7.5.4.v20111024
>   |  |        \- org.eclipse.jetty:jetty-util:jar:7.5.4.v20111024
>   |  +- org.eclipse.jetty:jetty-security:jar:7.5.4.v20111024
>   |  \- org.apache.geronimo.specs:geronimo-servlet_2.5_spec:jar:1.1.2
> 
> That's about 7MB of middleware code. Do we really need all this?

That's a good question. My goal in moving us away from the Jersey 
code that was doing this was to move us away from Sun licensed code,
and on to Apache CXF, which I knew from OODT provided JAX-RS 
support. Also I wanted to consume the vetted Apache CXF code which
I figured would be a ton safer license wise than Jersey.

Sergey Beryozkin (who I'm CC'ing on this email since I'm not sure
he's subscribed to dev@) helped by providing guidance on the CXF
side while I was working on this with Max. If you scope out [1], Max
brought up the large # of dependencies too and Sergey's response
was that in 2.6 there are only a few required dependencies:

*
[INFO] +- org.apache.cxf:cxf-api:jar:2.6.0-SNAPSHOT:compile
[INFO] |  +- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.2:runtime
[INFO] |  |  \- org.codehaus.woodstox:stax2-api:jar:3.1.1:runtime
[INFO] |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0.1:compile
[INFO] |  +- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1:compile
[INFO] |  \- wsdl4j:wsdl4j:jar:1.6.2:compile
[INFO] +- org.apache.cxf:cxf-rt-core:jar:2.6.0-SNAPSHOT:compile
[INFO] |  \- com.sun.xml.bind:jaxb-impl:jar:2.1.13:compile
**

Maybe we should try upgrading to 2.6 if it's out?


> If
> yes, who's going to review the licensing of all these dependencies and
> come up with appropriate LICENSE/NOTICE files to include in the
> tika-server jar?

These are CXF dependencies, and I'm sure there are relevant
entries in Apache CXF for them, no? And aren't we eating our own
dog food here for this?

> 
> The services exposed by tika-server are pretty simple and
> straightforward, so I'm wondering if we could just replace all of the
> above with just an embedded Jetty server, or even just the HttpCore
> library [1].
> 
> [1] http://hc.apache.org/httpcomponents-core-ga/

Does HTTP core provide JAX-RS support?

Thanks!

Cheers,
Chris

[1] http://s.apache.org/0I

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: svn commit: r1355877 - in /tika/trunk: ./ tika-dll/ tika-dll/src/ tika-dll/src/main/ tika-dll/src/main/csharp/ tika-dll/src/main/csharp/Apache/

2012-07-01 Thread Mattmann, Chris A (388J)
WOW nice Jukka, you did it!

Cheers,
Chris

On Jul 1, 2012, at 6:04 AM, 
  wrote:

> Author: jukka
> Date: Sun Jul  1 13:04:00 2012
> New Revision: 1355877
> 
> URL: http://svn.apache.org/viewvc?rev=1355877&view=rev
> Log:
> TIKA-773: .NET version of Tika
> 
> Add a basic Tika.dll build
> 
> Added:
>tika/trunk/tika-dll/
>tika/trunk/tika-dll/.gitignore
>tika/trunk/tika-dll/AssemblyInfo.cs
>tika/trunk/tika-dll/Tika.csproj
>tika/trunk/tika-dll/Tika.sln
>tika/trunk/tika-dll/pom.xml
>tika/trunk/tika-dll/src/
>tika/trunk/tika-dll/src/main/
>tika/trunk/tika-dll/src/main/csharp/
>tika/trunk/tika-dll/src/main/csharp/Apache/
>tika/trunk/tika-dll/src/main/csharp/Apache/Tika.cs
[...snip...]





[jira] [Created] (TIKA-944) Extend tika-server API to be consistent with tika-app CLI

2012-07-01 Thread Jason Judge (JIRA)
Jason Judge created TIKA-944:


 Summary: Extend tika-server API to be consistent with tika-app CLI
 Key: TIKA-944
 URL: https://issues.apache.org/jira/browse/TIKA-944
 Project: Tika
  Issue Type: New Feature
Affects Versions: 1.1
 Environment: Any
Reporter: Jason Judge


The tika-server API (web service) provides a limited set of functionality 
compared to the tika-app command-line version. Notable things missing are:

1. Language recognition.
2. Output in various formats (JSON for metadata, XHTML for the extracted text).

Those are the two main things that would be useful to me, but ideally the 
server should be able to provide all the functionality that the command-line 
app does, taking the command-line as the model to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-944) Extend tika-server API to be consistent with tika-app CLI

2012-07-01 Thread Jason Judge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404765#comment-13404765
 ] 

Jason Judge commented on TIKA-944:
--

I'm not a Java developer and don't have a build environment set up. I can help 
test under a Windows environment (with a shell script installed) or a Linux web 
server. If setting up a development environment to compile and test the source 
is simple, I'm happy to do that, but I don't have the time at present to go 
down the route of learning Java and Maven.

> Extend tika-server API to be consistent with tika-app CLI
> -
>
> Key: TIKA-944
> URL: https://issues.apache.org/jira/browse/TIKA-944
> Project: Tika
>  Issue Type: New Feature
>Affects Versions: 1.1
> Environment: Any
>Reporter: Jason Judge
>  Labels: exposed-functionality, tika-server
>
> The tika-server API (web service) provides a limited set of functionality 
> compared to the tika-app command-line version. Notable things missing are:
> 1. Language recognition.
> 2. Output in various formats (JSON for metadata, XHTML for the extracted 
> text).
> Those are the two main things that would be useful to me, but ideally the 
> server should be able to provide all the functionality that the command-line 
> app does, taking the command-line as the model to follow.





Re: svn commit: r1355947 - /tika/trunk/tika-parent/pom.xml

2012-07-01 Thread Mattmann, Chris A (388J)
Great job Ray!!

Cheers,
Chris

On Jul 1, 2012, at 9:39 AM, 
  wrote:

> Author: rgauss
> Date: Sun Jul  1 16:39:29 2012
> New Revision: 1355947
> 





[jira] [Commented] (TIKA-944) Extend tika-server API to be consistent with tika-app CLI

2012-07-01 Thread Jason Judge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404769#comment-13404769
 ] 

Jason Judge commented on TIKA-944:
--

I'm not sure if you need more specific details, such as suggested paths, but 
please say if you do. Obviously I'm not qualified enough to see the complete 
scope of what Tika is capable of, so don't want to stomp over any paths that 
would be more suited to other functionality.

> Extend tika-server API to be consistent with tika-app CLI
> -
>
> Key: TIKA-944
> URL: https://issues.apache.org/jira/browse/TIKA-944
> Project: Tika
>  Issue Type: New Feature
>Affects Versions: 1.1
> Environment: Any
>Reporter: Jason Judge
>  Labels: exposed-functionality, tika-server
>
> The tika-server API (web service) provides a limited set of functionality 
> compared to the tika-app command-line version. Notable things missing are:
> 1. Language recognition.
> 2. Output in various formats (JSON for metadata, XHTML for the extracted 
> text).
> Those are the two main things that would be useful to me, but ideally the 
> server should be able to provide all the functionality that the command-line 
> app does, taking the command-line as the model to follow.





[jira] [Comment Edited] (TIKA-944) Extend tika-server API to be consistent with tika-app CLI

2012-07-01 Thread Jason Judge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404769#comment-13404769
 ] 

Jason Judge edited comment on TIKA-944 at 7/1/12 5:01 PM:
--

I'm not sure if you need more specific details, such as suggested paths, but 
please say if you do. Obviously I'm not qualified enough to see the complete 
scope of what Tika is capable of, so don't want to stomp over any paths that 
would be more suited to other functionality that I've never seen.

  was (Author: judgej):
I'm not sure if you need more specific details, such as suggested paths, 
but please say if you do. Obviously I'm not qualified enough to see the 
complete scope of what Tika is capable of, so don't want to stomp over any 
paths that would be more suited to other functionality.
  
> Extend tika-server API to be consistent with tika-app CLI
> -
>
> Key: TIKA-944
> URL: https://issues.apache.org/jira/browse/TIKA-944
> Project: Tika
>  Issue Type: New Feature
>Affects Versions: 1.1
> Environment: Any
>Reporter: Jason Judge
>  Labels: exposed-functionality, tika-server
>
> The tika-server API (web service) provides a limited set of functionality 
> compared to the tika-app command-line version. Notable things missing are:
> 1. Language recognition.
> 2. Output in various formats (JSON for metadata, XHTML for the extracted 
> text).
> Those are the two main things that would be useful to me, but ideally the 
> server should be able to provide all the functionality that the command-line 
> app does, taking the command-line as the model to follow.





Re: JAX-RS overhead in tika-server

2012-07-01 Thread Jukka Zitting
Hi,

On Sun, Jul 1, 2012 at 6:27 PM, Mattmann, Chris A (388J)
 wrote:
> On Jul 1, 2012, at 5:09 AM, Jukka Zitting wrote:
> Sergey Beryozkin (who I'm CC'ing on this email since I'm not sure
> he's subscribed to dev@) helped by providing guidance on the CXF
> side while I was working on this with Max. If you scope out [1], Max
> brought up the large # of dependencies too and Sergey's response
> was that in 2.6 there are only a few required dependencies:

OK, sounds much better already.

> These are CXF dependencies, and I'm sure there are relevant
> entries in Apache CXF for them, no? And aren't we eating our own
> dog food here for this?

That's right, but we still need to look into the relevant licensing
details of CXF and all its dependencies to see if there's any
non-ALv2 code that needs to be mentioned also in our LICENSE/NOTICE
files since we're embedding all that code inside the tika-server jar.

> Does HTTP core provide JAX-RS support?

No. I'm just wondering if we even need JAX-RS in the first place as
the exposed services are fairly simple.

BR,

Jukka Zitting


[jira] [Updated] (TIKA-872) Tika --extract fails for RTF

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-872:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Tika --extract fails for RTF
> 
>
> Key: TIKA-872
> URL: https://issues.apache.org/jira/browse/TIKA-872
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.0
> Environment: Windows 7 with Java v1.6
>Reporter: Albert L.
> Fix For: 1.3
>
> Attachments: embedded.rtf.zip
>
>
> A file that is embedded in an RTF file doesn't get extracted to disk.
> To "embed" a file into an RTF, simply drag-drop it into an RTF document when 
> using MS-Word 2010.  It will then create an EMF of the embedded file's 
> preview.
> See attached file "embedded.rtf.zip" for an example input file that fails 
> with Tika v1.0.





[jira] [Updated] (TIKA-774) ExifTool Parser

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-774:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, newbie, patch,
> Fix For: 1.3
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.
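
As an illustration of the "call ExifTool and map the response to metadata
fields" step described above, the following is a minimal sketch and not the
code from the attached patches. It assumes ExifTool's default text output of
one "Tag Name : value" pair per line, and it only covers the parsing and
mapping part; actually launching the exiftool process is left out. The class
name and the field names in the mapping table are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not the ExiftoolMetadataExtractor from the attached patch.
class ExifToolOutputSketch {

    // Parses "Tag Name : value" lines into a multi-valued map keyed by the mapped field name.
    static Map<String, List<String>> parse(String output, Map<String, String> fieldMapping) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (String line : output.split("\\r?\\n")) {
            // Split on the first colon only, since values such as dates contain colons too.
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a tag/value line
            }
            String tag = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            String field = fieldMapping.get(tag);
            if (field == null || value.isEmpty()) {
                continue; // unmapped tag or empty value
            }
            metadata.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Illustrative mapping only; the field names are invented for this example.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("Image Width", "example:width");
        mapping.put("Create Date", "example:created");
        String sample = "Image Width                     : 100\n"
                      + "Create Date                     : 2012:07:01 13:04:00\n";
        System.out.println(parse(sample, mapping));
    }
}
```

Keeping the tag-to-field table as data rather than code is one way to let the
mapping be enabled, disabled or extended without touching the parser.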





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---


- push to 1.3

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10
>Reporter: Michael McCandless
> Fix For: 1.3
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that  is never
> embedded inside another ; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.025 sec  <<< 
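
The matched-tag check described in the issue above (each startElement must be
closed by the corresponding endElement) can be sketched with a stack of open
element names. This is a minimal illustration and not the actual
SafeContentHandler code: the class name is hypothetical, the error wording is
borrowed from the failures quoted above, and it throws SAXException where the
issue talks about asserts. The additional nesting check mentioned in the issue
is not covered.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical checker; not the actual SafeContentHandler implementation.
class MatchedTagChecker extends DefaultHandler {

    // Names of the elements that have been opened but not yet closed.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + qName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new SAXException("mismatched elements open=" + expected + " close=" + qName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed elements at end of document: " + open);
        }
    }

    public static void main(String[] args) {
        // Reproduces the "table closed before tr" situation from the Keynote failure.
        MatchedTagChecker checker = new MatchedTagChecker();
        checker.startElement("", "table", "table", new AttributesImpl());
        checker.startElement("", "tr", "tr", new AttributesImpl());
        try {
            checker.endElement("", "table", "table");
        } catch (SAXException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Such a checker could be chained in front of the real handler in tests, so that
a parser emitting unbalanced events fails at the offending endElement call.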

[jira] [Updated] (TIKA-906) Headers, footers, and footnotes not extracted from Pages documents

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-906:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Headers, footers, and footnotes not extracted from Pages documents
> --
>
> Key: TIKA-906
> URL: https://issues.apache.org/jira/browse/TIKA-906
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0
> Environment: Windows 7
>Reporter: Gabriel Valencia
>  Labels: iWork
> Fix For: 1.3
>
> Attachments: testPagesHeadersFootersFootnotesJIRA.pages
>
>
> Tika does not extract anything from the header or footer area and also does 
> not extract footnotes.





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10
>Reporter: Michael McCandless
> Fix For: 1.3
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that  is never
> embedded inside another ; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser
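
For illustration, the matched-tag check described in this issue can be sketched with a plain JDK SAX handler. This is only a sketch under stated assumptions, not the commented-out code in SafeContentHandler: the class name TagBalanceChecker is hypothetical, it throws IllegalStateException instead of using asserts, and its messages are merely modelled on the failures quoted in the issue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker: verifies that SAX start/end element events nest properly. */
final class TagBalanceChecker extends DefaultHandler {
    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Extra rule mentioned in the issue: no <p> inside another open <p>.
        if ("p".equals(localName) && open.contains("p")) {
            throw new IllegalStateException("p embedded inside another p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end tag=" + localName + " with no startElement");
        }
        String current = open.pop();
        if (!current.equals(localName)) {
            throw new IllegalStateException(
                "mismatched elements open=" + current + " close=" + localName);
        }
    }

    @Override
    public void endDocument() {
        // Anything still open here was never closed by the parser.
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed elements at end of document: " + open);
        }
    }
}
```

Wrapping a parser's output handler with a checker like this in unit tests would surface cases such as closing a table without closing the tr first. Checking that tags only appear in valid parents, as suggested in the issue, would need a table of allowed parent/child pairs on top of this.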

[jira] [Updated] (TIKA-757) Address TODOs when we upgrade to next POI release (3.8 beta 5)

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-757:
---


- push to 1.3

> Address TODOs when we upgrade to next POI release (3.8 beta 5)
> --
>
> Key: TIKA-757
> URL: https://issues.apache.org/jira/browse/TIKA-757
> Project: Tika
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 1.3
>
>
> I'm opening a blanket issue to remind us all to address the TODOs in the 
> sources for when we upgrade to the next POI.
> I think this (a single blanket issue) is better than keeping separate issues 
> open even though they are technically fixed?
> For example, I've committed TIKA-753 (speedups for embedded office docs), yet 
> it included some TODOs for further speedups possible once we upgrade POI.  
> Rather than keeping TIKA-753 (and others like it) open, I think we should 
> resolve them and let this issue cover all the TODOs.





[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-868:
---


- push to 1.3

> TXT parser does not honour the specified encoding
> -
>
> Key: TIKA-868
> URL: https://issues.apache.org/jira/browse/TIKA-868
> Project: Tika
>  Issue Type: Bug
>Reporter: Daniel Bonniot de Ruisselet
> Fix For: 1.3
>
>
> With input text "Indanyl", the encoding is recognized as IBM500, even when 
> "UTF-8" is specified explicitly.
> I would argue that detection should only be used when the declared 
> information is incorrect (saving time and avoiding wrong detection), as 
> proposed by Ken Krugler in TIKA-539.
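
The policy argued for here (trust the declared encoding, and fall back to detection only when the declaration is unusable) can be sketched with the JDK alone. This is a hedged illustration, not Tika's TXT parser: DeclaredEncodingFirst, choose and the detector callback are hypothetical names, and "incorrect" is interpreted as "the charset name is unknown, or the bytes do not decode strictly under it".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.function.Function;

/** Hypothetical helper: prefer the declared charset, detect only as a fallback. */
final class DeclaredEncodingFirst {
    private DeclaredEncodingFirst() {}

    static Charset choose(byte[] data, String declared, Function<byte[], Charset> detector) {
        if (declared != null) {
            try {
                Charset charset = Charset.forName(declared);
                // Strict decode: malformed or unmappable input raises an exception
                // instead of being silently replaced.
                charset.newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT)
                       .decode(ByteBuffer.wrap(data));
                return charset; // the declaration is usable, so no detection is run
            } catch (IllegalArgumentException | CharacterCodingException e) {
                // Unknown or illegal charset name, or the bytes contradict the
                // declaration: fall through to detection.
            }
        }
        return detector.apply(data);
    }
}
```

With this policy the bytes of "Indanyl" declared as "UTF-8" decode cleanly, so the detector (and any wrong guess it might make) is never consulted. A strict decode only rejects declarations that are provably wrong, so a declared single-byte charset that accepts every byte value would always win.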





[jira] [Updated] (TIKA-868) TXT parser does not honour the specified encoding

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-868:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> TXT parser does not honour the specified encoding
> -
>
> Key: TIKA-868
> URL: https://issues.apache.org/jira/browse/TIKA-868
> Project: Tika
>  Issue Type: Bug
>Reporter: Daniel Bonniot de Ruisselet
> Fix For: 1.3
>
>
> With input text "Indanyl", the encoding is recognized as IBM500, even when 
> "UTF-8" is specified explicitly.
> I would argue that detection should only be used when the declared 
> information is incorrect (saving time and avoiding wrong detection), as 
> proposed by Ken Krugler in TIKA-539.





[jira] [Updated] (TIKA-757) Address TODOs when we upgrade to next POI release (3.8 beta 5)

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-757:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Address TODOs when we upgrade to next POI release (3.8 beta 5)
> --
>
> Key: TIKA-757
> URL: https://issues.apache.org/jira/browse/TIKA-757
> Project: Tika
>  Issue Type: Improvement
>Reporter: Michael McCandless
> Fix For: 1.3
>
>
> I'm opening a blanket issue to remind us all to address the TODOs in the 
> sources for when we upgrade to the next POI.
> I think this (a single blanket issue) is better than keeping separate issues 
> open even though they are technically fixed?
> For example, I've committed TIKA-753 (speedups for embedded office docs), yet 
> it included some TODOs for further speedups possible once we upgrade POI.  
> Rather than keeping TIKA-753 (and others like it) open, I think we should 
> resolve them and let this issue cover all the TODOs.





[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-891:
---


- push to 1.3

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.3
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---


- push to 1.3

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
> Fix For: 1.3
>
> Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can
> come from the HTTP response header).
> Test code to reproduce:
>   // Markup reconstructed: the archived copy lost the HTML tags in these
>   // literals, so the exact tags are an assumption.
>   static String content = "<html><head>\n"
>       + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>       + "</head><body>Über den Wolken</body></html>\n";
>
>   /**
>    * @param args
>    * @throws IOException
>    * @throws TikaException
>    * @throws SAXException
>    */
>   public static void main(String[] args) throws IOException, SAXException,
>       TikaException {
>     // Declare the correct encoding up front, as an HTTP response header would.
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.CONTENT_TYPE, "text/html");
>     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>     AutoDetectParser parser = new AutoDetectParser();
>     // Write limit assumed to be 10000; the archived copy reads 1.
>     BodyContentHandler h = new BodyContentHandler(10000);
>     parser.parse(in, h, metadata, new ParseContext());
>     System.out.print(h.toString());
>     // Print the encoding again after parsing, for comparison with the value set above.
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   }





[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-817:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> (PPT/PPTX) Missing date/time in text content.
> -
>
> Key: TIKA-817
> URL: https://issues.apache.org/jira/browse/TIKA-817
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.0
> Environment: Win7-64 + java version "1.6.0_26"
>Reporter: Albert L.
> Fix For: 1.3
>
>
> Missing date/time text in text content for PPT and PPTX files.
> The date and time are missing from the text content.  This occurs when one 
> chooses the following with MS-PowerPoint 2010:
> 1) "Insert"
> 2) "Date & Time"
> 3) "Update automatically"
> 4) save to PPT or PPTX





[jira] [Updated] (TIKA-817) (PPT/PPTX) Missing date/time in text content.

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-817:
---


- push to 1.3

> (PPT/PPTX) Missing date/time in text content.
> -
>
> Key: TIKA-817
> URL: https://issues.apache.org/jira/browse/TIKA-817
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.0
> Environment: Win7-64 + java version "1.6.0_26"
>Reporter: Albert L.
> Fix For: 1.3
>
>
> Missing date/time text in text content for PPT and PPTX files.
> The date and time are missing from the text content.  This occurs when one 
> chooses the following with MS-PowerPoint 2010:
> 1) "Insert"
> 2) "Date & Time"
> 3) "Update automatically"
> 4) save to PPT or PPTX





[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-891:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.3
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Updated] (TIKA-775) Embed Capabilities

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-775:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Embed Capabilities
> --
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
>  Issue Type: Improvement
>  Components: general, metadata
>Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be 
> installed.
>Reporter: Ray Gauss II
>  Labels: embed, patch
> Fix For: 1.3
>
> Attachments: tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.





[jira] [Updated] (TIKA-775) Embed Capabilities

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-775:
---


- push to 1.3

> Embed Capabilities
> --
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
>  Issue Type: Improvement
>  Components: general, metadata
>Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be 
> installed.
>Reporter: Ray Gauss II
>  Labels: embed, patch
> Fix For: 1.3
>
> Attachments: tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.





[jira] [Updated] (TIKA-820) Locator is unset for HTML parser

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-820:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Locator is unset for HTML parser
> 
>
> Key: TIKA-820
> URL: https://issues.apache.org/jira/browse/TIKA-820
> Project: Tika
>  Issue Type: Bug
>  Components: general, parser
>Affects Versions: 1.0
>Reporter: Daniel Bonniot de Ruisselet
>  Labels: patch
> Fix For: 1.3
>
> Attachments: text-locator.patch
>
>
> The HtmlParser does not call setDocumentLocator(Locator locator) on the 
> user's content handler.
> Patch and unit test attached.
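
For context, this is what a user's content handler relies on when a parser does call setDocumentLocator. The sketch below uses the JDK's own SAX parser on XML input, not Tika's HtmlParser, and LocatorDemo is a hypothetical class name; it only shows the contract the issue says is missing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records the line on which a given element starts. */
final class LocatorDemo extends DefaultHandler {
    private final String target;
    private Locator locator;      // stays null if the parser never supplies one
    private int targetLine = -1;

    LocatorDemo(String target) {
        this.target = target;
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before startDocument(), if it supports locators.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null && qName.equals(target)) {
            targetLine = locator.getLineNumber();
        }
    }

    boolean sawLocator() {
        return locator != null;
    }

    int targetLine() {
        return targetLine;
    }

    /** Parses the given XML with the JDK SAX parser and returns the populated handler. */
    static LocatorDemo parse(String xml, String target) {
        LocatorDemo handler = new LocatorDemo(target);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

A handler written this way silently loses all position information when the parser never calls setDocumentLocator, which is the behaviour reported for HtmlParser here.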





[jira] [Updated] (TIKA-605) Tika GDAL parser

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---


- push to 1.3

> Tika GDAL parser
> 
>
> Key: TIKA-605
> URL: https://issues.apache.org/jira/browse/TIKA-605
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
> Environment: indep. of env.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: gdal, integration, tika
> Fix For: 1.3
>
> Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
> TIKA-605.Mattmann.092511.patch.txt
>
>
> Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
> around GDAL. See here: 
> http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java





[jira] [Updated] (TIKA-820) Locator is unset for HTML parser

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-820:
---


- push to 1.3

> Locator is unset for HTML parser
> 
>
> Key: TIKA-820
> URL: https://issues.apache.org/jira/browse/TIKA-820
> Project: Tika
>  Issue Type: Bug
>  Components: general, parser
>Affects Versions: 1.0
>Reporter: Daniel Bonniot de Ruisselet
>  Labels: patch
> Fix For: 1.3
>
> Attachments: text-locator.patch
>
>
> The HtmlParser does not call setDocumentLocator(Locator locator) on the 
> user's content handler.
> Patch and unit test attached.





[jira] [Updated] (TIKA-605) Tika GDAL parser

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Tika GDAL parser
> 
>
> Key: TIKA-605
> URL: https://issues.apache.org/jira/browse/TIKA-605
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
> Environment: indep. of env.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: gdal, integration, tika
> Fix For: 1.3
>
> Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
> TIKA-605.Mattmann.092511.patch.txt
>
>
> Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
> around GDAL. See here: 
> http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java





[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
> Fix For: 1.3
>
> Attachments: TIKA-539.patch, TIKA-539_2.patch
>
>
> If the encoding in the meta tag is wrong, that encoding is still detected,
> even if the right encoding was set in the metadata beforehand (which can
> come from the HTTP response header).
> Test code to reproduce:
>   // Markup reconstructed: the archived copy lost the HTML tags in these
>   // literals, so the exact tags are an assumption.
>   static String content = "<html><head>\n"
>       + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>       + "</head><body>Über den Wolken</body></html>\n";
>
>   /**
>    * @param args
>    * @throws IOException
>    * @throws TikaException
>    * @throws SAXException
>    */
>   public static void main(String[] args) throws IOException, SAXException,
>       TikaException {
>     // Declare the correct encoding up front, as an HTTP response header would.
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.CONTENT_TYPE, "text/html");
>     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>     AutoDetectParser parser = new AutoDetectParser();
>     // Write limit assumed to be 10000; the archived copy reads 1.
>     BodyContentHandler h = new BodyContentHandler(10000);
>     parser.parse(in, h, metadata, new ParseContext());
>     System.out.print(h.toString());
>     // Print the encoding again after parsing, for comparison with the value set above.
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>   }





[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-754:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Automatic line break insertion (BR element) instead of '\n' in 
> XHTMLContentHandler
> --
>
> Key: TIKA-754
> URL: https://issues.apache.org/jira/browse/TIKA-754
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 0.10, 1.0
>Reporter: Pablo Queixalos
>Priority: Minor
> Fix For: 1.3
>
> Attachments: TIKA-754.poc.patch
>
>
> As seen with some parsers (PDF, PPT), some text blocks still contain raw 
> line feeds ('\n') in the output XHTML. 
> A global fix for this could be located in XHTMLContentHandler.characters(...):
> scan the given char array and, whenever a '\n' char is encountered, insert a 
> BR element instead.
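
The idea can be sketched as a SAX filter built only on JDK classes. This is an illustration under stated assumptions, not the attached TIKA-754.poc.patch and not XHTMLContentHandler itself: LineBreakFilter and its render helper are hypothetical names, and '\r' handling and whitespace-preserving elements such as pre are deliberately ignored.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: replaces each '\n' in character data with an empty br element. */
final class LineBreakFilter extends XMLFilterImpl {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private static final Attributes NO_ATTRIBUTES = new AttributesImpl();

    LineBreakFilter(ContentHandler downstream) {
        // XMLFilterImpl forwards every event to this handler by default.
        setContentHandler(downstream);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int end = start + length;
        int segmentStart = start;
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n') {
                // Flush the text before the line feed, then emit <br/> in its place.
                if (i > segmentStart) {
                    super.characters(ch, segmentStart, i - segmentStart);
                }
                super.startElement(XHTML, "br", "br", NO_ATTRIBUTES);
                super.endElement(XHTML, "br", "br");
                segmentStart = i + 1;
            }
        }
        if (end > segmentStart) {
            super.characters(ch, segmentStart, end - segmentStart);
        }
    }

    /** Demo helper: runs one characters() call and renders the resulting events as text. */
    static String render(String text) {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append("/>");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            new LineBreakFilter(recorder).characters(text.toCharArray(), 0, text.length());
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Doing this in one shared place means individual parsers (PDF, PPT) would not each need their own newline handling, which is the appeal of the global fix proposed in the issue.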





[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---


- push to 1.3

> Make Option to Exclude Embedded Files' Text for Text Content
> 
>
> Key: TIKA-819
> URL: https://issues.apache.org/jira/browse/TIKA-819
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.0
> Environment: Windows-7 + JDK 1.6 u26
>Reporter: Albert L.
> Fix For: 1.3
>
>
> It would be nice to be able to disable text content from embedded files.
> For example, if I have a DOCX with an embedded PPTX, then I would like the 
> option to disable text from the PPTX from showing up when asking for the text 
> content from DOCX.  In other words, it would be nice to have the option to 
> get text content *only* from the DOCX instead of the DOCX+PPTX.





[jira] [Updated] (TIKA-776) ExifTool Embedder

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---


- push to 1.3

> ExifTool Embedder
> -
>
> Key: TIKA-776
> URL: https://issues.apache.org/jira/browse/TIKA-776
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata
>Affects Versions: 1.0
> Environment: ExifTool is required 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: embed, exiftool, patch
> Fix For: 1.3
>
> Attachments: tika-parsers-exiftool-embed-patch.txt
>
>
> This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
> issue TIKA-774 and TIKA-775.
> In the tika-parsers an ExiftoolExternalEmbedder is added which extends 
> ExternalEmbedder to programmatically create an Embedder which calls the 
> ExifTool command line to embed tika metadata into a file stream and an 
> ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and 
> XMP fields then parses the resulting file stream to verify the operation.





[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> Make Option to Exclude Embedded Files' Text for Text Content
> 
>
> Key: TIKA-819
> URL: https://issues.apache.org/jira/browse/TIKA-819
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.0
> Environment: Windows-7 + JDK 1.6 u26
>Reporter: Albert L.
> Fix For: 1.3
>
>
> It would be nice to be able to disable text content from embedded files.
> For example, if I have a DOCX with an embedded PPTX, then I would like the 
> option to disable text from the PPTX from showing up when asking for the text 
> content from DOCX.  In other words, it would be nice to have the option to 
> get text content *only* from the DOCX instead of the DOCX+PPTX.





[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-754:
---


- push to 1.3

> Automatic line break insertion (BR element) instead of '\n' in 
> XHTMLContentHandler
> --
>
> Key: TIKA-754
> URL: https://issues.apache.org/jira/browse/TIKA-754
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 0.10, 1.0
>Reporter: Pablo Queixalos
>Priority: Minor
> Fix For: 1.3
>
> Attachments: TIKA-754.poc.patch
>
>
> As seen with some parsers (PDF, PPT), some text blocks still contain raw 
> line feeds ('\n') in the output XHTML. 
> A global fix for this could be located in XHTMLContentHandler.characters(...):
> scan the given char array and, whenever a '\n' char is encountered, insert a 
> BR element instead.





[jira] [Updated] (TIKA-776) ExifTool Embedder

2012-07-01 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---

Fix Version/s: (was: 1.2)
   1.3

- push to 1.3

> ExifTool Embedder
> -
>
> Key: TIKA-776
> URL: https://issues.apache.org/jira/browse/TIKA-776
> Project: Tika
> Issue Type: New Feature
> Components: metadata
> Affects Versions: 1.0
> Environment: ExifTool is required (http://www.sno.phy.queensu.ca/~phil/exiftool/)
> Reporter: Ray Gauss II
> Labels: embed, exiftool, patch
> Fix For: 1.3
>
> Attachments: tika-parsers-exiftool-embed-patch.txt
>
>
> This patch adds an ExifTool ExternalEmbedder which builds upon the work in
> issues TIKA-774 and TIKA-775.
> In tika-parsers, an ExiftoolExternalEmbedder is added which extends
> ExternalEmbedder to programmatically create an Embedder that calls the
> ExifTool command line to embed Tika metadata into a file stream. An
> ExiftoolExternalEmbedderTest unit test is also added, which embeds several
> IPTC and XMP fields and then parses the resulting file stream to verify the
> operation.
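
As a rough illustration of what "calling the ExifTool command line" can look
like, here is a sketch that builds an exiftool invocation from a map of tag
names to values. It is not the code from the attached patch: the class and
method names are hypothetical, and it writes to a file path instead of a
stream. It assumes only the standard exiftool -TAG=VALUE assignment syntax
and the -overwrite_original option.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExifToolCommandSketch {
    /** Builds an argument list such as: exiftool -overwrite_original -TAG=VALUE ... FILE */
    static List<String> buildCommand(String file, Map<String, String> tags) {
        List<String> command = new ArrayList<>();
        command.add("exiftool");
        command.add("-overwrite_original");
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            // One -TAG=VALUE argument per metadata field.
            command.add("-" + tag.getKey() + "=" + tag.getValue());
        }
        command.add(file);
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("XMP-dc:Title", "Example title");
        tags.put("IPTC:Keywords", "tika");
        List<String> command = buildCommand("photo.jpg", tags);
        System.out.println(command);
        // To run it for real (requires exiftool on the PATH):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting problems when a
metadata value contains spaces.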





[jira] [Created] (TIKA-945) Upgrade tika-server to CXF 2.6.1

2012-07-01 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created TIKA-945:
--

Summary: Upgrade tika-server to CXF 2.6.1
Key: TIKA-945
URL: https://issues.apache.org/jira/browse/TIKA-945
Project: Tika
Issue Type: Bug
Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Critical
Fix For: 1.2


Per discussions on the dev list:

http://s.apache.org/ApK

Let's upgrade tika-server to Apache CXF 2.6.1 in order to reduce the 
dependencies.





[jira] [Commented] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2012-07-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404833#comment-13404833
 ] 

Michael McCandless commented on TIKA-758:
-

Looks like the TODOs are all in PDF2XHTML.java, currently:

{noformat}
mike@vine:/l/tika.trunk$ grep -r TODO . | grep -i PDFBOX | grep .java:
./tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java:// TODO: remove once PDFBOX-1130 is fixed:
./tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java:// TODO: remove once PDFBOX-1143 is fixed:
./tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java:// TODO: remove once PDFBOX-1130 is fixed
./tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java:// TODO: remove once PDFBOX-1130 is fixed
{noformat}


> Address TODOs when we upgrade to next PDFBox release
> 
>
> Key: TIKA-758
> URL: https://issues.apache.org/jira/browse/TIKA-758
> Project: Tika
> Issue Type: Improvement
> Reporter: Michael McCandless
>
> Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
> the code when we next upgrade PDFBox.





Re: JAX-RS overhead in tika-server

2012-07-01 Thread Mattmann, Chris A (388J)
Hey Jukka,

On Jul 1, 2012, at 12:01 PM, Jukka Zitting wrote:

> Hi,
> 
> On Sun, Jul 1, 2012 at 6:27 PM, Mattmann, Chris A (388J)
>  wrote:
>> On Jul 1, 2012, at 5:09 AM, Jukka Zitting wrote:
>> Sergey Beryozkin (who I'm CC'ing on this email since I'm not sure
>> he's subscribed to dev@) helped by providing guidance on the CXF
>> side while I was working on this with Max. If you scope out [1], Max
>> brought up the large # of dependencies too and Sergey's response
>> was that in 2.6 there are only a few required dependencies:
> 
> OK, sounds much better already.

Coolio. I'll add this to my TODO list over the next few weeks, I added:

https://issues.apache.org/jira/browse/TIKA-945

To track it. I listed it as critical for 1.2.

I also plan to spin a 1.2 release candidate at some point in the next week or
so. I realize the metadata stuff isn't done yet, but it's better to simply
release early and often, and then when stuff is ready it can be included.
Releasing is a light-weight process in Tika, so it's no biggie.

> 
>> These are CXF dependencies, and I'm sure there are relevant
>> entries in Apache CXF for them, no? And, aren't we eating our own
>> dog food here for this?
> 
> That's right, but we still need to look into the relevant licensing
> details of CXF and all its dependencies to see if there's any
> non-ALv2 code that needs to be mentioned also in our LICENSE/NOTICE
> files since we're embedding all that code inside the tika-server jar.

ACK, got it.

> 
>> Does HTTP core provide JAX-RS support?
> 
> No. I'm just wondering if we even need JAX-RS in the first place as
> the exposed services are fairly simple.

Cool cool. Let's see if we can solve some of this stuff with the 2.6.1 upgrade
and then go from there.

Thanks dude.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] [Reopened] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2012-07-01 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting reopened TIKA-758:



OK, thanks! Reopening until those ones are addressed.

> Address TODOs when we upgrade to next PDFBox release
> 
>
> Key: TIKA-758
> URL: https://issues.apache.org/jira/browse/TIKA-758
> Project: Tika
> Issue Type: Improvement
> Reporter: Michael McCandless
>
> Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
> the code when we next upgrade PDFBox.





[jira] [Resolved] (TIKA-757) Address TODOs when we upgrade to next POI release (3.8 beta 5)

2012-07-01 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-757.
-

   Resolution: Fixed
Fix Version/s: 1.2 (was: 1.3)

I'm going to mark this as closed, as we have solved all but one TODO (around
XSLF), and that probably warrants its own JIRA issue.

> Address TODOs when we upgrade to next POI release (3.8 beta 5)
> --
>
> Key: TIKA-757
> URL: https://issues.apache.org/jira/browse/TIKA-757
> Project: Tika
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: 1.2
>
>
> I'm opening a blanket issue to remind us all to address the TODOs in the 
> sources for when we upgrade to the next POI.
> I think this (a single blanket issue) is better than keeping separate issues 
> open even though they are technically fixed?
> For example, I've committed TIKA-753 (speedups for embedded office docs), yet 
> it included some TODOs for further speedups possible once we upgrade POI.  
> Rather than keeping TIKA-753 (and others like it) open, I think we should 
> resolve them and let this issue cover all the TODOs.





[jira] [Created] (TIKA-946) Improve how the PPTX parser uses XSLF from POI

2012-07-01 Thread Nick Burch (JIRA)
Nick Burch created TIKA-946:
---

Summary: Improve how the PPTX parser uses XSLF from POI
Key: TIKA-946
URL: https://issues.apache.org/jira/browse/TIKA-946
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.2
Reporter: Nick Burch


One last bit from TIKA-757 and TIKA-805 - the current way that PPTX files are 
parsed using XSLF from Apache POI has a couple of last remaining low level 
parts.

We should avoid the need to go from the usermodel XMLSlideShow to the low level 
XSLFSlideShow to do the text extraction (occurs in 
XSLFPowerPointExtractorDecorator).

We should also update the usermodel slide support to extract out the slide 
names from docProps/app.xml, so that these can be included in the text output 
easily (in XSLFPowerPointExtractor)





Re: JAX-RS overhead in tika-server

2012-07-01 Thread Nick Burch

On Sun, 1 Jul 2012, Mattmann, Chris A (388J) wrote:
> I also plan to spin a 1.2 release candidate at some point in the next
> week or so. I realize the metadata stuff isn't done yet, but it's better
> to simply release early and often, and then when stuff is ready it can
> be included. Releasing is a light-weight process in Tika, so it's no
> biggie.


A release does have one big effect though - it firms up the API and 
requires us to be backwards compatible with that going forward. It can be 
a big pain if an in-progress API is suddenly effectively frozen by the 
need to be compatible into the future...


Nick


[jira] [Resolved] (TIKA-513) Support of Deja Vu (DjVu) format

2012-07-01 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-513.


Resolution: Won't Fix

Resolving as Won't Fix until there's an upstream library that
[we can use|http://www.apache.org/legal/resolved.html].

Meanwhile, as noted by Nick, anyone can make a 3rd party parser plugin for Tika 
based on the existing libraries.

> Support of Deja Vu (DjVu) format
> 
>
> Key: TIKA-513
> URL: https://issues.apache.org/jira/browse/TIKA-513
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Oleg Tikhonov
>
> It might be great if Tika could provide such a parser. Any 
> suggestions/thoughts? 





Build failed in Jenkins: Tika-trunk #887

2012-07-01 Thread Apache Jenkins Server
See 

--
Failed to access build log

hudson.util.IOException2: remote file operation failed: /home/jenkins/jenkins-slave/workspace/Tika-trunk at hudson.remoting.Channel@36245edd:ubuntu1
at hudson.FilePath.act(FilePath.java:838)
at hudson.FilePath.act(FilePath.java:824)
at hudson.FilePath.toURI(FilePath.java:879)
at hudson.tasks.MailSender.createFailureMail(MailSender.java:278)
at hudson.tasks.MailSender.getMail(MailSender.java:153)
at hudson.tasks.MailSender.execute(MailSender.java:99)
at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.cleanUp(MavenModuleSetBuild.java:1012)
at hudson.model.Run.execute(Run.java:1504)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:477)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:239)
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
at hudson.remoting.Channel.send(Channel.java:475)
at hudson.remoting.Request.call(Request.java:110)
at hudson.remoting.Channel.call(Channel.java:646)
at hudson.FilePath.act(FilePath.java:831)
... 10 more
Caused by: java.io.IOException
at hudson.remoting.Channel.close(Channel.java:878)
at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
at hudson.remoting.PingThread.ping(PingThread.java:114)
at hudson.remoting.PingThread.run(PingThread.java:81)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1341183218470 hasn't completed at 1341183458470
... 2 more
Caused by: java.util.concurrent.TimeoutException
at hudson.remoting.Request$1.get(Request.java:249)
at hudson.remoting.Request$1.get(Request.java:184)
at hudson.remoting.FutureAdapter.get(FutureAdapter.java:59)
at hudson.remoting.PingThread.ping(PingThread.java:107)
... 1 more


Re: JAX-RS overhead in tika-server

2012-07-01 Thread Mattmann, Chris A (388J)
Hey Nick,

On Jul 1, 2012, at 2:52 PM, Nick Burch wrote:

> On Sun, 1 Jul 2012, Mattmann, Chris A (388J) wrote:
>> I also plan to spin a 1.2 release candidate at some point in the next week 
>> or so. I realize the metadata stuff isn't done yet, but it's better to 
>> simply release early and often, and then when stuff is ready it can be 
>> included. Releasing is a light-weight process in Tika, so it's no biggie.
> 
> A release does have one big effect though - it firms up the API and requires 
> us to be backwards compatible with that going forward.

What API changes do you envision as part of the metadata stuff? So far, I've
seen metadata key updates, property class stuff, and a new proposed module
(tika-xmp) which is great and which I need to review btw :)

As for back compat, that's definitely high on the priority list, but so is
releasing, and we need to be mindful not to let ongoing changes freeze our
ability to simply release the software.

> It can be a big pain if an in-progress API is suddenly effectively frozen by 
> the need to be compatible into the future...

Agreed -- so, what do you think? How much longer do we need to wrap up the API
changes or whatever is going on with the metadata stuff? A week? 2 weeks?

There's nothing super pressing for me to do an RC, I just brought this up a
month ago [1] and received a similar answer, so I'd like us to be able to make
a release soon. It would be great to roll a 1.2 RC #1 sometime in the next
2 weeks.

Thoughts?

Cheers,
Chris

[1] http://s.apache.org/MFq




Re: JAX-RS overhead in tika-server

2012-07-01 Thread Nick Burch

On Sun, 1 Jul 2012, Mattmann, Chris A (388J) wrote:
>> It can be a big pain if an in-progress API is suddenly effectively
>> frozen by the need to be compatible into the future...
>
> Agreed -- so, what do you think? How much longer do we need to wrap up
> the API changes or whatever going on with the metadata stuff? A week? 2
> weeks?


Dunno, all my bits are in

Ray? Jörg? Are there any outstanding bits left? Anything you need more 
feedback/review on, before Chris does the release?


Nick

Re: JAX-RS overhead in tika-server

2012-07-01 Thread Ray Gauss II
Is there any consensus on TIKA-930 [1]?  I don't want to wipe out properties 
that others feel are critical or include non-ratified standards if that's 
outside of policy.

If there's no objection to what's outlined in that issue I can commit those 
changes tomorrow morning (GMT -4).

Regards,

Ray


[1] https://issues.apache.org/jira/browse/TIKA-930


On Jul 1, 2012, at 7:22 PM, Nick Burch wrote:

> On Sun, 1 Jul 2012, Mattmann, Chris A (388J) wrote:
>>> It can be a big pain if an in-progress API is suddenly effectively frozen 
>>> by the need to be compatible into the future...
>> 
>> Agreed -- so, what do you think? How much longer do we need to wrap up the 
>> API changes or whatever going on with the metadata stuff? A week? 2 weeks?
> 
> Dunno, all my bits are in
> 
> Ray? Jörg? Are there any outstanding bits left? Anything you need more 
> feedback/review on, before Chris does the release?
> 
> Nick



[jira] [Commented] (TIKA-930) Consolidation of Some Tika Core Properties

2012-07-01 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404869#comment-13404869
 ] 

Nick Burch commented on TIKA-930:
-

In terms of DCMI, we tend to take a pragmatic view on metadata standards. If
it's good enough to be useful, and it won't confuse, use it! Try to keep things
simple though, so don't include a whole standard just for the sake of it... But
if it provides value then go for it.

> Consolidation of Some Tika Core Properties
> --
>
> Key: TIKA-930
> URL: https://issues.apache.org/jira/browse/TIKA-930
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Affects Versions: 1.2
> Reporter: Ray Gauss II
>
> There are a few properties in TikaCoreProperties which overlap and I think we 
> should minimize ambiguity by consolidating them into a single composite 
> property with the clearest name, the most general specification referenced as 
> its primary property, and the others and deprecated strings as its 
> secondaries.
> Here's the proposed pseudo-code for the changes:
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, MSOffice.KEYWORDS, Metadata.SUBJECT }
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE, MSOffice.CREATION_DATE, Metadata.DATE }
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE, MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
> and an example of the Java changes:
> {code:title=TikaCoreProperties.java *Before*}
> /**
>  * @see DublinCore#SUBJECT
>  */
> public static final Property SUBJECT =
>     Property.composite(DublinCore.SUBJECT,
>         new Property[] { Property.internalText(Metadata.SUBJECT) });
>
> /**
>  * @see Office#KEYWORDS
>  */
> public static final Property KEYWORDS =
>     Property.composite(Office.KEYWORDS,
>         new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
> /**
>  * @see DublinCore#SUBJECT
>  * @see Office#KEYWORDS
>  */
> public static final Property KEYWORDS =
>     Property.composite(DublinCore.SUBJECT,
>         new Property[] {
>             Office.KEYWORDS,
>             Property.internalTextBag(MSOffice.KEYWORDS),
>             Property.internalText(Metadata.SUBJECT)
>         });
> {code}
> Since this would require a bit of refactoring for parsers that use the 
> properties being removed I thought it best to get some feedback before 
> working up a full patch.
> Does this seem like a reasonable approach?
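
To make the proposal above concrete, here is a toy model of a composite
property. It assumes that setting a composite fans the value out to its
primary key and to every secondary key, so code reading the older keys keeps
working. That semantics is an assumption for illustration and is not confirmed
in this thread. The Composite class, the set helper and the key strings are
hypothetical stand-ins, not Tika's Property or Metadata API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompositePropertySketch {
    /** Hypothetical stand-in for a composite property: one primary key plus secondary keys. */
    static final class Composite {
        final String primary;
        final List<String> secondaries;

        Composite(String primary, String... secondaries) {
            this.primary = primary;
            this.secondaries = Arrays.asList(secondaries);
        }
    }

    // Placeholder key strings standing in for DublinCore.SUBJECT, Office.KEYWORDS,
    // MSOffice.KEYWORDS and Metadata.SUBJECT; the real key names may differ.
    static final Composite KEYWORDS =
        new Composite("dc:subject", "meta:keyword", "Keywords", "subject");

    /** Writes the value under the primary key and under every secondary key. */
    static void set(Map<String, String> metadata, Composite property, String value) {
        metadata.put(property.primary, value);
        for (String key : property.secondaries) {
            metadata.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        set(metadata, KEYWORDS, "tika, parsing");
        // Both the consolidated key and a legacy key return the same value.
        System.out.println(metadata.get("dc:subject"));
        System.out.println(metadata.get("Keywords"));
    }
}
```

Under this model a parser would write the value once through the consolidated
property, and readers of the deprecated keys would still see it.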





Re: JAX-RS overhead in tika-server

2012-07-01 Thread Mattmann, Chris A (388J)
Hi Ray,

I'd say what you proposed is fine by me. Doesn't seem to have generated
objections so please move forward.

Cheers,
Chris

On Jul 1, 2012, at 4:56 PM, Ray Gauss II wrote:

> Is there any consensus on TIKA-930 [1]?  I don't want to wipe out properties 
> that others feel are critical or include non-ratified standards if that's 
> outside of policy.
> 
> If there's no objection to what's outlined in that issue I can commit those 
> changes tomorrow morning (GMT -4).
> 
> Regards,
> 
> Ray
> 
> 
> [1] https://issues.apache.org/jira/browse/TIKA-930
> 
> 
> On Jul 1, 2012, at 7:22 PM, Nick Burch wrote:
> 
>> On Sun, 1 Jul 2012, Mattmann, Chris A (388J) wrote:
>>>> It can be a big pain if an in-progress API is suddenly effectively frozen
>>>> by the need to be compatible into the future...
>>> 
>>> Agreed -- so, what do you think? How much longer do we need to wrap up the 
>>> API changes or whatever going on with the metadata stuff? A week? 2 weeks?
>> 
>> Dunno, all my bits are in
>> 
>> Ray? Jörg? Are there any outstanding bits left? Anything you need more 
>> feedback/review on, before Chris does the release?
>> 
>> Nick
> 

