date:20141024

[jira] [Created] (TIKA-1456) Visual Sentiment API parser

2014-10-24 Thread Chris A. Mattmann (JIRA)

Chris A. Mattmann created TIKA-1456:
---

 Summary: Visual Sentiment API parser
 Key: TIKA-1456
 URL: https://issues.apache.org/jira/browse/TIKA-1456
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7


Integrate the Visual Sentibank API as a parser for images. We can use Aperture 
from CMU; it is released under the MIT license:

https://github.com/d8w/aperture



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173983#comment-14173983
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 10/24/14 11:02 AM:
--

After some more research, I was able to decode 5 more files (the cause was not 
the LZW filter, see PDFBOX-2296, but I fixed this only in 2.0). However, 7 other 
files are really corrupt; portions of the files are blank when shown in AR:

115/115269.pdf
211/211876.pdf
268/268346.pdf
389/389474.pdf
443/443752.pdf
698/698813.pdf
846/846759.pdf


was (Author: tilman):
After some more research, I was able to decode 5 more files (the cause was not 
the LZW filter, see ). However 7 other files are really corrupt, portions of 
the files are blank when shown in AR:

115/115269.pdf
211/211876.pdf
268/268346.pdf
389/389474.pdf
443/443752.pdf
698/698813.pdf
846/846759.pdf

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.




[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui

2014-10-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182674#comment-14182674
 ] 

Tim Allison commented on TIKA-1451:
---

Thank you, Chris.  The credit goes to [~jukkaz] and [~gagravarr] for the 
recursive parser example!  I'm grateful to now have an out-of-the-box format 
(w/ serializers and deserializers) that captures embedded document metadata.

As I was working on this, I was starting to think that we might want to add 
some tika: prefixed properties to TikaCoreProperties to capture metadata 
generated during processing, such as: tika:content, tika:parse_time_millis, 
tika:exception, tika:parsed_by (instead of our current X-Parsed-By).  In 
effect, move the RecursiveParserWrapper properties to TikaCoreProperties and 
add some others as necessary.
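
A rough sketch of the idea above, for illustration only. It assumes plain 
string keys in a map; the class ProcessingMetadataSketch and its record() 
helper are hypothetical and are not part of TikaCoreProperties or any 
existing Tika API. Only the key names come from the suggestion above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika's actual API.
class ProcessingMetadataSketch {
    // Key names taken from the proposal; the shared prefix marks
    // metadata that is generated during processing.
    static final String PREFIX = "tika:";
    static final String CONTENT = PREFIX + "content";
    static final String PARSE_TIME_MILLIS = PREFIX + "parse_time_millis";
    static final String EXCEPTION = PREFIX + "exception";
    static final String PARSED_BY = PREFIX + "parsed_by";

    // Collects processing-time metadata under the tika: prefix.
    // The exception key is only set when parsing failed.
    static Map<String, String> record(String content, long parseTimeMillis,
                                      String parsedBy, Exception failure) {
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put(CONTENT, content);
        metadata.put(PARSE_TIME_MILLIS, Long.toString(parseTimeMillis));
        metadata.put(PARSED_BY, parsedBy);
        if (failure != null) {
            metadata.put(EXCEPTION, failure.toString());
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(record("hello", 12L,
                "org.apache.tika.parser.DefaultParser", null));
    }
}
```

Keeping the generated keys under one prefix would make them easy to tell 
apart from metadata that comes from the document itself.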



 Add Recursive Metadata Parser Wrapper output to tika-app and gui
 

 Key: TIKA-1451
 URL: https://issues.apache.org/jira/browse/TIKA-1451
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.7

 Attachments: integrate_recursive_metadata_wrapper.patch


 It would be helpful to expose the output of the recursive metadata parser 
 wrapper in the gui and in the command line for tika-app.




RE: import (re)ordering?

2014-10-24 Thread Allison, Timothy B.

Y, I'll try to be more careful about separating out formatting from content in 
the future (apologies for TIKA-1451).  What I didn't want to do was start an 
IDE war if others have different settings that will order imports in a 
different way.

Thank you!

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Friday, October 24, 2014 1:53 AM
To: dev@tika.apache.org
Subject: Re: import (re)ordering?

Hey Tim,

No big objections from me, but it will dilute things so glad we
have it noted if it happens.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Allison, Timothy B. talli...@mitre.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, October 21, 2014 at 1:59 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: import (re)ordering?

All,
  I have IntelliJ set to order imports by javax, java, then other.  I
think this is the most common pattern in Tika.  Is it ok if I make these
(meaningless/formatting) changes when I commit other changes?
  Thank you.

   Best,

  Tim
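
The ordering described above (javax, then java, then everything else) can be 
sketched as a comparator. This is only an illustration: the class 
ImportOrderSketch and its methods are hypothetical, and the alphabetical 
ordering within each group is an assumption that the message does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the import grouping, not an IDE setting.
class ImportOrderSketch {
    // Group rank: javax first, then java, then all other packages.
    static int rank(String importName) {
        if (importName.startsWith("javax.")) {
            return 0;
        }
        if (importName.startsWith("java.")) {
            return 1;
        }
        return 2;
    }

    // Returns a sorted copy: by group rank, then alphabetically (assumed).
    static List<String> sortImports(List<String> imports) {
        List<String> sorted = new ArrayList<String>(imports);
        Collections.sort(sorted, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                int byGroup = Integer.compare(rank(a), rank(b));
                return byGroup != 0 ? byGroup : a.compareTo(b);
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortImports(Arrays.asList(
                "org.apache.tika.Tika",
                "java.util.List",
                "javax.xml.namespace.QName",
                "java.io.File")));
    }
}
```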

RE: import (re)ordering?

2014-10-24 Thread Tyler Palsulich

Thanks, Tim. I'll be sure to update my settings for this. On a similar
note, can we standardize the formatting of the pom.xml files? Right now,
they are pretty irregular.

Tyler

RE: import (re)ordering?

2014-10-24 Thread Nick Burch


On Fri, 24 Oct 2014, Allison, Timothy B. wrote:
Y, I'll try to be more careful about separating out formatting from 
content in the future (apologies for TIKA-1451).  What I didn't want to 
do was start an IDE war if others have different settings that will 
order imports in a different way.


I'd say just pick something sensible, and then document it for everyone in
http://tika.apache.org/contribute.html so it's clear what to do!

Nick

[jira] [Resolved] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-24 Thread Tyler Palsulich (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1422.
---
Resolution: Fixed

Fixed in r1634094. Skip over the two failing checks if Tesseract is installed.
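
For readers who want to see what such a guard might look like, here is a 
minimal sketch. It is not the code committed in r1634094: the class 
ExternalToolCheckSketch, the canRun() helper, and the use of "tesseract -v" 
as a probe are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not the actual change in RFC822ParserTest.
class ExternalToolCheckSketch {
    // Returns true if the command could be started and exited with code 0.
    static boolean canRun(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            // Drain the output so the child cannot block on a full pipe.
            InputStream out = process.getInputStream();
            byte[] buffer = new byte[1024];
            while (out.read(buffer) != -1) {
                // discard
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            // Typically means the program was not found.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: "tesseract -v" exits with 0 when Tesseract is installed.
        if (canRun("tesseract", "-v")) {
            System.out.println("Tesseract found: skip the invocation-count checks");
        } else {
            System.out.println("Tesseract not found: run all checks");
        }
    }
}
```

In a test, a check like this could wrap only the assertions whose counts 
change when the OCR parser contributes extra elements, so the rest of the 
test still runs on every machine.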

 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
 TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
 (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: 
 amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more 
 tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
  
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec  
 FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.152 sec   FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
 "http://www.w3.org/1999/xhtml",
 "div",
 "div",
 isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183103#comment-14183103
 ] 

Hudson commented on TIKA-1422:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #282 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/282/])
TIKA-1422. Skip checking the number of some handler invocations in the 
RFC822ParserTest if Tesseract is installed. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1634094)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java



[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183161#comment-14183161
 ] 

Hudson commented on TIKA-1422:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #262 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/262/])
TIKA-1422. Skip checking the number of some handler invocations in the 
RFC822ParserTest if Tesseract is installed. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1634094)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java



Re: 1.7 release?

2014-10-24 Thread Oleg Tikhonov

Hi Tyler,
don't mention it.

Cheers,
Oleg
On Oct 24, 2014 8:02 PM, Tyler Palsulich tpalsul...@gmail.com wrote:

 Thank you for the help, Oleg! I just resolved TIKA-1422. So, are there any
 other issues anyone would like to resolve before a new release?

 Thanks,
 Tyler

 On Tue, Oct 21, 2014 at 2:42 AM, Oleg Tikhonov olegtikho...@gmail.com
 wrote:

  Sorry!!!
 
  On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
   Thanks Oleg, will try tomorrow, Los Angeles time!
  
   -Original Message-
   From: Oleg Tikhonov o...@apache.org
   Reply-To: dev@tika.apache.org dev@tika.apache.org
   Date: Monday, October 20, 2014 at 11:20 PM
   To: dev@tika.apache.org dev@tika.apache.org
   Subject: Re: 1.7 release?
  
   Please give the newest patch a try.
   Cheers,
   Oleg
   
   On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov 
 olegtikho...@gmail.com
   wrote:
   
Taken. Thanks. in progress ...
   
On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:
   
Trunk is the current checkout/branch:
   
http://svn.apache.org/repos/asf/tika/trunk
   
   
-Original Message-
From: Oleg Tikhonov olegtikho...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, October 20, 2014 at 10:16 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: 1.7 release?
   
Hi, I can try this one.
What is trunk?


Thanks,
Oleg

On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hmm any idea why this is failing on Windows? Tyler P. and
 I were talking the other day - maybe we shouldn't run the
 tests from TIKA-1422 unless Tesseract is installed? Thoughts?


 -Original Message-
 From: Hong-Thai Nguyen thaicha...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Thursday, October 16, 2014 at 2:03 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Hi Andrzej,
 
 We are impatient for the 1.7 release too.
 I'm having a problem compiling TIKA-1422 on my side. If anyone can build
 successfully on Windows, I have no objection to releasing 1.7.
 
 Thanks,
 
 On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki 
  a...@getopt.org
wrote:
 
  Hi,
 
  Any news on the 1.7 release? Or at least a 1.6.1 release that includes
  the fix for broken ODF parsing...
 
  ---
  Best regards,
 
  Andrzej Bialecki
 
 
 
 
 --
 --
 Hong-Thai

[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183338#comment-14183338
 ] 

Tim Allison commented on TIKA-1442:
---

Hmmm...I can't explain those files, and I recently did some cleanup so I don't 
have the original 1.8.6 output.  When I recently reran with the latest Tika 
trunk, I got the same number of metadata values for those files with PDFBox 
1.8.6 and 1.8.8-SNAPSHOT (vintage 2 days ago).  All the problematic files have 
attachments. 

I wonder if recent work on the OCR parser could explain this. [~tpalsulich], 
over the last few weeks, was there a time when we were extracting metadata from 
images, but now we're not?

For 224644.pdf, for example, there doesn't seem to be much metadata for the 
jpgs now...a total of 40 metadata values for the full document.  Last week, 
when I ran Tika, there were 160 metadata values.
{noformat}
{Content-Length:5970,Content-Type:image/jpeg,X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser],embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg,tika:embedded_resource_path:224644.pdf/arrow.jpg},{Content-Length:5970,Content-Type:image/jpeg,X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser],embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg,tika:embedded_resource_path:224644.pdf/arrow.jpg}]
{noformat} 
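
To make the "number of metadata values" comparison concrete, here is a small 
sketch of one way such a total could be computed. It is an assumption about 
the counting method, not the tooling actually used for these runs: the class 
MetadataValueCountSketch is hypothetical, and it counts every value of a 
multi-valued key (such as X-Parsed-By) separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one map per document or embedded resource,
// each key holding one or more values.
class MetadataValueCountSketch {
    // Sums the number of values over all keys of all metadata maps.
    static int countValues(List<Map<String, List<String>>> metadataList) {
        int total = 0;
        for (Map<String, List<String>> metadata : metadataList) {
            for (List<String> values : metadata.values()) {
                total += values.size();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One entry modeled on the arrow.jpg attachment shown above.
        Map<String, List<String>> jpg = new LinkedHashMap<String, List<String>>();
        jpg.put("Content-Length", Arrays.asList("5970"));
        jpg.put("Content-Type", Arrays.asList("image/jpeg"));
        jpg.put("X-Parsed-By", Arrays.asList(
                "org.apache.tika.parser.DefaultParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser"));
        jpg.put("embeddedResourceType", Arrays.asList("ATTACHMENT"));
        jpg.put("resourceName", Arrays.asList("arrow.jpg"));
        jpg.put("tika:embedded_resource_path", Arrays.asList("224644.pdf/arrow.jpg"));

        List<Map<String, List<String>>> docs = new ArrayList<Map<String, List<String>>>();
        docs.add(jpg);
        // Under this counting rule the single entry contributes 7 values.
        System.out.println(countValues(docs));
    }
}
```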

In short, [~tilman], I don't think this is a PDFBox issue.


[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183338#comment-14183338
 ] 

Tim Allison edited comment on TIKA-1442 at 10/24/14 7:22 PM:
-

Hmmm...I can't explain those files, and I recently did some cleanup so I don't 
have the original 1.8.6 output.  When I recently reran with the latest Tika 
trunk, I got the same number of metadata values for those files with PDFBox 
1.8.6 and 1.8.8-SNAPSHOT (vintage 2 days ago).  All the problematic files have 
attachments. 

I wonder if recent work on the OCR parser could explain this. [~tpalsulich], 
over the last few weeks, was there a time when we were extracting metadata from 
images, but now we're not?

For 224644.pdf, for example, there doesn't seem to be much metadata for the 
jpgs now...a total of 40 metadata values for the full document.  Last week, 
when I ran Tika, there were 160 metadata values.
{noformat}
{Content-Length:5970,Content-Type:image/jpeg,
X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser],
embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg,
tika:embedded_resource_path:224644.pdf/arrow.jpg},
{Content-Length:5970,Content-Type:image/jpeg,
X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser],
embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg,
tika:embedded_resource_path:224644.pdf/arrow.jpg}]
{noformat} 

In short, [~tilman], I don't think this is a PDFBox issue.




[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183383#comment-14183383
 ] 

Tyler Palsulich commented on TIKA-1442:
---

Yes, unfortunately. Please see TIKA-1445. [~mattmann], any thoughts?


[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183383#comment-14183383
 ] 

Tyler Palsulich edited comment on TIKA-1442 at 10/24/14 8:05 PM:
-

Yes, unfortunately. Please see TIKA-1445. [~chrismattmann], any thoughts?


was (Author: tpalsulich):
Yes, unfortunately. Please see TIKA-1445. [~mattmann], any thoughts?


RE: 1.7 release?

2014-10-24 Thread Allison, Timothy B.

Sorry for coming late to the game on the implications of TIKA-1445.  I don't 
want to hold up the release of 1.7.  

However, would it be possible to return to the legacy default behavior of 
extracting metadata from images?  

We can then document on the OCR parser page on the wiki that you need to 
install Tesseract _and_ make a change in the parser/mime config file. If you 
want this new capability, it will take a small bit of work until we solve 
TIKA-1445.

I worry that the current behavior of 1.7 would be surprising to most non-dev 
users (well, even to at least one dev :) ).

Cheers,
  
  Tim


From: Oleg Tikhonov [olegtikho...@gmail.com]
Sent: Friday, October 24, 2014 2:24 PM
To: dev@tika.apache.org
Subject: Re: 1.7 release?

Hi Tyler,
don't mention it.

Cheers,
Oleg
On Oct 24, 2014 8:02 PM, Tyler Palsulich tpalsul...@gmail.com wrote:

 Thank you for the help, Oleg! I just resolved TIKA-1422. So, are there any
 other issues anyone would like to resolve before a new release?

 Thanks,
 Tyler

 On Tue, Oct 21, 2014 at 2:42 AM, Oleg Tikhonov olegtikho...@gmail.com
 wrote:

  Sorry!!!
 
  On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
   Thanks Oleg, will try tomorrow for me Los angeles time!
  
   ++
   Chris Mattmann, Ph.D.
   Chief Architect
   Instrument Software and Science Data Systems Section (398)
   NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
   Office: 168-519, Mailstop: 168-527
   Email: chris.a.mattm...@nasa.gov
   WWW:  http://sunset.usc.edu/~mattmann/
   ++
   Adjunct Associate Professor, Computer Science Department
   University of Southern California, Los Angeles, CA 90089 USA
   ++
  
  
  
  
  
  
   -Original Message-
   From: Oleg Tikhonov o...@apache.org
   Reply-To: dev@tika.apache.org dev@tika.apache.org
   Date: Monday, October 20, 2014 at 11:20 PM
   To: dev@tika.apache.org dev@tika.apache.org
   Subject: Re: 1.7 release?
  
   Please take a try with newest patch.
   Cheers,
   Oleg
   
   On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov 
 olegtikho...@gmail.com
   wrote:
   
Taken. Thanks. in progress ...
   
On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:
   
Trunk is the current checkout/branch:
   
http://svn.apache.org/repos/asf/tika/trunk
   
   
   
   
   
   
   
   
-Original Message-
From: Oleg Tikhonov olegtikho...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Monday, October 20, 2014 at 10:16 PM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: 1.7 release?
   
Hi, I can try this on.
What is a trunk?


Thanks,
Oleg

On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hmm any idea why this is failing on Windows? Tyler P. and
 I were talking the other day - maybe we shouldn't run the
 tests from TIKA-1422 unless Tesseract is installed? Thoughts?








 -Original Message-
 From: Hong-Thai Nguyen thaicha...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Thursday, October 16, 2014 at 2:03 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Hi Andrzej,
 
 We are impatient for the 1.7 release too.
 I'm having a problem compiling TIKA-1422 on my side. If anyone can build
 successfully

[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-24 Thread Tyler Palsulich (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183873#comment-14183873
]

Tyler Palsulich commented on TIKA-1445:
---

I've been trying my hand at this for some time now. An idea I had was to create a
temporary file from the input InputStream, then create new input streams from
that file to run each Parser on.

But, before this OCR Parser, we only ran one Parser on the image, anyway. So,
what if there was a way to get the second best default parser for the image?
An option is to hard code the exact working Parsers. But, in my opinion, we
should load them dynamically. So, that would require getting a
{{List<Parser>}}, instead of just the best Parser for a given MediaType
({{CompositeParser.getParsers(ParseContext)}}).

If we only chose the second best Parser, we wouldn't have to merge the Metadata
results, since the OCRParser doesn't add Metadata. But, it might call the
ContentHandler.
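
A minimal sketch of the temporary-file idea, using only the JDK. The names `StreamConsumer` and `SpoolAndReparse` are hypothetical stand-ins for a Tika Parser and its caller; nothing here is Tika's actual API. It only illustrates copying the input to disk once and handing each consumer its own fresh stream:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface StreamConsumer {
    String consume(InputStream in) throws IOException;
}

final class SpoolAndReparse {
    // Copies the one-shot input stream to a temp file, then opens a fresh
    // stream on that file for every consumer, so each one reads from byte 0.
    static List<String> runAll(InputStream input, List<StreamConsumer> consumers)
            throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        try {
            Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING);
            List<String> results = new ArrayList<>();
            for (StreamConsumer consumer : consumers) {
                try (InputStream fresh = Files.newInputStream(tmp)) {
                    results.add(consumer.consume(fresh));
                }
            }
            return results;
        } finally {
            // Always remove the spooled copy, even if a consumer fails.
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is one extra copy of the image on disk; the benefit is that no consumer has to care whether an earlier one exhausted the stream.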

Figure out how to add Image metadata extraction to Tesseract parser
---

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7

Attachments: TIKA-1445.Mattmann.101214.patch.txt

Now that Tesseract is the default image parser in Tika for many image types,
consider how to add back in the metadata extraction capabilities by the other
Image parsers.


[jira] [Updated] (TIKA-774) ExifTool Parser

2014-10-24 Thread Chris A. Mattmann (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-774:
---
Fix Version/s: (was: 1.7)
1.8

- push to 1.8

ExifTool Parser
---

Key: TIKA-774
URL: https://issues.apache.org/jira/browse/TIKA-774
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.0
Environment: Requires ExifTool to be installed
(http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
Labels: features, newbie, patch,
Fix For: 1.8

Attachments: testJPEG_IPTC_EXT.jpg,
tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt

Adds an external parser that calls ExifTool to extract extended metadata
fields from images and other content types.
In the core project:
An ExifTool interface is added which contains Property objects that define
the metadata fields available.
An additional Property constructor for internalTextBag type.
In the parsers project:
An ExiftoolMetadataExtractor is added which does the work of calling ExifTool
on the command line and mapping the response to tika metadata fields. This
extractor could be called instead of or in addition to the existing
ImageMetadataExtractor and JempboxExtractor under TiffParser and/or
JpegParser but those have not been changed at this time.
An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool
metadata fields to existing tika and Drew Noakes metadata fields if enabled.
An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag
implementations in XML files.
An ExifToolParserTest is added which tests several expected XMP and IPTC
metadata values in testJPEG_IPTC_EXT.jpg.


[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1208:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Migrate Any23 mime contributions to Tika
 

 Key: TIKA-1208
 URL: https://issues.apache.org/jira/browse/TIKA-1208
 Project: Tika
  Issue Type: Sub-task
  Components: mime
Reporter: Lewis John McGibbney
 Fix For: 1.8

 Attachments: TIKA-1208.patch


 We begin with one of the most obvious areas in which there
 is overlap.
 In short, the appeal of this package is the addition of detection 
 for the following types:
  - text/n3
  - text/rdf+n3
  - application/n3
  - text/x-nquads
  - text/rdf+nq
  - text/nq
  - application/nq
  - text/turtle
  - application/x-turtle
  - application/turtle
  - application/trix
  
 Therefore, although both Tika and Any23 perform Mimetype-related tasks,
 there is a contribution to be made. This involves the transferral of
 code pertaining to pattern recognition, Mimetype XML definitions within 
 tika-mimetypes.xml and a Purifier implementation that removes any 
 blank characters at the start of a file that might 
 prevent its MIME Type detection.  




[jira] [Updated] (TIKA-1220) Parser implementation for IFC files

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1220:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Parser implementation for IFC files
 

 Key: TIKA-1220
 URL: https://issues.apache.org/jira/browse/TIKA-1220
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.8

 Attachments: 2012-03-23-Duplex-Programming.ifc


 The Industry Foundation Classes (IFC) [0] data model is intended to describe 
 building and construction industry data. For the sake of argument, it can be 
 considered as a more intelligent successor to the .dwg data models used 
 within CAD models.
 I've tracked down a potential 3rd party library [1] which we may be able to 
 wrap and use within Tika however the provided software packages are licensed 
 under: http://creativecommons.org/licenses/by-nc-sa/3.0/de/ so I am currently 
 over on legal-discuss@ in an attempt to see if it is possible to wrap some 
 code and contribute it to tika-parsers.
 When I get feedback from legal-discuss, and if this is a go-ahead, I'll need 
 to help the developers package the code as a Maven artifact(s), then I will 
 progress with writing the implementation.  
 [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes
 [1] http://www.ifctoolsproject.com/




[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-891:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
 Fix For: 1.8


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 




[jira] [Updated] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1238:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Update OutlookExtractor to handle codepage identification more rigorously
 -

 Key: TIKA-1238
 URL: https://issues.apache.org/jira/browse/TIKA-1238
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.8


 Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has 
 added more robust capabilities for identifying codepages in Outlook .msg 
 files.  As a first step to integrating those improvements, I'll copy and 
 paste some of POI's code into OutlookExtractor.  As a second step, I'll 
 expose more of HSMF's capabilities within POI and then factor out the 
 duplicate code in Tika.




[jira] [Updated] (TIKA-1324) Use a common path for the Tika Server unpacker resources

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1324:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Use a common path for the Tika Server unpacker resources
 

 Key: TIKA-1324
 URL: https://issues.apache.org/jira/browse/TIKA-1324
 Project: Tika
  Issue Type: Improvement
  Components: server
Affects Versions: 1.5
Reporter: Nick Burch
 Fix For: 1.8


 Currently, the two different methods of the Tika Server unpacker endpoint 
 don't share a common url prefix, which causes them to clash with the new 
 welcome endpoint.
 As discussed on the mailing list, we should change these two to have a common 
 prefix, so that the urls are then:
  * /unpack/{id}
  * /unpack/all/{id}
 After making the change, the changelog and release notes need to be updated 
 for it, as it is a breaking change for the (handful of) users of the endpoint
 This will help with TIKA-1269




[jira] [Updated] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1273:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 old tika-server jar artifact contains no manifest so not able to invoke from 
 shell
 --

 Key: TIKA-1273
 URL: https://issues.apache.org/jira/browse/TIKA-1273
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.8


 I've never ever used the old tika-server artifact which is generated when one 
 installs the server module. It needs to contain a manifest otherwise it 
 cannot be invoked from the shell.




[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1445:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1445.Mattmann.101214.patch.txt


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.




[jira] [Updated] (TIKA-1384) Use tika-parent dependency management for common dependencies

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1384:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Use tika-parent dependency management for common dependencies
 -

 Key: TIKA-1384
 URL: https://issues.apache.org/jira/browse/TIKA-1384
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor
 Fix For: 1.8


 If we list a dependency in the dependencyManagement section of the 
 tika-parent pom.xml, we can then include that dependency in a child module 
 without specifying a version.
 For example, I updated the junit dependencies yesterday: 
 https://github.com/apache/tika/commit/2fec4c61267ed2c465e7411d50fbf7e9841523d5
 By using dependencyManagement, we can update the dependency version for all 
 modules at once, rather than have different versions in different modules, 
 like it was for junit.




[jira] [Updated] (TIKA-985) Support for HTML5 elements

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-985:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Support for HTML5 elements
 --

 Key: TIKA-985
 URL: https://issues.apache.org/jira/browse/TIKA-985
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.2
Reporter: Markus Jelsma
 Fix For: 1.8

 Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
 TIKA-985-1.3-3.patch, TIKA-985-1.5.patch


 TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
 section). This prevents some custom ContentHandlers from reading expected 
 elements and/or attributes.




[jira] [Updated] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder

2014-10-24 Thread Chris A. Mattmann (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-1343:

Fix Version/s: (was: 1.7)
1.8

- push to 1.8

Create a Tika Translator implementation that uses JoshuaDecoder
---

Key: TIKA-1343
URL: https://issues.apache.org/jira/browse/TIKA-1343
Project: Tika
Issue Type: Bug
Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8

The Joshua Decoder toolkit is a BSD licensed Java-based statistical machine
translation system hosted at Github:
http://joshua-decoder.org/
Joshua takes in corpuses and trains models that can then be used to do
language translation. Currently there is support for e.g., Spanish-English,
Indian dialects-English, Chinese-English, and a few others.
https://github.com/joshua-decoder/joshua/
It would be nice to build a Tika Translator on top of Joshua. There are of
course several issues with this:
* the models are huge - so we'll need a separate package or Maven module,
maybe tika-translate-joshua or something to release the models and we'll need
to build the models. I just went through the process of building the
Spanish-English one, and it still needs to be rebuilt b/c I did it wrong,
but it took over a day
* there is a configuration for Joshua, and so we need some way of passing
that config into the Translator. Not sure of the best way to do this.
* Joshua isn't in the Central repository. I've started a discussion on the
Joshua lists about this:
https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual
install into my Maven repo for brave souls out there that want to try it.


[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1295:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Make some Dublin Core items multi-valued
 

 Key: TIKA-1295
 URL: https://issues.apache.org/jira/browse/TIKA-1295
 Project: Tika
  Issue Type: Bug
  Components: metadata
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.8


 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
 dc:title, dc:description and dc:rights should allow multiple values because 
 of language alternatives.  Unless anyone objects in the next few days, I'll 
 switch those to Property.toInternalTextBag() from Property.toInternalText().  
 I'll also modify PDFParser to extract dc:rights.




[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1059:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
 --

 Key: TIKA-1059
 URL: https://issues.apache.org/jira/browse/TIKA-1059
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
 Fix For: 1.8


 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
 {{InterruptedException}} and ignore it.
 The methods should either call {{interrupt()}} on the current thread or 
 re-throw the exception, possibly wrapped in a {{TikaException}}.
 See TIKA-775 for a previous discussion.
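
The two options above can be sketched with plain JDK code. `BlockingCall` and `WrappedException` are hypothetical names used only for illustration (the latter stands in for a {{TikaException}}); this is not the code in {{ExternalParser}}:

```java
final class InterruptHandling {
    // Hypothetical stand-in for any blocking step, e.g. waiting on a process.
    interface BlockingCall {
        void run() throws InterruptedException;
    }

    // Hypothetical checked exception standing in for TikaException.
    static final class WrappedException extends Exception {
        WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Option 1: swallow the exception but restore the interrupt flag,
    // so code further up the stack can still observe the interruption.
    static boolean runRestoringInterrupt(BlockingCall call) {
        try {
            call.run();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Option 2: restore the flag and re-throw, wrapped in a checked exception.
    static void runOrWrap(BlockingCall call) throws WrappedException {
        try {
            call.run();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new WrappedException("interrupted while waiting", e);
        }
    }
}
```

Either way the interruption is no longer lost, which is the point of the issue; which option fits best depends on whether callers expect an exception.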




[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1417:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Create Extract Embedded Images from PDFs Example
 

 Key: TIKA-1417
 URL: https://issues.apache.org/jira/browse/TIKA-1417
 Project: Tika
  Issue Type: Improvement
  Components: example
Reporter: Tyler Palsulich
Priority: Minor
 Fix For: 1.8


 Users commonly want to turn on extraction of images embedded in PDFs (e.g. 
 TIKA-1414). Tika has the capability, but it's not clear how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1442:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.




[jira] [Updated] (TIKA-1328) Translate Metadata and Content

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1328:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Translate Metadata and Content
 --

 Key: TIKA-1328
 URL: https://issues.apache.org/jira/browse/TIKA-1328
 Project: Tika
  Issue Type: New Feature
  Components: translation
Reporter: Tyler Palsulich
 Fix For: 1.8


 Right now, Translation is only done on Strings. Ideally, users would be able 
 to turn on translation while parsing. I can think of a couple options:
 - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
 it, then translate the content.
 - Make a Context switch. When true, translate the content regardless of the 
 parser used. I'm not sure the best way to go about this method, but I prefer 
 it over another Parser.
 Regardless, we need a black or white list for translation. I think black list 
 would be the way to go -- which fields should not be translated (dates, 
 versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
 other open source translation libraries? If we were really lucky, it wouldn't 
 depend on an online service.
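
To make the black list option concrete, here is a rough sketch in plain Java. `SelectiveTranslation` is a hypothetical helper, the metadata is modelled as a simple map rather than Tika's Metadata class, and the translator is passed in as a function so no particular translation service is assumed:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

final class SelectiveTranslation {
    // Translates every value except those whose key is on the black list
    // (dates, versions, ...), which are copied through unchanged.
    static Map<String, String> translate(Map<String, String> metadata,
                                         Set<String> doNotTranslate,
                                         UnaryOperator<String> translator) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String value = doNotTranslate.contains(entry.getKey())
                    ? entry.getValue()
                    : translator.apply(entry.getValue());
            out.put(entry.getKey(), value);
        }
        return out;
    }
}
```

A white list would be the same loop with the condition inverted; the black list just has the safer default when new fields appear.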




[jira] [Updated] (TIKA-1425) Automatic batching of Microsoft service calls

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1425:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Automatic batching of Microsoft service calls
 -

 Key: TIKA-1425
 URL: https://issues.apache.org/jira/browse/TIKA-1425
 Project: Tika
  Issue Type: Improvement
  Components: translation
Affects Versions: 1.6
Reporter: Lewis John McGibbney
 Fix For: 1.8


 Right now when I use the following code I get the stack trace at the bottom 
 of this description. This seems to be because the Request URI is too large to 
 make the service request. We need to have a mechanism within the call to 
 Tika.translate which will, on a service-by-service basis, determine the 
 maximum Request URI which can be sent. I believe that this should be on the 
 Tika side as how else am I meant to know the maximum request size?
 {code:title=translator.java|borderStyle=solid}
 +Translator translate = new MicrosoftTranslator();
 +((MicrosoftTranslator) translate).setId(...);
 +((MicrosoftTranslator) translate).setSecret(...);
  for (java.util.Map.Entry<Text, Parse> entry : parseResult) {
Parse parse = entry.getValue();
LOG.info("-\nUrl\n---\n");
 @@ -201,7 +207,7 @@
System.out.print(parse.getData().toString());
if (dumpText) {
  LOG.info("-\nParseText\n-\n");
 -System.out.print(parse.getText());
 +System.out.print(translate.translate(parse.getText(), "fr"));
}
 {code}
 {code:title=stacktrace.log|borderStyle=solid}
 Exception in thread "main" java.lang.Exception: [microsoft-translator-api] 
 Error retrieving translation : Server returned HTTP response code: 414 for 
 URL: 
 http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0...
 ...
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:202)
   at com.memetix.mst.translate.Translate.execute(Translate.java:61)
   at com.memetix.mst.translate.Translate.execute(Translate.java:76)
   at 
 org.apache.tika.language.translate.MicrosoftTranslator.translate(MicrosoftTranslator.java:104)
   at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:210)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:228)
 Caused by: java.io.IOException: Server returned HTTP response code: 414 for 
 URL: 
 http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0%BE%D1%80%D1%83%D0%B...
 ...
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at 
 sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675)
   at 
 sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673)
   at java.security.AccessController.doPrivileged(Native Method)
   at 
 sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671)
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244)
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:178)
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:199)
   ... 6 more
 Caused by: java.io.IOException: Server returned HTTP response code: 414 for 
 URL: 
 http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0%BE...
 ...
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
   at 
 java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:177)
   ... 7 more
 {code}
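
One possible shape for the batching, sketched with the JDK only. `RequestBatcher` and the `maxEncodedLength` parameter are hypothetical; the real per-service limit is not known from this report and would have to come from each translator. The sketch splits on whitespace so that the URL-encoded form of every chunk stays within the limit:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

final class RequestBatcher {
    // Length of the text once URL-encoded as UTF-8, which is what counts
    // against the request URI limit (one Cyrillic letter becomes 6 chars).
    static int encodedLength(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").length();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Greedily packs whitespace-separated words into chunks whose encoded
    // length does not exceed the limit. Runs of whitespace are normalized
    // to a single space. Re-encoding per word is quadratic; fine for a sketch.
    static List<String> split(String text, int maxEncodedLength) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (encodedLength(word) > maxEncodedLength) {
                throw new IllegalArgumentException("single word exceeds the limit");
            }
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (encodedLength(candidate) > maxEncodedLength) {
                chunks.add(current.toString());
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Each chunk would then be sent as its own request and the results joined. Splitting on sentence boundaries would probably translate better than splitting on words, but needs more machinery.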




[jira] [Updated] (TIKA-1408) Fix version for tikadotnet to be tracked along with trunk and release version

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1408:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Fix version for tikadotnet to be tracked along with trunk and release version
 -

 Key: TIKA-1408
 URL: https://issues.apache.org/jira/browse/TIKA-1408
 Project: Tika
  Issue Type: Bug
  Components: packaging
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8


 As reported by [~thaichat04] the tikadotnet versioning doesn't match up with 
 trunk. This is because we aren't releasing this code yet and it's not part of 
 the pom.xml file.




[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1072:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 AIOOBE when handling embedded document in .doc file
 ---

 Key: TIKA-1072
 URL: https://issues.apache.org/jira/browse/TIKA-1072
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.8

 Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin


 I have a Word (.doc) document that hits an exception when I run:
 {noformat}
 java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar 
 /x/tmp/20-Force-on-a-current-S00.doc 
 {noformat}
 Here's the exception:
 {noformat}
 Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
   at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
   at 
 org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139)
   at 
 org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
   at 
 org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
   at 
 org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {noformat}
 It happens when we try to parse an OLE10 embedded object ... the code
 that does this parsing captures and ignores Ole10NativeException and
 skips the entry ... so I'm wondering if we should also catch AIOOBE
 and skip the entry?  I.e., maybe this entry really is not OLE10, and the
 Ole10Native code is failing to throw Ole10NativeException for it?
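
The suggestion above (treat an AIOOBE like the already-ignored Ole10NativeException and skip the entry) could look roughly like the following sketch. It does not use POI: `NotOle10Exception` and `parseEntry` are hypothetical stand-ins for Ole10NativeException and the Ole10Native parsing code, and the two-byte read only imitates the kind of unchecked array access that fails in the trace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: skip embedded entries that fail to parse. */
final class EmbeddedEntrySkipper {

    /** Stand-in for POI's Ole10NativeException (assumption, not the real class). */
    public static class NotOle10Exception extends Exception {
        public NotOle10Exception(String message) {
            super(message);
        }
    }

    /** Reads a little-endian short without bounds checks, like the failing call. */
    private static int readShort(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Stand-in parser: rejects empty input explicitly, but may throw AIOOBE on short input. */
    public static String parseEntry(byte[] data) throws NotOle10Exception {
        if (data.length == 0) {
            throw new NotOle10Exception("empty entry");
        }
        return "entry@" + readShort(data, 0);
    }

    /** Parses every entry, skipping those that fail with either exception type. */
    public static List<String> extractAll(List<byte[]> entries) {
        List<String> parsed = new ArrayList<String>();
        for (byte[] entry : entries) {
            try {
                parsed.add(parseEntry(entry));
            } catch (NotOle10Exception | ArrayIndexOutOfBoundsException e) {
                // Not a well-formed entry: skip it instead of failing the whole parse.
            }
        }
        return parsed;
    }
}
```

Catching the unchecked exception at this one call site keeps the rest of the document parse going, at the cost of possibly hiding a genuine bug in the underlying parser.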




[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1308:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Support in memory parse mode(don't create temp file): to support run Tika in 
 GAE
 

 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: yuanyun.cn
  Labels: gae
 Fix For: 1.8


 I am trying to use Tika in GAE and wrote a simple servlet to extract metadata 
 from a JPEG:
 String urlStr = req.getParameter("imageUrl");
 byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
 ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
 Metadata metadata = new Metadata();
 BodyContentHandler ch = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(bais, ch, metadata, new ParseContext());
 bais.close();
 This fails with the following exception:
 Caused by: java.lang.SecurityException: Unable to create temporary file
   at java.io.File.createTempFile(File.java:1986)
   at 
 org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 I checked the code: 
 org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
 Metadata, ParseContext) creates a temp file from the input stream.
 I can understand why Tika creates a temp file from the stream: so that it can 
 parse it multiple times.
 But as GAE and other cloud servers are getting more popular, is it possible 
 to avoid creating the temp file? Instead, we could copy the original stream 
 into a byte array stream, so Tika can still parse it multiple times.
 This would limit the file size, as Tika keeps the whole file in memory, 
 but it would make Tika work in GAE and maybe on other cloud servers.
 We could add a parameter to parser.parse to indicate whether to do an 
 in-memory parse only.
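
The in-memory idea described above can be sketched without any Tika classes: read the stream once into a byte array (with a size cap, since everything is held in memory) and hand out a fresh stream for each pass. The class `InMemoryBuffer`, its method names, and the cap value are hypothetical, not an existing Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: buffer a stream in memory so it can be read several times. */
final class InMemoryBuffer {

    /** Reads the whole stream, failing once more than maxBytes have been seen. */
    public static byte[] readFully(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns a new stream over the same bytes; call once per parsing pass. */
    public static InputStream reopen(byte[] data) {
        return new ByteArrayInputStream(data);
    }
}
```

No temporary file is created, so the approach should work where file creation is forbidden; the cap makes the memory cost explicit.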
  




[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Make Option to Exclude Embedded Files' Text for Text Content
 

 Key: TIKA-819
 URL: https://issues.apache.org/jira/browse/TIKA-819
 Project: Tika
  Issue Type: New Feature
  Components: general
Affects Versions: 1.0
 Environment: Windows-7 + JDK 1.6 u26
Reporter: Albert L.
 Fix For: 1.8


 It would be nice to be able to disable text content from embedded files.
 For example, if I have a DOCX with an embedded PPTX, then I would like the 
 option to disable text from the PPTX from showing up when asking for the text 
 content from DOCX.  In other words, it would be nice to have the option to 
 get text content *only* from the DOCX instead of the DOCX+PPTX.




[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1276:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Missing embedded dependencies in tika-bundle
 

 Key: TIKA-1276
 URL: https://issues.apache.org/jira/browse/TIKA-1276
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
 Environment: OSGI, Apache Felix via Apache Sling Launcher
Reporter: Rupert Westenthaler
 Fix For: 1.8

 Attachments: TIKA-1276_20140423_rwesten.diff, 
 TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, 
 TIKA-1276_20140428_rwesten.diff


 While updating from Tika 1.2 to 1.5 I noticed that the 
 `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
 1. `com.uwyn:jhighlight:1.0` is not embedded
 Because of that, installing the bundle results in the following exception:
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
 [103.0] osgi.wiring.package; 
 (osgi.wiring.package=com.uwyn.jhighlight.renderer)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 2. `org.ow2.asm:asm:4.1` is not embedded because 
 `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and 
 therefore the `Embed-Dependency` directive `asm` does not match any 
 dependency. 
 Because of that, one gets the following exception (after fixing (1)):
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; 
 ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0)))
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 There are two ways to fix this: (a) change the `Embed-Dependency` directive 
 to `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
 tika-bundle pom file.
 3. `edu.ucar:netcdf:4.2-min` is not embedded
 Because of that, one gets the following exception (after fixing (1) and 
 (2)):
 {code}
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
 org.osgi.framework.BundleException: Unresolved constraint in bundle 
 org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
 [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
   at 
 org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
   at 
 org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
   at java.lang.Thread.run(Thread.java:744)
 {code}
 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
 After fixing the above issues the tika-bundle started successfully. 
 However, when extracting EXIF metadata from a JPEG image I got the following 
 exception:
 {code}
 java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
   at 
 com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
   at 
 org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   [..]

[jira] [Updated] (TIKA-1390) Create tika-example module

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1390:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Create tika-example module
 --

 Key: TIKA-1390
 URL: https://issues.apache.org/jira/browse/TIKA-1390
 Project: Tika
  Issue Type: Bug
  Components: example
Reporter: Tyler Palsulich
 Fix For: 1.8


 This issue will track the initial creation of the tika-example module. 
 Subtasks will be used for the first few examples.




[jira] [Updated] (TIKA-1426) Let's allow users to specify a tika config file on the commandline for tika-app and tika-server

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1426:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Let's allow users to specify a tika config file on the commandline for 
 tika-app and tika-server
 ---

 Key: TIKA-1426
 URL: https://issues.apache.org/jira/browse/TIKA-1426
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.8


 It would be handy to be able to specify a tika-config file when using 
 tika-app and tika-server.  I added this capability to tika-app as part of 
 TIKA-1418.  I should have opened a separate issue at the time (mea culpa).  
 This present issue covers both tika-app and tika-server.




[jira] [Updated] (TIKA-1315) Basic list support in WordExtractor

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1315:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Basic list support in WordExtractor
 ---

 Key: TIKA-1315
 URL: https://issues.apache.org/jira/browse/TIKA-1315
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Filip Bednárik
Priority: Minor
 Fix For: 1.8

 Attachments: ListManager.tar.bz2, ListNumbering.patch, 
 ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch


 Hello guys, I am really sorry to post an issue like this, but I have no other 
 way of contacting you and I don't quite understand how you manage forks and 
 pull requests (I don't think you do that). Plus, I don't know your coding 
 style.
 In my project I needed Tika to parse numbered lists from Word .doc 
 documents, but Tika doesn't support that. So I looked for a solution and 
 found one here: 
 http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
  . I adapted this solution to Apache Tika with a few fixes and improvements. 
 Feel free to use any of it so it can help people who struggle with 
 lists in Tika like I did.
 Attached files are:
 Updated test
 Fixed WordExtractor
 Added ListUtils




[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1318:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Use of Deprecated Word6Extractor.getParagraphText() Method
 --

 Key: TIKA-1318
 URL: https://issues.apache.org/jira/browse/TIKA-1318
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tyler Palsulich
Priority: Minor
  Labels: deprecation
 Fix For: 1.8


 org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the 
 deprecated Word6Extractor.getParagraphText() method. getParagraphText() is 
 supposed to return a String[] with an element for each paragraph in the text. 
 The replacement is getText(), which lets paragraph, cell, etc. separation be 
 implementation specific. I'm not sure, at this point, how the POI 
 WordExtractor separates them.




[jira] [Updated] (TIKA-1106) CLAVIN Integration

2014-10-24 Thread Chris A. Mattmann (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris A. Mattmann updated TIKA-1106:

Fix Version/s: (was: 1.7)
1.8

- push to 1.8

CLAVIN Integration
--

Key: TIKA-1106
URL: https://issues.apache.org/jira/browse/TIKA-1106
Project: Tika
Issue Type: Wish
Components: general
Affects Versions: 1.3
Environment: All
Reporter: Adam Estrada
Priority: Minor
Labels: entity, geospatial
Fix For: 1.8

I've been evaluating CLAVIN as a way to extract location information from
unstructured text. It seems like meshing it with Tika in some way would make
a lot of sense. From the CLAVIN website:
{quote}
CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source
software package for document geotagging and geoparsing that employs
context-based geographic entity resolution. It combines a variety of open
source tools with natural language processing techniques to extract location
names from unstructured text documents and resolve them against gazetteer
records. Importantly, CLAVIN does not simply look up location names;
rather, it uses intelligent heuristics in an attempt to identify precisely
which Springfield (for example) was intended by the author, based on the
context of the document. CLAVIN also employs fuzzy search to handle
incorrectly-spelled location names, and it recognizes alternative names
(e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic
entity. By enriching text documents with structured geo data, CLAVIN enables
hierarchical geospatial search and advanced geospatial analytics on
unstructured data.
{quote}
There was only one other instance of the word clavin mentioned in the ASF
jira site so I thought it was definitely worth posting here.
https://github.com/Berico-Technologies/CLAVIN


[jira] [Updated] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1269:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Self-hosted documentation for the JAX-RS Server
 ---

 Key: TIKA-1269
 URL: https://issues.apache.org/jira/browse/TIKA-1269
 Project: Tika
  Issue Type: Improvement
  Components: server
Affects Versions: 1.5
Reporter: Nick Burch
 Fix For: 1.8

 Attachments: TIKA-1269-miredot.patch, enable-enunciate.patch


 Currently, if you fire up the JAX-RS Tika Server and go to the root of the 
 server in a web browser, you get an empty page back. You have to know to head 
 over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available 
 URLs are.
 We should self-host some simple documentation at the root of the server, 
 so that people can discover what it offers. Ideally, this should be 
 largely auto-generated from the endpoints, so that we don't risk missing 
 things when we add new features.
 This would also allow us to offer a sample running version of the 
 server for people to discover Tika with.




[jira] [Updated] (TIKA-1395) Create embedded image extraction example

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1395:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Create embedded image extraction example
 

 Key: TIKA-1395
 URL: https://issues.apache.org/jira/browse/TIKA-1395
 Project: Tika
  Issue Type: Sub-task
  Components: example
Reporter: Tyler Palsulich
Priority: Minor
 Fix For: 1.8


 Create an example of how to do embedded image extraction and parsing.




[jira] [Updated] (TIKA-1383) Simplify TikaServerCli endpoint setup code

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1383:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Simplify TikaServerCli endpoint setup code
 --

 Key: TIKA-1383
 URL: https://issues.apache.org/jira/browse/TIKA-1383
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Trivial
 Fix For: 1.8







[jira] [Updated] (TIKA-1167) Embedded object not extracted

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1167:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Embedded object not extracted
 -

 Key: TIKA-1167
 URL: https://issues.apache.org/jira/browse/TIKA-1167
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Daniel Bonniot de Ruisselet
Priority: Critical
 Fix For: 1.8

 Attachments: Doc w Structure that wont extract.docx


 For the attached docx, tika seems to detect the embedded object, as shown by 
 this tag:
 {{<div class="embedded" id="rId10"/>}}
 However, extraction itself (using -z on the command line, or using the API) 
 does not seem to work for this object:
 {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
 {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to 
 /tmp/tika/rId9_image1.wmf}}




[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-995:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 XHTMLContentHandler doesn't pass attributes of body element
 ---

 Key: TIKA-995
 URL: https://issues.apache.org/jira/browse/TIKA-995
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
Reporter: Markus Jelsma
 Fix For: 1.8

 Attachments: TIKA-995-1.3-1.patch, TIKA-995-unit.patch


 XHTMLContentHandler.startElement() uses lazyHead() for the body element 
 because it's defined in the AUTO Set. As a consequence, attributes of the 
 body element are not passed to downstream content handlers. 




[jira] [Updated] (TIKA-1307) Jenkins Java7 job requires a profile in order to build 'tika-java7' module.

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1307:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Jenkins Java7 job requires a profile in order to build 'tika-java7' module.
 ---

 Key: TIKA-1307
 URL: https://issues.apache.org/jira/browse/TIKA-1307
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.5
Reporter: Lewis John McGibbney
 Fix For: 1.8


 N.B. Can someone please create a *build* tag in the Admin area, then assign it 
 to this issue?
 This issue was flagged by Hong-Thai during the recent DISCUSS nightly builds 
 thread:
 http://www.mail-archive.com/dev%40tika.apache.org/msg07963.html




[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1366:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse 
 

 Key: TIKA-1366
 URL: https://issues.apache.org/jira/browse/TIKA-1366
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Sergey Beryozkin
Priority: Minor
 Fix For: 1.8


 Some of Tika Server services will benefit from optionally supporting JAX-RS 
 2.0 AsyncResponse




[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1108:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Represent individual slides in pptx
 ---

 Key: TIKA-1108
 URL: https://issues.apache.org/jira/browse/TIKA-1108
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.8


 When parsing ppt, Tika produces for each slide:
 <div class="slide">
 However, for pptx these seem to be missing; all the text is directly under 
 body.




[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1079:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Word document hits AIOOBE in SummaryExtractor.parseSummaries
 

 Key: TIKA-1079
 URL: https://issues.apache.org/jira/browse/TIKA-1079
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.8

 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc


 I'm not yet sure if this is a corrupted document (though MS Word opens it 
 just fine) or a bug in POI, but I hit this exception when running it through 
 TikaCLI:
 {noformat}
 java.lang.ArrayIndexOutOfBoundsException: -1
   at org.apache.poi.hpsf.CodePageString.init(CodePageString.java:161)
   at 
 org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
   at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
   at org.apache.poi.hpsf.Property.init(Property.java:164)
   at org.apache.poi.hpsf.Section.init(Section.java:277)
   at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
   at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:246)
   at 
 org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
   at 
 org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
 {noformat}




[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Encoding detection is too biased by encoding in meta tag
 

 Key: TIKA-539
 URL: https://issues.apache.org/jira/browse/TIKA-539
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 0.8, 0.9, 0.10
Reporter: Reinhard Schwab
Assignee: Ken Krugler
 Fix For: 1.8

 Attachments: TIKA-539.patch, TIKA-539_2.patch


 If the encoding in the meta tag is wrong, that encoding is still detected, 
 even if the right encoding was set in the metadata beforehand (which can come 
 from the HTTP response header).
 Test code to reproduce:
 static String content = "<html><head>\n"
   + "<meta http-equiv=\"content-type\" "
   + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
   + "</head><body>Über den Wolken\n</body></html>";

   /**
    * @param args
    * @throws IOException
    * @throws TikaException
    * @throws SAXException
    */
   public static void main(String[] args) throws IOException, SAXException,
       TikaException {
     Metadata metadata = new Metadata();
     metadata.set(Metadata.CONTENT_TYPE, "text/html");
     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler h = new BodyContentHandler(1);
     parser.parse(in, h, metadata, new ParseContext());
     System.out.print(h.toString());
     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
   }
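
To show why trusting a wrong meta-tag charset matters for this test string, here is a minimal stdlib-only sketch (no Tika involved) that decodes the same UTF-8 bytes with both charsets. The class name `CharsetMismatchDemo` is hypothetical; the string is the one from the test code above, written with a Unicode escape so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the same bytes decoded with the right and the wrong charset. */
final class CharsetMismatchDemo {

    /** Decodes bytes with the given charset. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00DC" is U with diaeresis, which takes two bytes in UTF-8.
        byte[] utf8 = "\u00DCber den Wolken".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        // The wrong charset turns the two UTF-8 bytes into two separate characters.
        System.out.println("UTF-8 decode length:      " + right.length());
        System.out.println("ISO-8859-1 decode length: " + wrong.length());
    }
}
```

If the parser picks iso-8859-1 from the meta tag even though the content is UTF-8, the first character of the body text is decoded as two characters, which is the kind of corruption this issue is about.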




[jira] [Updated] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1423:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Assignee: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.8

 Attachments: GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, fileName.html, 
 gdas1.forecmwf.2014062612.grib2


 The Arctic dataset contains a MIME format called GRIB (General 
 Regularly-distributed Information in Binary form, 
 http://en.wikipedia.org/wiki/GRIB ). GRIB is a well-known, concise data 
 format used in meteorology to store historical and forecast weather data. 
 There are 2 different types of the format: GRIB 0 and GRIB 2. 
 The focus will be on GRIB 2, which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as "sections", 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS)  optional 
 (3) Bit Map Section (BMS)  optional 
 (4) Binary Data Section (BDS) 
 (5) '' (ASCII Characters)
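
As a rough illustration of how a parser or detector might look at the Indicator Section listed above, the sketch below checks the start of a record and reads the edition number. It rests on two assumptions that are not stated in this issue: that a record begins with the ASCII bytes "GRIB", and that the eighth octet holds the edition number, as commonly documented for the WMO layout. `GribHeader` and its method are hypothetical names, not part of the attached GribParser.java.

```java
/** Hypothetical sketch: read the edition number from the start of a GRIB record. */
final class GribHeader {

    /**
     * Returns the edition number, or -1 if the bytes do not start with "GRIB"
     * or are too short to contain the edition octet.
     * Assumption: octets 1-4 are ASCII "GRIB" and octet 8 is the edition.
     */
    public static int edition(byte[] record) {
        if (record == null || record.length < 8) {
            return -1;
        }
        if (record[0] != 'G' || record[1] != 'R' || record[2] != 'I' || record[3] != 'B') {
            return -1;
        }
        return record[7] & 0xFF;
    }
}
```

A check like this only covers the first few bytes; the remaining sections would still need a real GRIB library to decode.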




[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1379:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 error in Tika().detect for xml files with xades signature
 -

 Key: TIKA-1379
 URL: https://issues.apache.org/jira/browse/TIKA-1379
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.4
Reporter: Alessandro De Angelis
 Fix For: 1.8


 We tried to get the MIME type of an XML file with an embedded XAdES signature. 
 The result is text/html and not the expected text/xml or 
 application/xml.
 Here is an example of the XML file:
 <VERBALI ad_cod="D69017" batch_id="0" cds_cod="D69" data_app="2013-09-23">
 <VERBALE Id="1" tipologia="Verbale esame">
   <VERB_NUM>00094853 0003 2</VERB_NUM>
   <DATA_APP>2013-09-23</DATA_APP>
   <DATA_ESA>2013-09-23</DATA_ESA>
   <AD_COD>D69017</AD_COD>
   <AD>FILOSOFIA DELLA SCIENZA</AD>
   <CDS_COD>D69</CDS_COD>
   <CDS>TEATRO E ARTI VISIVE</CDS>
   <TIPO_ESA></TIPO_ESA>
   <MAT>1233456</MAT>
   <NOME>PAOLINO</NOME>
   <COGNOME>PAPERINO</COGNOME>
   <VOTO>23.0</VOTO>
   <VOTODECOD>23</VOTODECOD>
   <CAUSALE></CAUSALE>
   <TIPO_MODULO></TIPO_MODULO>
   <IMG_PATH></IMG_PATH>
   <AA_SES_ID>2012</AA_SES_ID>
   <AD_CFU>6.0</AD_CFU>
   <NOTA></NOTA>
   <ATENEO>9</ATENEO>
   <ATENEO_DES>جامعة البندقية - TEST</ATENEO_DES>
   <TIPO_DOCUMENTO>Verbale_3</TIPO_DOCUMENTO>
   <TITOLARE_PROCEDIMENTO>QUI QUO QUA</TITOLARE_PROCEDIMENTO>
   <AD_STU_COD>D69017</AD_STU_COD>
   <AD_STU>FILOSOFIA DELLA SCIENZA</AD_STU>
   <CDS_STU_COD>D69</CDS_STU_COD>
   <CDS_STU>TEATRO E ARTI VISIVE</CDS_STU>
   <DOCENTE>QUI QUO QUA</DOCENTE>
 <DATA_DOCUMENTO>26-09-2013 09:55:53 CEST(+0200)</DATA_DOCUMENTO>
 <SOFTWARE_DI_CREAZIONE>
   <NOME>3</NOME>
   <VERSIONE>11.09.03</VERSIONE>
 </SOFTWARE_DI_CREAZIONE>
 </VERBALE><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" 
 Id="sig08744308748201048377">
 <ds:SignedInfo>
 <ds:CanonicalizationMethod 
 Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
 <ds:SignatureMethod 
 Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
 <ds:Reference URI="">
 <ds:Transforms>
 <ds:Transform Algorithm="http://www.w3.org/2002/06/xmldsig-filter2">
 <dsig-xpath:XPath 
 xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" 
 Filter="subtract">/descendant::ds:Signature</dsig-xpath:XPath>
 </ds:Transform>
 <ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
 <xsl:stylesheet xmlns:kion="http://www.kion.it/webesse3/multilingua" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
 exclude-result-prefixes="kion" version="1.0">
   <kion:ml module="FirmaDigitale" target="kion"></kion:ml>
   <xsl:output method="xml"></xsl:output>
   <xsl:variable name="mostra_ad_figlie" select="1"></xsl:variable>
   <xsl:variable name="verbale_root" 
 select="/VERBALI/VERBALE"></xsl:variable>
   <xsl:variable name="sostituzione_root" 
 select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO"></xsl:variable>
   <xsl:variable name="RAGG_ROOT" 
 select="/VERBALI/VERBALE/RAGGRUPPAMENTO"></xsl:variable>
   <xsl:variable name="COMM_ROOT" 
 select="/VERBALI/VERBALE/COMMISSIONE"></xsl:variable>

   <xsl:template match="/">
   <html>
   <head>
   <meta content="text/html;charset=UTF-8" 
 http-equiv="Content-Type"></meta>
   <xsl:choose>
   <xsl:when 
 test="$sostituzione_root">
   <title>Dichiarazione 
 conformità Verbale Esame</title>
   </xsl:when>
   <xsl:otherwise>
   <title>Verbalizzazione 
 esame</title>
   </xsl:otherwise>
   </xsl:choose>
   <style type="text/css">
td  {font-family: Arial; font-size:10pt;} 
div {font-family: Arial; font-size:10pt;}
pre {font-family: Arial; font-size:10pt;} 
   </style>
   </head>
   <body>
   <table>
   <xsl:choose>
   <xsl:when 
 test="$sostituzione_root">
   <tr><td align="center" 
 colspan="2"><big><strong><xsl:value-of 
 select="$verbale_root/ATENEO_DES"></xsl:value-of></strong></big><br></br></td></tr>
   <tr><td align="center" 
 colspan="2"><big><strong>DICHIARAZIONE DI 
 CONFORMITÀ</strong></big><br></br></td></tr>
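
The report does not say why the detector answers text/html. One possible explanation, and it is only an assumption here, is that the html, head and body tags inside the embedded XSLT stylesheet look like HTML to a content sniffer. The sketch below is not Tika code; it uses the JDK's StAX API in a hypothetical helper, RootElementSniffer, to show that reading only the root element of such a document yields VERBALI regardless of any HTML nested further down.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; it is not part of Tika.
final class RootElementSniffer {

    /** Returns the local name of the first element found in the XML text. */
    static String rootElementName(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Turn off DTDs and external entities before reading untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Stop at the first start tag: that is the root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        throw new IllegalArgumentException("no root element found");
    }

    public static void main(String[] args) {
        // Shortened stand-in for the attached file: HTML nested inside XML.
        String xml = "<VERBALI ad_cod=\"D69017\"><VERBALE Id=\"1\">"
                + "<html><head></head><body></body></html>"
                + "</VERBALE></VERBALI>";
        System.out.println(rootElementName(xml));
    }
}
```

A check of this kind only tells XML from non-XML by structure; choosing between text/xml and application/xml would still be up to the detector's own rules.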

[jira] [Updated] (TIKA-1435) Update rome dependency to 1.5

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1435:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Update rome dependency to 1.5
 -

 Key: TIKA-1435
 URL: https://issues.apache.org/jira/browse/TIKA-1435
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Johannes Mockenhaupt
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.8

 Attachments: netcdf-deps-changes.diff


 Rome 1.5 has been released to Sonatype 
 (https://github.com/rometools/rome/issues/183), though the website 
 (http://rometools.github.io/rome/) is blissfully ignorant of that. The update 
 is mostly maintenance: it adopts slf4j and generics and moves the 
 namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.




[jira] [Updated] (TIKA-1456) Visual Sentiment API parser

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1456:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Visual Sentiment API parser
 ---

 Key: TIKA-1456
 URL: https://issues.apache.org/jira/browse/TIKA-1456
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8


 Integrate the Visual Sentibank API as a parser for images. We can use 
 Aperture from CMU; it's released under the MIT license:
 https://github.com/d8w/aperture




[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1301:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Establish TikaServer on Apache hosted VM
 

 Key: TIKA-1301
 URL: https://issues.apache.org/jira/browse/TIKA-1301
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Lewis John McGibbney
 Fix For: 1.8


 Over in Any23, Infra recently provisioned us with a nice shiny new VM to run 
 our service on:
 http://any23.org
 I would like to do the same for Tika. I have some scripts on the Any23 VM 
 which will pull stable nightly tika-server snapshots and deploy them to the 
 VM. This is really nice for both devs and users alike.




[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-988:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 We don't extract a placeholder for a Word document embedded in an Excel 
 document
 

 Key: TIKA-988
 URL: https://issues.apache.org/jira/browse/TIKA-988
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.8

 Attachments: bug31373.xls


 In TIKA-956 we fixed the Word parser so that at the point where an embedded 
 document appears, we output a <div class="embedded" id="_XXX"/> tag.
 It would be nice to do this for documents embedded in Excel too.
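
As a rough illustration of what emitting such a placeholder involves, here is a minimal sketch using only the JDK's SAX classes. EmbeddedPlaceholder and its methods are hypothetical names, not Tika's API, and the actual Word parser change from TIKA-956 may be structured differently.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; it is not part of Tika.
final class EmbeddedPlaceholder {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Sends an empty div with class "embedded" and the given id to the handler. */
    static void emit(ContentHandler handler, String id) throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "embedded");
        attrs.addAttribute("", "id", "id", "CDATA", id);
        handler.startElement(XHTML, "div", "div", attrs);
        handler.endElement(XHTML, "div", "div");
    }

    /** Records the SAX events as text so the sketch can be checked on its own. */
    static String render(String id) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    // No escaping here: acceptable for a sketch, not for real output.
                    out.append(' ').append(atts.getQName(i))
                       .append("=\"").append(atts.getValue(i)).append('"');
                }
                out.append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }
        };
        try {
            emit(recorder, id);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("_XXX"));
    }
}
```

An Excel parser would call something like emit at the point in its event stream where the embedded object sits; how that position is found in a spreadsheet is not covered by this sketch.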




[jira] [Updated] (TIKA-1387) Add forbidden-apis checker to TIKA build

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1387:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Add forbidden-apis checker to TIKA build
 

 Key: TIKA-1387
 URL: https://issues.apache.org/jira/browse/TIKA-1387
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Uwe Schindler
Assignee: Tyler Palsulich
 Fix For: 1.8

 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, 
 TIKA-1387.patch, TIKA-1387.patch


 Lucene and many other projects already use the forbidden-apis checker to 
 prevent use of some broken classes/signatures from the JDK. These are 
 especially things that use default character sets or default locales. The 
 forbidden-api checker can also be used to explicitly disallow specific 
 methods, if they have security issues (e.g., creating XML parsers without 
 disabling external entity support).
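
To make the default charset and default locale problem concrete, the following sketch shows the explicit alternatives to calls such as String#getBytes() and String#toUpperCase(). The class and method names are made up for illustration and are not taken from the attached patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative only: explicit charset and locale instead of platform defaults.
final class ExplicitCharsetAndLocale {

    static byte[] encode(String text) {
        // getBytes() with no argument would use the platform default charset.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        // new String(bytes) with no charset would also use the platform default.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String upper(String text) {
        // Locale.ROOT gives case mapping that does not depend on the user's locale.
        return text.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Under a Turkish locale, 'i' upper-cases to a dotted capital I (U+0130).
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("title".toUpperCase(turkish).equals("TITLE"));
        System.out.println(upper("title").equals("TITLE"));
        System.out.println(decode(encode("conformit\u00e0")));
    }
}
```

The Turkish case is the usual reason locale-sensitive case mapping is flagged: code that compares upper-cased identifiers or file extensions can behave differently depending on where the JVM runs.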
 The attached patch adds the forbidden-api checker to the tika-parent pom file 
 with default configuration.
 Running it fails with many errors in TIKA core already:
 {noformat}
 [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core ---
 [INFO] Scanning for classes to check...
 [INFO] Reading bundled API signatures: jdk-unsafe
 [INFO] Reading bundled API signatures: jdk-deprecated
 [INFO] Loading classes to check...
 [INFO] Scanning for API signatures and dependencies...
 [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses 
 default charset]
 [ERROR]   in org.apache.tika.language.LanguageProfilerBuilder 
 (LanguageProfilerBuilder.java:407)
 [ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses 
 default locale]
 [ERROR]   in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68)
 [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:257)
 [ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:395)
 [ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:416)
 [ERROR] Forbidden method invocation: 
 java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:438)
 [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:532)
 [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:550)
 [ERROR] Forbidden method invocation: java.lang.String#<init>(byte[]) [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:588)
 [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:656)
 [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:782)
 [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses 
 default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:851)
 [ERROR] Forbidden method invocation: 
 java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:957)
 [ERROR] Forbidden method invocation: 
 java.io.OutputStreamWriter#<init>(java.io.OutputStream) [Uses default charset]
 [ERROR]   in org.apache.tika.io.IOUtils (IOUtils.java:1064)
 [ERROR] Forbidden method invocation: 
 java.io.OutputStreamWriter#<init>(java.io.OutputStream) [Uses default charset]
 [ERROR]   in org.apache.tika.sax.WriteOutContentHandler 
 (WriteOutContentHandler.java:93)
 [ERROR] Forbidden method invocation: 
 java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
 [ERROR]   in org.apache.tika.parser.external.ExternalParser 
 (ExternalParser.java:234)
 [ERROR] Forbidden method invocation: 
 java.io.InputStreamReader#<init>(java.io.InputStream) [Uses default charset]
 [ERROR]   in org.apache.tika.parser.external.ExternalParser$3 
 (ExternalParser.java:294)
 [ERROR] Forbidden method invocation: 
 java.util.Calendar#getInstance(java.util.Locale) [Uses default locale or time 
 zone]
 [ERROR]   in org.apache.tika.utils.DateUtils (DateUtils.java:83)
 [ERROR] Forbidden method invocation: 
 java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default 
 locale]
 [ERROR]   in org.apache.tika.utils.DateUtils (DateUtils.java:91)
 [ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses 
 default locale]

[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-987:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
 

 Key: TIKA-987
 URL: https://issues.apache.org/jira/browse/TIKA-987
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.8

 Attachments: picture.doc, picture_3.doc


 I have two Word docs, both containing the same drawing, but one has
 text added.
 In one case (picture.doc) the extraction is correct: it contains only
 an embedded image.wmf; when I view the image it's correct.
 In the second case (picture_3.doc) the picture is extracted as image
 (no extension), and is 0 bytes, and there is an invalid character
 (mapped to unicode replacement char) inserted before the image:
 {noformat}
 <title/>
 </head>
 <body><p>�<img src="embedded:image1" alt="image1"/></p>
 <p/>
 <p/>
 <p>vehicle
 </p>
 {noformat}
 (The text "vehicle" is extracted correctly, though.)
 I dug a bit, and with the 2nd doc there is an embedded {SHAPE *
 MERGEFORMAT} field, which we invoke
 WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts
 the 0-byte no-extension image as well as the invalid character.  With
 the first doc there is no field (at least not one that's handled with
 handleSpecialCharacterRuns...).  Otherwise I'm not sure how to
 fix... it could be something is going wrong in how POI parses the
 Pictures from PictureSource.




[jira] [Updated] (TIKA-1416) Refactor Translator Exception Handling

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1416:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Refactor Translator Exception Handling
 --

 Key: TIKA-1416
 URL: https://issues.apache.org/jira/browse/TIKA-1416
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Tyler Palsulich
 Fix For: 1.8


 `Translator.translate()` currently throws `Exception`. We should make it more 
 specific. The only real limitation comes from MicrosoftTranslator -- the 
 library used throws `Exception`, but that shouldn't mean Tika does too.
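
One way to narrow the signature is to catch the library's broad `Exception` at the boundary and rethrow a dedicated checked type. The sketch below assumes a hypothetical `TranslationException` and a stand-in `ThirdPartyClient` interface; neither name comes from the issue or from the Tika code base.

```java
// Illustrative only: wraps a broad `throws Exception` in a specific checked type.
final class TranslationSketch {

    /** Hypothetical exception type; the issue does not name one. */
    static final class TranslationException extends Exception {
        TranslationException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for a third-party client whose method declares `throws Exception`. */
    interface ThirdPartyClient {
        String execute(String text, String from, String to) throws Exception;
    }

    static String translate(ThirdPartyClient client, String text, String from, String to)
            throws TranslationException {
        try {
            return client.execute(text, from, to);
        } catch (RuntimeException e) {
            // Programming errors propagate unchanged.
            throw e;
        } catch (Exception e) {
            // Everything else is reported through the narrower type, cause preserved.
            throw new TranslationException(
                    "translation " + from + " -> " + to + " failed", e);
        }
    }

    public static void main(String[] args) {
        ThirdPartyClient failing = (text, from, to) -> {
            throw new java.io.IOException("service unreachable");
        };
        try {
            translate(failing, "hola", "es", "en");
        } catch (TranslationException e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause() + ")");
        }
    }
}
```

Callers then only need to handle one documented failure type, while the original exception stays available through getCause().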




[jira] [Updated] (TIKA-776) ExifTool Embedder

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---
Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 ExifTool Embedder
 -

 Key: TIKA-776
 URL: https://issues.apache.org/jira/browse/TIKA-776
 Project: Tika
  Issue Type: New Feature
  Components: metadata
Affects Versions: 1.0
 Environment: ExifTool is required 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: embed, exiftool, patch
 Fix For: 1.8

 Attachments: tika-parsers-exiftool-embed-patch.txt


 This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
 issues TIKA-774 and TIKA-775.
 In tika-parsers, an ExiftoolExternalEmbedder is added. It extends 
 ExternalEmbedder to programmatically create an Embedder which calls the 
 ExifTool command line to embed Tika metadata into a file stream. An 
 ExiftoolExternalEmbedderTest unit test is also added; it embeds several IPTC 
 and XMP fields, then parses the resulting file stream to verify the operation.
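
For readers unfamiliar with how metadata is handed to ExifTool, the sketch below assembles an argument list in ExifTool's -TAG=VALUE form. It is not the code from the attached patch: the class name, the chosen tags and the file name are assumptions for illustration, and the sketch only builds the command rather than running it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; not the ExiftoolExternalEmbedder from the patch.
final class ExifToolCommandSketch {

    /** Builds "exiftool -TAG=VALUE ... FILE" from a tag-to-value map. */
    static List<String> buildCommand(Map<String, String> tags, String file) {
        List<String> command = new ArrayList<>();
        command.add("exiftool");
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            // Each entry becomes one argument, so values with spaces need no quoting.
            command.add("-" + tag.getKey() + "=" + tag.getValue());
        }
        command.add(file);
        return command;
    }

    public static void main(String[] args) {
        // Tag names and file name are example values, not taken from the patch.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("IPTC:Keywords", "tika");
        tags.put("XMP-dc:Title", "Sample title");
        List<String> command = buildCommand(tags, "image.jpg");
        System.out.println(String.join(" ", command));
        // The list could be passed to new ProcessBuilder(command) to run ExifTool.
    }
}
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when metadata values contain spaces or shell metacharacters.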




[jira] [Updated] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2014-10-24 Thread Chris A. Mattmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1367:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Tika documentation should list tika-parsers parser dependencies
 ---

 Key: TIKA-1367
 URL: https://issues.apache.org/jira/browse/TIKA-1367
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Reporter: Sergey Beryozkin
 Fix For: 1.8


 The tika-parsers module has many strong transitive parser dependencies. Maven 
 users of tika-parsers have to exclude all the transitive dependencies 
 manually. Documenting the list of the existing transitive dependencies and 
 keeping the list up to date will help developers exclude the libraries not 
 needed for a given project.



