[jira] [Created] (TIKA-1456) Visual Sentiment API parser
Chris A. Mattmann created TIKA-1456: --- Summary: Visual Sentiment API parser Key: TIKA-1456 URL: https://issues.apache.org/jira/browse/TIKA-1456 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Integrate the Visual Sentibank API as a parser for images. We can use Aperture from CMU, it's released under the MIT license: https://github.com/d8w/aperture -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173983#comment-14173983 ] Tilman Hausherr edited comment on TIKA-1442 at 10/24/14 11:02 AM: -- After some more research, I was able to decode 5 more files (the cause was not the LZW filter, see PDFBOX-2296, but I fixed this only in 2.0). However 7 other files are really corrupt, portions of the files are blank when shown in AR: 115/115269.pdf 211/211876.pdf 268/268346.pdf 389/389474.pdf 443/443752.pdf 698/698813.pdf 846/846759.pdf was (Author: tilman): After some more research, I was able to decode 5 more files (the cause was not the LZW filter, see ). However 7 other files are really corrupt, portions of the files are blank when shown in AR: 115/115269.pdf 211/211876.pdf 268/268346.pdf 389/389474.pdf 443/443752.pdf 698/698813.pdf 846/846759.pdf Upgrade to PDFBox 1.8.8 --- Key: TIKA-1442 URL: https://issues.apache.org/jira/browse/TIKA-1442 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Fix For: 1.7 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7. Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui
[ https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182674#comment-14182674 ] Tim Allison commented on TIKA-1451: --- Thank you, Chris. The credit goes to [~jukkaz] and [~gagravarr] for the recursive parser example! I'm grateful to now have an out-of-the-box format (w/ serializers and deserializers) that captures embedded document metadata. As I was working on this, I was starting to think that we might want to add some tika: prefixed properties to TikaCoreProperties to capture metadata generated during processing, such as: tika:content, tika:parse_time_millis, tika:exception, tika:parsed_by (instead of our current X-Parsed-By). In effect, move the RecursiveParserWrapper properties to TikaCoreProperties and add some others as necessary. Add Recursive Metadata Parser Wrapper output to tika-app and gui Key: TIKA-1451 URL: https://issues.apache.org/jira/browse/TIKA-1451 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.7 Attachments: integrate_recursive_metadata_wrapper.patch It would be helpful to expose the output of the recursive metadata parser wrapper in the gui and in the command line for tika-app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
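The processing-time properties proposed in the TIKA-1451 comment (tika:content, tika:parse_time_millis, tika:exception, tika:parsed_by) could be recorded roughly as in the sketch below. This is only an illustration: a plain Map stands in for Tika's Metadata object, and the class name, method name, and Callable-based parse step are all hypothetical, not part of TikaCoreProperties or RecursiveParserWrapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical sketch: a plain Map stands in for Tika's Metadata class.
final class ProcessingMetadataSketch {
    // Key names taken from the proposal in the comment above.
    static final String CONTENT = "tika:content";
    static final String PARSE_TIME_MILLIS = "tika:parse_time_millis";
    static final String EXCEPTION = "tika:exception";
    static final String PARSED_BY = "tika:parsed_by";

    // Runs the parse step and records content, elapsed time, and any exception.
    static Map<String, String> record(String parserName, Callable<String> parseStep) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(PARSED_BY, parserName);
        long start = System.nanoTime();
        try {
            metadata.put(CONTENT, parseStep.call());
        } catch (Exception e) {
            // Keep the failure alongside the other values instead of losing it.
            metadata.put(EXCEPTION, e.toString());
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            metadata.put(PARSE_TIME_MILLIS, Long.toString(elapsedMillis));
        }
        return metadata;
    }
}
```

Keeping the exception as a metadata value means a failed embedded document still produces a record, which is the kind of per-document bookkeeping the comment describes.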
RE: import (re)ordering?
Y, I'll try to be more careful about separating out formatting from content in the future (apologies for TIKA-1451). What I didn't want to do was start an IDE war if others have different settings that will order imports in a different way. Thank you! -Original Message- From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Friday, October 24, 2014 1:53 AM To: dev@tika.apache.org Subject: Re: import (re)ordering? Hey Tim, No big objections from me, but it will dilute things so glad we have it noted if it happens. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Allison, Timothy B. talli...@mitre.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Tuesday, October 21, 2014 at 1:59 PM To: dev@tika.apache.org dev@tika.apache.org Subject: import (re)ordering? All, I have Intellij set to order imports by javax, java, then other. I think this is the most common pattern in Tika. Is it ok if I make these (meaningless/formatting) changes when I commit other changes? Thank you. Best, Tim
RE: import (re)ordering?
Thanks, Tim. I'll be sure to update my settings for this. On a similar note, can we standardize the formatting of the pom.xml files? Right now, they are pretty irregular. Tyler On Oct 24, 2014 10:52 AM, Allison, Timothy B. talli...@mitre.org wrote: Y, I'll try to be more careful about separating out formatting from content in the future (apologies for TIKA-1451). What I didn't want to do was start an IDE war if others have different settings that will order imports in a different way. Thank you! -Original Message- From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Friday, October 24, 2014 1:53 AM To: dev@tika.apache.org Subject: Re: import (re)ordering? Hey Tim, No big objections from me, but it will dilute things so glad we have it noted if it happens. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Allison, Timothy B. talli...@mitre.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Tuesday, October 21, 2014 at 1:59 PM To: dev@tika.apache.org dev@tika.apache.org Subject: import (re)ordering? All, I have Intellij set to order imports by javax, java, then other. I think this is the most common pattern in Tika. Is it ok if I make these (meaningless/formatting) changes when I commit other changes? Thank you. Best, Tim
RE: import (re)ordering?
On Fri, 24 Oct 2014, Allison, Timothy B. wrote: Y, I'll try to be more careful about separating out formatting from content in the future (apologies for TIKA-1451). What I didn't want to do was start an IDE war if others have different settings that will order imports in a different way. I'd say just pick something sensible, and then document it for everyone in http://tika.apache.org/contribute.html so it's clear what to do! Nick
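For reference, the import order described in this thread (javax, then java, then everything else) would look roughly like the sketch below. The class and method names are hypothetical, and only JDK packages are used so the example stands on its own; it is not taken from the Tika source tree.

```java
// Group 1: javax.*
import javax.xml.parsers.DocumentBuilderFactory;

// Group 2: java.*
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Group 3: everything else
import org.w3c.dom.Document;

// Hypothetical class; it exists only so every import above is actually used.
final class ImportOrderExample {
    // Parses a small XML string and returns the name of its root element.
    static String rootElementName(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootElementName("<tika/>"));
    }
}
```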
[jira] [Resolved] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich resolved TIKA-1422.
---
Resolution: Fixed

Fixed in r1634094. Skip over the two failing checks if Tesseract is installed.

org.apache.tika.parser.mail.RFC822ParserTest fails
--
Key: TIKA-1422
URL: https://issues.apache.org/jira/browse/TIKA-1422
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch

I'm seeing test failures from:
{noformat}
Results :
Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)
Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
{noformat}
CentOS6 VM image, running:
{noformat}
[mattmann@memex tika]$ java -version
java version 1.7.0_67
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
[mattmann@memex tika]$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_65, vendor: Oracle Corporation
Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix
[mattmann@memex tika]$
{noformat}
Here are the surefire reports - no clue what's up here:
{noformat}
[mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
---
Test set: org.apache.tika.parser.mail.RFC822ParserTest
---
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE!
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations:
xHTMLContentHandler.startElement( "http://www.w3.org/1999/xhtml", "div", "div", isA(org.xml.sax.Attributes) );
Wanted 4 times but was 5
 at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
Undesired invocation:
 at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
 at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
 at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
 at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
 at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
 at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
 at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
 at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
 at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
 at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
 at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
 at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
 at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
 at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
 at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
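The resolution above skips the two invocation-count checks when Tesseract is installed, since the trace shows TesseractOCRParser emitting the extra div. One possible way to detect an installed Tesseract from a test is sketched below. It assumes the executable is called `tesseract` and is found via PATH; the class and method names are hypothetical, and this is not the code committed in r1634094.

```java
import java.io.IOException;

// Hypothetical helper; not taken from the Tika test sources.
final class TesseractCheck {
    // Returns true if a process for the given executable can be started.
    // ProcessBuilder.start() throws IOException when the program cannot be found.
    static boolean isOnPath(String executable) {
        try {
            Process process = new ProcessBuilder(executable)
                    .redirectErrorStream(true)
                    .start();
            process.destroy(); // we only care that it could be launched
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // The strict invocation counts only hold when the OCR parser stays out of the way.
    static boolean shouldVerifyInvocationCounts() {
        return !isOnPath("tesseract"); // assumed executable name
    }

    public static void main(String[] args) {
        System.out.println("verify counts: " + shouldVerifyInvocationCounts());
    }
}
```

A test could wrap the strict count verifications in `if (TesseractCheck.shouldVerifyInvocationCounts()) { ... }` so that the remaining assertions still run on machines with Tesseract installed.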
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183103#comment-14183103 ] Hudson commented on TIKA-1422: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #282 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/282/]) TIKA-1422. Skip checking the number of some handler invocations in the RFC822ParserTest if Tesseract is installed. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1634094) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java org.apache.tika.parser.mail.RFC822ParserTest fails -- Key: TIKA-1422 URL: https://issues.apache.org/jira/browse/TIKA-1422 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183161#comment-14183161 ] Hudson commented on TIKA-1422: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #262 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/262/]) TIKA-1422. Skip checking the number of some handler invocations in the RFC822ParserTest if Tesseract is installed. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1634094) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java org.apache.tika.parser.mail.RFC822ParserTest fails -- Key: TIKA-1422 URL: https://issues.apache.org/jira/browse/TIKA-1422 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch
Re: 1.7 release?
Hi Tyler, don't mention. Cheers, Oleg On Oct 24, 2014 8:02 PM, Tyler Palsulich tpalsul...@gmail.com wrote: Thank you for the help, Oleg! I just resolved TIKA-1422. So, are there any other issues anyone would like to resolve before a new release? Thanks, Tyler On Tue, Oct 21, 2014 at 2:42 AM, Oleg Tikhonov olegtikho...@gmail.com wrote: Sorry!!! On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Oleg, will try tomorrow for me Los angeles time! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Oleg Tikhonov o...@apache.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Monday, October 20, 2014 at 11:20 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: 1.7 release? Please take a try with newest patch. Cheers, Oleg On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov olegtikho...@gmail.com wrote: Taken. Thanks. in progress ... On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Trunk is the current checkout/branch: http://svn.apache.org/repos/asf/tika/trunk ++ Chris Mattmann, Ph.D. 
Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Oleg Tikhonov olegtikho...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Monday, October 20, 2014 at 10:16 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: 1.7 release? Hi, I can try this on. What is a trunk? Thanks, Oleg On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hmm any idea why this is failing on Windows? Tyler P. and I were talking the other day - maybe we shouldn't run the tests from TIKA-1422 unless Tesseract is installed? Thoughts? ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Hong-Thai Nguyen thaicha...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Thursday, October 16, 2014 at 2:03 AM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: 1.7 release? Hi Andrzej, We are impatient for 1.7 release too. I'm having compiling problem of TIKA-1422 on me. If anyone can build successfully on Windows, I have no objection to release 1.7 Thanks, On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org wrote: Hi, Any news on the 1.7 release? or at least a 1.6.1 release that includes the fix for broken ODF parsing... --- Best regards, Andrzej Bialecki -- -- Hong-Thai
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183338#comment-14183338 ] Tim Allison commented on TIKA-1442: --- Hmmm...I can't explain those files, and I recently did some cleanup so I don't have the original 1.8.6 output. When I recently reran with the latest Tika trunk, I got the same number of metadata values for those files with PDFBox 1.8.6 and 1.8.8-SNAPSHOT (vintage 2 days ago). All the problematic files have attachments. I wonder if recent work on the OCR parser could explain this. [~tpalsulich], over the last few weeks, was there a time when we were extracting metadata from images, but now we're not? For 224644.pdf, for example, there doesn't seem to be much metadata for the jpgs now...a total of 40 metadata values for the full document. Last week, when I ran Tika, there were 160 metadata values. {noformat} {Content-Length:5970,Content-Type:image/jpeg,X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser],embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg,tika:embedded_resource_path:224644.pdf/arrow.jpg},{Content-Length:5970,Content-Type:image/jpeg,X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser],embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg,tika:embedded_resource_path:224644.pdf/arrow.jpg}] {noformat} In short, [~tilman], I don't think this is a PDFBox issue. Upgrade to PDFBox 1.8.8 --- Key: TIKA-1442 URL: https://issues.apache.org/jira/browse/TIKA-1442 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Fix For: 1.7 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7.
Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183338#comment-14183338 ] Tim Allison edited comment on TIKA-1442 at 10/24/14 7:22 PM: - Hmmm...I can't explain those files, and I recently did some cleanup so I don't have the original 1.8.6 output. When I recently reran with the latest Tika trunk, I got the same number of metadata values for those files with PDFBox 1.8.6 and 1.8.8-SNAPSHOT (vintage 2 days ago). All the problematic files have attachments. I wonder if recent work on the OCR parser could explain this. [~tpalsulich], over the last few weeks, was there a time when we were extracting metadata from images, but now we're not? For 224644.pdf, for example, there doesn't seem to be much metadata for the jpgs now...a total of 40 metadata values for the full document. Last week, when I ran Tika, there were 160 metadata values. {noformat} {Content-Length:5970,Content-Type:image/jpeg, X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser], embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg, tika:embedded_resource_path:224644.pdf/arrow.jpg}, {Content-Length:5970,Content-Type:image/jpeg, X-Parsed-By:[org.apache.tika.parser.DefaultParser,org.apache.tika.parser.ocr.TesseractOCRParser], embeddedResourceType:ATTACHMENT,resourceName:arrow.jpg, tika:embedded_resource_path:224644.pdf/arrow.jpg}] {noformat} In short, [~tilman], I don't think this is a PDFBox issue. Upgrade to PDFBox 1.8.8 --- Key: TIKA-1442 URL: https://issues.apache.org/jira/browse/TIKA-1442 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Fix For: 1.7 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7. Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183383#comment-14183383 ] Tyler Palsulich commented on TIKA-1442: --- Yes, unfortunately. Please see TIKA-1445. [~mattmann], any thoughts? Upgrade to PDFBox 1.8.8 --- Key: TIKA-1442 URL: https://issues.apache.org/jira/browse/TIKA-1442 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Fix For: 1.7 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7. Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183383#comment-14183383 ] Tyler Palsulich edited comment on TIKA-1442 at 10/24/14 8:05 PM: - Yes, unfortunately. Please see TIKA-1445. [~chrismattmann], any thoughts? was (Author: tpalsulich): Yes, unfortunately. Please see TIKA-1445. [~mattmann], any thoughts? Upgrade to PDFBox 1.8.8 --- Key: TIKA-1442 URL: https://issues.apache.org/jira/browse/TIKA-1442 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Fix For: 1.7 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7. Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
RE: 1.7 release?
Sorry for coming late to the game on the implications of TIKA-1445. I don't want to hold up the release of 1.7. However, would it be possible to return to the legacy default behavior of extracting metadata from images? We can then document on the OCR parser page on the wiki that you need to install Tesseract _and_ make a change in the parser/mime config file. If you want this new capability, it will take a small bit of work until we solve TIKA-1445. I worry that the current behavior of 1.7 would be surprising to most non-dev users (well, even to at least one dev :) ). Cheers, Tim From: Oleg Tikhonov [olegtikho...@gmail.com] Sent: Friday, October 24, 2014 2:24 PM To: dev@tika.apache.org Subject: Re: 1.7 release? Hi Tyler, don't mention. Cheers, Oleg On Oct 24, 2014 8:02 PM, Tyler Palsulich tpalsul...@gmail.com wrote: Thank you for the help, Oleg! I just resolved TIKA-1422. So, are there any other issues anyone would like to resolve before a new release? Thanks, Tyler On Tue, Oct 21, 2014 at 2:42 AM, Oleg Tikhonov olegtikho...@gmail.com wrote: Sorry!!! On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Oleg, will try tomorrow for me Los angeles time! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Oleg Tikhonov o...@apache.org Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Monday, October 20, 2014 at 11:20 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: 1.7 release? Please take a try with newest patch. Cheers, Oleg On Tue, Oct 21, 2014 at 9:08 AM, Oleg Tikhonov olegtikho...@gmail.com wrote: Taken. Thanks. in progress ... 
On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Trunk is the current checkout/branch: http://svn.apache.org/repos/asf/tika/trunk -Original Message- From: Oleg Tikhonov olegtikho...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Monday, October 20, 2014 at 10:16 PM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: 1.7 release? Hi, I can try this one. What is a trunk? Thanks, Oleg On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hmm any idea why this is failing on Windows? Tyler P. and I were talking the other day - maybe we shouldn't run the tests from TIKA-1422 unless Tesseract is installed? Thoughts? -Original Message- From: Hong-Thai Nguyen thaicha...@gmail.com Reply-To: dev@tika.apache.org dev@tika.apache.org Date: Thursday, October 16, 2014 at 2:03 AM To: dev@tika.apache.org dev@tika.apache.org Subject: Re: 1.7 release? Hi Andrzej, We are impatient for the 1.7 release too. I'm having a problem compiling TIKA-1422 on my side. If anyone can build successfully
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183873#comment-14183873 ] Tyler Palsulich commented on TIKA-1445: --- I've been trying my hand at this for some time now. An idea I had was to create a temporary file from the input InputStream, then create new input streams from that file to run each Parser on. But, before this OCR Parser, we only ran one Parser on the image, anyway. So, what if there was a way to get the second best default parser for the image? An option is to hard code the exact working Parsers. But, in my opinion, we should load them dynamically. So, that would require getting a {{List<Parser>}}, instead of just the best Parser for a given MediaType ({{CompositeParser.getParsers(ParseContext)}}). If we only chose the second best Parser, we wouldn't have to merge the Metadata results, since the OCRParser doesn't add Metadata. But, it might call the ContentHandler. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1445.Mattmann.101214.patch.txt Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
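The temporary-file idea from the comment above can be sketched with plain JDK classes. This is only an illustration under stated assumptions: `StreamConsumer` is a hypothetical stand-in for a Parser (it is not a Tika interface), and `runAll` is a made-up helper name. The input is spooled to disk once, and each consumer gets its own fresh stream over that file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class SpoolAndReparse {

    /** Hypothetical stand-in for a parser: reads a stream, returns some result. */
    public interface StreamConsumer {
        String consume(InputStream in) throws IOException;
    }

    /** Spools the input to a temp file once, then hands each consumer a fresh stream. */
    public static List<String> runAll(InputStream input, List<StreamConsumer> consumers)
            throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        try {
            // The original stream is consumed exactly once, here.
            Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING);
            List<String> results = new ArrayList<>();
            for (StreamConsumer consumer : consumers) {
                try (InputStream fresh = Files.newInputStream(tmp)) {
                    results.add(consumer.consume(fresh));
                }
            }
            return results;
        } finally {
            // Always remove the spooled copy, even if a consumer failed.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "fake image bytes".getBytes(StandardCharsets.UTF_8);
        List<StreamConsumer> consumers = new ArrayList<>();
        consumers.add(in -> "length=" + in.readAllBytes().length); // e.g. a metadata pass
        consumers.add(in -> "first=" + (char) in.read());          // e.g. an OCR pass
        System.out.println(runAll(new ByteArrayInputStream(data), consumers));
    }
}
```

Merging the results of the individual passes (the Metadata merge discussed above) is deliberately left out of this sketch.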
[jira] [Updated] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-774: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 ExifTool Parser --- Key: TIKA-774 URL: https://issues.apache.org/jira/browse/TIKA-774 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.0 Environment: Requires ExifTool to be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: features, newbie, patch Fix For: 1.8 Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types. In the core project: An ExifTool interface is added which contains Property objects that define the metadata fields available. An additional Property constructor is added for the internalTextBag type. In the parsers project: An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser but those have not been changed at this time. An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing tika and Drew Noakes metadata fields if enabled. An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files. An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg.
[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1208: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Migrate Any23 mime contributions to Tika Key: TIKA-1208 URL: https://issues.apache.org/jira/browse/TIKA-1208 Project: Tika Issue Type: Sub-task Components: mime Reporter: Lewis John McGibbney Fix For: 1.8 Attachments: TIKA-1208.patch We begin with one of the most obvious areas in which there is overlap. In short, the appeal of this package is the addition of detection for the following types: - text/n3 - text/rdf+n3 - application/n3 - text/x-nquads - text/rdf+nq - text/nq - application/nq - text/turtle - application/x-turtle - application/turtle - application/trix Therefore, although both Tika and Any23 perform Mimetype-related tasks, there is a contribution to be made. This involves the transferral of code pertaining to pattern recognition, Mimetype XML definitions within tika-mimetypes.xml and a Purifier implementation that removes any blank characters at the head of a file that might prevent its MIME type detection.
[jira] [Updated] (TIKA-1220) Parser implementration for IFC files
[ https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1220: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Parser implementration for IFC files Key: TIKA-1220 URL: https://issues.apache.org/jira/browse/TIKA-1220 Project: Tika Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.8 Attachments: 2012-03-23-Duplex-Programming.ifc The Industry Foundation Classes (IFC) [0] data model is intended to describe building and construction industry data. For the sake of argument, it can be considered as a more intelligent successor to the .dwg data models used within CAD models. I've tracked down a potential 3rd party library [1] which we may be able to wrap and use within Tika. However, the provided software packages are licensed under: http://creativecommons.org/licenses/by-nc-sa/3.0/de/ so I am currently over on legal-discuss@ in an attempt to see if it is possible to wrap some code and contribute it to tika-parsers. When I get feedback from legal-discuss, and if this is a go-ahead, I'll need to help the developers package the code as Maven artifact(s); then I will progress with writing the implementation. [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes [1] http://www.ifctoolsproject.com/
[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-891: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Fix For: 1.8 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1238) Update OutlookExtractor to handle codepage identification more rigorously
[ https://issues.apache.org/jira/browse/TIKA-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1238: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Update OutlookExtractor to handle codepage identification more rigorously - Key: TIKA-1238 URL: https://issues.apache.org/jira/browse/TIKA-1238 Project: Tika Issue Type: Improvement Components: parser Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.8 Since OutlookExtractor's codepage detection chunk was written, POI's HSMF has added more robust capabilities for identifying codepages in Outlook .msg files. As a first step to integrating those improvements, I'll copy and paste some of POI's code into OutlookExtractor. As a second step, I'll expose more of HSMF's capabilities within POI and then factor out the duplicate code in Tika.
[jira] [Updated] (TIKA-1324) Use a common path for the Tika Server unpacker resources
[ https://issues.apache.org/jira/browse/TIKA-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1324: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Use a common path for the Tika Server unpacker resources Key: TIKA-1324 URL: https://issues.apache.org/jira/browse/TIKA-1324 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.5 Reporter: Nick Burch Fix For: 1.8 Currently, the two different methods of the Tika Server unpacker endpoint don't share a common url prefix, which causes them to clash with the new welcome endpoint. As discussed on the mailing list, we should change these to have a common prefix, so that the urls are then: * /unpack/{id} * /unpack/all/{id} After making the change, the changelog and release notes need to be updated for it, as it is a breaking change for the (handful of) users of the endpoint. This will help with TIKA-1269
[jira] [Updated] (TIKA-1273) old tika-server jar artifact contains no manifest so not able to invoke from shell
[ https://issues.apache.org/jira/browse/TIKA-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1273: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 old tika-server jar artifact contains no manifest so not able to invoke from shell -- Key: TIKA-1273 URL: https://issues.apache.org/jira/browse/TIKA-1273 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.8 I've never ever used the old tika-server artifact which is generated when one installs the server module. It needs to contain a manifest otherwise it cannot be invoked from the shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1445: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1384) Use tika-parent dependency management for common dependencies
[ https://issues.apache.org/jira/browse/TIKA-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1384: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Use tika-parent dependency management for common dependencies - Key: TIKA-1384 URL: https://issues.apache.org/jira/browse/TIKA-1384 Project: Tika Issue Type: Improvement Components: packaging Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 If we list a dependency in the dependencyManagement section of the tika-parent pom.xml, we can then include that dependency in a child module without specifying a version. For example, I updated the junit dependencies yesterday: https://github.com/apache/tika/commit/2fec4c61267ed2c465e7411d50fbf7e9841523d5 By using dependencyManagement, we can update the dependency version for all modules at once, rather than have different versions in different modules, like it was for junit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-985) Support for HTML5 elements
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-985: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Support for HTML5 elements -- Key: TIKA-985 URL: https://issues.apache.org/jira/browse/TIKA-985 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.2 Reporter: Markus Jelsma Fix For: 1.8 Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, TIKA-985-1.3-3.patch, TIKA-985-1.5.patch TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, section). This prevents some custom ContentHandlers from reading expected elements and/or attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder
[ https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1343: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Create a Tika Translator implementation that uses JoshuaDecoder --- Key: TIKA-1343 URL: https://issues.apache.org/jira/browse/TIKA-1343 Project: Tika Issue Type: Bug Components: translation Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 The Joshua Decoder toolkit is a BSD licensed Java-based statistical machine translation system hosted at Github: http://joshua-decoder.org/ Joshua takes in corpuses and trains models that can then be used to do language translation. Currently there is support for e.g., Spanish-English, Indian dialects-English, Chinese-English, and a few others. https://github.com/joshua-decoder/joshua/ It would be nice to build a Tika Translator on top of Joshua. There are of course several issues with this: * the models are huge - so we'll need a separate package or Maven module, maybe tika-translate-joshua or something to release the models and we'll need to build the models. I just went through the process of building the Spanish-English one, and it still needs to be rebuilt b/c I did it wrong, but it took over a day * there is a configuration for Joshua, and so we need some way of passing that config into the Translator. Not sure of the best way to do this. * Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0 Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual install into my Maven repo for brave souls out there that want to try it.
[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1295: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Make some Dublin Core items multi-valued Key: TIKA-1295 URL: https://issues.apache.org/jira/browse/TIKA-1295 Project: Tika Issue Type: Bug Components: metadata Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.8 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, dc:title, dc:description and dc:rights should allow multiple values because of language alternatives. Unless anyone objects in the next few days, I'll switch those to Property.toInternalTextBag() from Property.toInternalText(). I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
[ https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1059: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Better Handling of InterruptedException in ExternalParser and ExternalEmbedder -- Key: TIKA-1059 URL: https://issues.apache.org/jira/browse/TIKA-1059 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Fix For: 1.8 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
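A minimal sketch of the handling suggested in TIKA-1059, combining both options: restore the interrupt flag, then re-throw wrapped. The names here are assumptions for illustration only: `WrappedInterruptException` stands in for {{TikaException}}, and `BlockingCall` stands in for whatever blocking wait the external parser performs.

```java
public class InterruptHandling {

    /** Hypothetical stand-in for TikaException; not the real class. */
    public static class WrappedInterruptException extends Exception {
        public WrappedInterruptException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for a blocking wait, e.g. waiting on an external process. */
    public interface BlockingCall {
        void run() throws InterruptedException;
    }

    public static void runBlocking(BlockingCall call) throws WrappedInterruptException {
        try {
            call.run();
        } catch (InterruptedException e) {
            // Catching InterruptedException clears the thread's interrupt status,
            // so set it again for callers further up the stack.
            Thread.currentThread().interrupt();
            // Also surface the failure instead of silently ignoring it.
            throw new WrappedInterruptException("Interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate an interrupt arriving
        try {
            runBlocking(() -> Thread.sleep(1000));
        } catch (WrappedInterruptException e) {
            System.out.println("cause: " + e.getCause().getClass().getSimpleName());
            System.out.println("flag restored: " + Thread.interrupted());
        }
    }
}
```

Restoring the flag matters because code higher up (a thread pool, a cancellation check) typically relies on it to notice that the thread was asked to stop.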
[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example
[ https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1417: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Create Extract Embedded Images from PDFs Example Key: TIKA-1417 URL: https://issues.apache.org/jira/browse/TIKA-1417 Project: Tika Issue Type: Improvement Components: example Reporter: Tyler Palsulich Priority: Minor Fix For: 1.8 Users commonly want to turn on extraction of images embedded in PDFs (e.g. TIKA-1414). Tika has the capability, but it's not clear how to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1442: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Upgrade to PDFBox 1.8.8 --- Key: TIKA-1442 URL: https://issues.apache.org/jira/browse/TIKA-1442 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Fix For: 1.8 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7. Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1328) Translate Metadata and Content
[ https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1328: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Translate Metadata and Content -- Key: TIKA-1328 URL: https://issues.apache.org/jira/browse/TIKA-1328 Project: Tika Issue Type: New Feature Components: translation Reporter: Tyler Palsulich Fix For: 1.8 Right now, Translation is only done on Strings. Ideally, users would be able to turn on translation while parsing. I can think of a couple options: - Make a TranslateAutoDetectParser. Automatically detect the file type, parse it, then translate the content. - Make a Context switch. When true, translate the content regardless of the parser used. I'm not sure the best way to go about this method, but I prefer it over another Parser. Regardless, we need a black or white list for translation. I think black list would be the way to go -- which fields should not be translated (dates, versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any other open source translation libraries? If we were really lucky, it wouldn't depend on an online service. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1425) Automatic batching of Microsoft service calls
[ https://issues.apache.org/jira/browse/TIKA-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1425: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Automatic batching of Microsoft service calls - Key: TIKA-1425 URL: https://issues.apache.org/jira/browse/TIKA-1425 Project: Tika Issue Type: Improvement Components: translation Affects Versions: 1.6 Reporter: Lewis John McGibbney Fix For: 1.8 Right now when I use the following code I get the stack trace at the bottom of this description. This seems to be because the Request URI is too large to make the service request. We need to have a mechanism within the call to Tika.translate which will, on a service-by-service basis, determine the maximum Request URI which can be sent. I believe that this should be on the Tika side, as how else am I meant to know the maximum request size? {code:title=translator.java|borderStyle=solid} +Translator translate = new MicrosoftTranslator(); +((MicrosoftTranslator) translate).setId(...); +((MicrosoftTranslator) translate).setSecret(...); for (java.util.Map.Entry<Text, Parse> entry : parseResult) { Parse parse = entry.getValue(); LOG.info("-\nUrl\n---\n"); @@ -201,7 +207,7 @@ System.out.print(parse.getData().toString()); if (dumpText) { LOG.info("-\nParseText\n-\n"); -System.out.print(parse.getText()); +System.out.print(translate.translate(parse.getText(), "fr")); } {code} {code:title=stacktrace.log|borderStyle=solid} Exception in thread "main" java.lang.Exception: [microsoft-translator-api] Error retrieving translation : Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=to=frtext=%D0%A4%D0... ... 
at com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:202) at com.memetix.mst.translate.Translate.execute(Translate.java:61) at com.memetix.mst.translate.Translate.execute(Translate.java:76) at org.apache.tika.language.translate.MicrosoftTranslator.translate(MicrosoftTranslator.java:104) at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:210) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:228) Caused by: java.io.IOException: Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=to=frtext=%D0%A4%D0%BE%D1%80%D1%83%D0%B... ... at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244) at com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:178) at com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:199) ... 6 more Caused by: java.io.IOException: Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=to=frtext=%D0%A4%D0%BE... ... 
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468) at com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:177) ... 7 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
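One possible shape for the batching asked for in TIKA-1425 is to split the text on whitespace so that the URL-encoded form of each chunk stays under a per-service limit. This is only a sketch: `RequestBatcher`, `split` and the `maxEncodedLength` parameter are hypothetical names, and the actual limit of any given service is not known here. Note how quickly encoded size grows for non-ASCII text, which is what the 414 above suggests: a single Cyrillic letter becomes six characters once encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class RequestBatcher {

    /** Length of the text once URL-encoded as UTF-8, i.e. what ends up in the request URI. */
    static int encodedLength(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").length();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /**
     * Greedily packs whitespace-separated words into chunks whose encoded length
     * does not exceed maxEncodedLength. Runs of whitespace are normalized to one
     * space. A single word that is too long on its own is emitted as an oversized
     * chunk rather than being cut mid-word.
     */
    public static List<String> split(String text, int maxEncodedLength) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() == 0 || encodedLength(candidate) <= maxEncodedLength) {
                current.setLength(0);
                current.append(candidate);
            } else {
                chunks.add(current.toString());
                current.setLength(0);
                current.append(word);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each chunk would be sent as a separate translate call and the results joined.
        for (String chunk : split("one two three four five six", 12)) {
            System.out.println(chunk + " (encoded length " + encodedLength(chunk) + ")");
        }
    }
}
```

Splitting on sentence boundaries instead of words would probably give better translations; word boundaries are used here only to keep the sketch short.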
[jira] [Updated] (TIKA-1408) Fix version for tikadotnet to be tracked along with trunk and release version
[ https://issues.apache.org/jira/browse/TIKA-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1408: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Fix version for tikadotnet to be tracked along with trunk and release version - Key: TIKA-1408 URL: https://issues.apache.org/jira/browse/TIKA-1408 Project: Tika Issue Type: Bug Components: packaging Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 As reported by [~thaichat04] the tikadotnet versioning doesn't match up with trunk. This is because we aren't releasing this code yet and it's not part of the pom.xml file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1072) AIOOBE when handling embedded document in .doc file
[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1072: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 AIOOBE when handling embedded document in .doc file --- Key: TIKA-1072 URL: https://issues.apache.org/jira/browse/TIKA-1072 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.8 Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin I have a Word (.doc) document that hits an exception when I run: {noformat} java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc {noformat} Here's the exception: {noformat} Caused by: java.lang.ArrayIndexOutOfBoundsException: 40 at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225) at org.apache.poi.poifs.filesystem.Ole10Native.init(Ole10Native.java:139) at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) {noformat} It happens when we try to parse an OLE10 embedded object ... the code that does this parsing captures and ignores Ole10NativeException and skips the entry ... so I'm wondering if we should also catch AIOOBE and skip the entry? Ie, maybe this entry really is not OLE10, and the Ole10Native code is failing to throw Ole10NativeException for it? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
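The "also catch AIOOBE and skip the entry" option raised above could look roughly like this. It is a sketch with made-up names: `NotOle10Exception` stands in for Ole10NativeException and `EntryParser` for the POI parsing call; neither is the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipMalformedEntries {

    /** Hypothetical stand-in for Ole10NativeException. */
    public static class NotOle10Exception extends Exception {
        public NotOle10Exception(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the code that parses one embedded entry. */
    public interface EntryParser {
        String parse(byte[] data) throws NotOle10Exception;
    }

    /** Parses every entry, skipping those that are rejected or too short to read. */
    public static List<String> parseAll(List<byte[]> entries, EntryParser parser) {
        List<String> parsed = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                parsed.add(parser.parse(entry));
            } catch (NotOle10Exception | ArrayIndexOutOfBoundsException e) {
                // Treat a truncated entry the same as "not an OLE10 object":
                // skip it and keep going with the rest of the document.
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Toy parser: expects a marker byte, then reads a value at a fixed offset.
        EntryParser headerReader = data -> {
            if (data[0] != 2) {
                throw new NotOle10Exception("unexpected marker");
            }
            return "len=" + data[3];
        };
        List<byte[]> entries = Arrays.asList(
                new byte[] {2, 0, 0, 7}, // well-formed
                new byte[] {2},          // truncated: reading index 3 fails
                new byte[] {9, 0, 0, 1}  // wrong marker
        );
        System.out.println(parseAll(entries, headerReader));
    }
}
```

The trade-off is the one hinted at in the issue: catching the runtime exception in the caller hides what may really be a missing bounds check in the parsing code itself.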
[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE
[ https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1308: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Support in memory parse mode(don't create temp file): to support run Tika in GAE Key: TIKA-1308 URL: https://issues.apache.org/jira/browse/TIKA-1308 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: yuanyun.cn Labels: gae Fix For: 1.8 I am trying to use Tika in GAE and wrote a simple servlet to extract metadata from a jpeg: String urlStr = req.getParameter("imageUrl"); byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr)); ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData); Metadata metadata = new Metadata(); BodyContentHandler ch = new BodyContentHandler(); AutoDetectParser parser = new AutoDetectParser(); parser.parse(bais, ch, metadata, new ParseContext()); bais.close(); This fails with the exception: Caused by: java.lang.SecurityException: Unable to create temporary file at java.io.File.createTempFile(File.java:1986) at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66) at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533) at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) I checked the code: org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, Metadata, ParseContext) creates a temp file from the input stream. I can understand why Tika creates a temp file from the stream: so that it can parse the content multiple times. But as GAE and other cloud servers are getting more popular, is it possible to avoid creating the temp file? Instead we could copy the original stream to a byte array stream, so Tika can still parse it multiple times. 
-- This will impose a limit on the file size, as Tika keeps the whole file in memory, but it would make Tika work in GAE and maybe other cloud servers. We could add a parameter to parser.parse to indicate whether to parse in memory only.
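The in-memory alternative proposed in TIKA-1308 can be illustrated without any Tika classes. The sketch below buffers a stream into a byte array, enforces the size limit mentioned above, and hands out a fresh stream for each pass. `InMemoryRereadable`, `buffer` and `maxBytes` are hypothetical names chosen for this example, not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class InMemoryRereadable {

    private final byte[] data;

    private InMemoryRereadable(byte[] data) {
        this.data = data;
    }

    /** Reads the whole stream into memory, refusing inputs larger than maxBytes. */
    public static InMemoryRereadable buffer(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("Input exceeds in-memory limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return new InMemoryRereadable(out.toByteArray());
    }

    /** Returns a new stream positioned at the start; no temp file is involved. */
    public InputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    public int size() {
        return data.length;
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(new byte[] {10, 20, 30});
        InMemoryRereadable buffered = buffer(source, 1024);
        // Two independent passes over the same bytes.
        System.out.println(buffered.newStream().read());
        System.out.println(buffered.newStream().read());
    }
}
```

The explicit `maxBytes` cap is the design choice worth noting: without it, a large upload would be held entirely on the heap, which is the file-size concern raised in the description.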
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-819: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Make Option to Exclude Embedded Files' Text for Text Content Key: TIKA-819 URL: https://issues.apache.org/jira/browse/TIKA-819 Project: Tika Issue Type: New Feature Components: general Affects Versions: 1.0 Environment: Windows-7 + JDK 1.6 u26 Reporter: Albert L. Fix For: 1.8 It would be nice to be able to disable text content from embedded files. For example, if I have a DOCX with an embedded PPTX, then I would like the option to disable text from the PPTX from showing up when asking for the text content from DOCX. In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1276: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Missing embedded dependencies in tika-bundle Key: TIKA-1276 URL: https://issues.apache.org/jira/browse/TIKA-1276 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.5 Environment: OSGI, Apache Felix via Apache Sling Launcher Reporter: Rupert Westenthaler Fix For: 1.8 Attachments: TIKA-1276_20140423_rwesten.diff, TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, TIKA-1276_20140428_rwesten.diff While updating from tika 1.2 to 1.5 I that the `org.apache.tika:tika-bundle:1.5` module has some missing dependences. 1. `com.uwyn:jhighlight:1.0` is not embedded Because of that installing the bundle results in the following exception {code} org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer)) org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement [103.0] osgi.wiring.package; (osgi.wiring.package=com.uwyn.jhighlight.renderer) at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) at java.lang.Thread.run(Thread.java:744) {code} 2. `org.ow2.asm:asm:4.1` is not embedded because `org.apache.tika:tika-core:1.5` uses `org.ow2.asm-debug-all:asm:4.1` and therefore the `Embed-Dependency` directive `asm` does not match any dependency. 
Because of that one does get the following exception (after fixing (1)): {code} org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0 org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; ((osgi.wiring.package=org.objectweb.asm)(version=4.1.0)(!(version=5.0.0))) at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) at java.lang.Thread.run(Thread.java:744) {code} There are two possibilities to fix this: (a) change the `Embed-Dependency` to `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the tika-bundle pom file. 3. `edu.ucar:netcdf:4.2-min` is not embedded. Because of that one does get the following exception (after fixing (1) and (2)): {code} org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)) org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2) at org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) at java.lang.Thread.run(Thread.java:744) {code} 4. 
The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime. After fixing the above issues the tika-bundle was started successfully. However, when extracting EXIF metadata from a jpeg image I got the following exception. {code} java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112) at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71) at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91) at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56) [..]
[jira] [Updated] (TIKA-1390) Create tika-example module
[ https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1390: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Create tika-example module -- Key: TIKA-1390 URL: https://issues.apache.org/jira/browse/TIKA-1390 Project: Tika Issue Type: Bug Components: example Reporter: Tyler Palsulich Fix For: 1.8 This issue will track the initial creation of the tika-example module. Subtasks will be used for the first few examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1426) Let's allow users to specify a tika config file on the commandline for tika-app and tika-server
[ https://issues.apache.org/jira/browse/TIKA-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1426: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Let's allow users to specify a tika config file on the commandline for tika-app and tika-server --- Key: TIKA-1426 URL: https://issues.apache.org/jira/browse/TIKA-1426 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.8 It would be handy to be able to specify a tika-config file when using tika-app and tika-server. I added this capability to tika-app as part of TIKA-1418. I should have opened a separate issue at the time (mea culpa). This present issue covers both tika-app and tika-server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1315) Basic list support in WordExtractor
[ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1315: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Basic list support in WordExtractor --- Key: TIKA-1315 URL: https://issues.apache.org/jira/browse/TIKA-1315 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Filip Bednárik Priority: Minor Fix For: 1.8 Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch Hello guys, I am really sorry to post an issue like this, but I have no other way of contacting you and I don't quite understand how you manage forks and pull requests (I don't think you do that). Plus I don't know your coding styles and stuff. In my project I needed Tika to parse numbered lists from Word .doc documents, but TIKA doesn't support it. So I looked for a solution and found one here: http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ . So I adapted this solution to Apache TIKA with a few fixes and improvements. Anyway, feel free to use any of it so it can help people who struggle with lists in TIKA like I did. Attached files are: updated test, fixed WordExtractor, added ListUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method
[ https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1318: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Use of Deprecated Word6Extractor.getParagraphText() Method -- Key: TIKA-1318 URL: https://issues.apache.org/jira/browse/TIKA-1318 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Tyler Palsulich Priority: Minor Labels: deprecation Fix For: 1.8 org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the deprecated Word6Extractor.getParagraphText() method. getParagraphText() is supposed to return a String[] with an element for each paragraph in the text. The replacement is getText(), which lets paragraph, cell, etc separation be implementation specific. I'm not sure, at this point, how the POI WordExtractor separates them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
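A rough sketch of what moving from getParagraphText() to getText() could look like on the calling side. It assumes, without confirmation from the issue or from POI, that the single string returned by getText() separates paragraphs with line breaks; the class and method names (`ParagraphSplitter`, `splitParagraphs`) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or POI. It rebuilds a
// per-paragraph list from one flat text string, ASSUMING that
// paragraphs are separated by line breaks (\r\n, \r or \n).
class ParagraphSplitter {

    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return paragraphs;
        }
        // Try \r\n first so a Windows line break is not split twice.
        for (String candidate : text.split("\r\n|\r|\n")) {
            // Skip blank pieces produced by consecutive line breaks.
            if (!candidate.trim().isEmpty()) {
                paragraphs.add(candidate);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String flat = "First paragraph\r\nSecond paragraph\r\rThird";
        for (String paragraph : splitParagraphs(flat)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

If table cells or other structures turn out to use different separators, the regular expression above would need to change; that is exactly the implementation-specific part the issue is unsure about.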
[jira] [Updated] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1106: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: Wish Components: general Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Priority: Minor Labels: entity, geospatial Fix For: 1.8 I've been evaluating CLAVIN as a way to extract location information from unstructured text. It seems like meshing it with Tika in some way would make a lot of sense. From CLAVIN website... {quote} CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution. It combines a variety of open source tools with natural language processing techniques to extract location names from unstructured text documents and resolve them against gazetteer records. Importantly, CLAVIN does not simply look up location names; rather, it uses intelligent heuristics in an attempt to identify precisely which Springfield (for example) was intended by the author, based on the context of the document. CLAVIN also employs fuzzy search to handle incorrectly-spelled location names, and it recognizes alternative names (e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic entity. By enriching text documents with structured geo data, CLAVIN enables hierarchical geospatial search and advanced geospatial analytics on unstructured data. {quote} There was only one other instance of the word clavin mentioned in the ASF jira site so I thought it was definitely worth posting here. https://github.com/Berico-Technologies/CLAVIN -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1269) Self-hosted documentation for the JAX-RS Server
[ https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1269: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Self-hosted documentation for the JAX-RS Server --- Key: TIKA-1269 URL: https://issues.apache.org/jira/browse/TIKA-1269 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.5 Reporter: Nick Burch Fix For: 1.8 Attachments: TIKA-1269-miredot.patch, enable-enunciate.patch Currently, if you fire up the JAX-RS Tika Server and go to the root of the server in a web browser, you get an empty page back. You have to know to head over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available URLs are. We should self-host some simple documentation at the root of the server, so that people can discover what it offers. Ideally, this should be largely auto-generated based on the endpoints, so that we don't risk missing things when we add new features. This will also allow us to potentially offer a sample running version of the server for people to discover Tika with. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1395) Create embedded image extraction example
[ https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1395: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Create embedded image extraction example Key: TIKA-1395 URL: https://issues.apache.org/jira/browse/TIKA-1395 Project: Tika Issue Type: Sub-task Components: example Reporter: Tyler Palsulich Priority: Minor Fix For: 1.8 Create an example of how to do embedded image extraction and parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1383) Simplify TikeServerCli endpoint setup code
[ https://issues.apache.org/jira/browse/TIKA-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1383: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Simplify TikeServerCli endpoint setup code -- Key: TIKA-1383 URL: https://issues.apache.org/jira/browse/TIKA-1383 Project: Tika Issue Type: Improvement Components: server Reporter: Sergey Beryozkin Assignee: Sergey Beryozkin Priority: Trivial Fix For: 1.8 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1167) Embedded object not extracted
[ https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1167: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Embedded object not extracted - Key: TIKA-1167 URL: https://issues.apache.org/jira/browse/TIKA-1167 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Daniel Bonniot de Ruisselet Priority: Critical Fix For: 1.8 Attachments: Doc w Structure that wont extract.docx For the attached docx, tika seems to detect the embedded object, as shown by this tag: {{<div class="embedded" id="rId10"/>}} However, extraction itself (using -z on the command line, or using the API) does not seem to work for this object: {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}} {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to /tmp/tika/rId9_image1.wmf}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-995) XHTMLContentHandler doesn't pass attributes of body element
[ https://issues.apache.org/jira/browse/TIKA-995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-995: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 XHTMLContentHandler doesn't pass attributes of body element --- Key: TIKA-995 URL: https://issues.apache.org/jira/browse/TIKA-995 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: Markus Jelsma Fix For: 1.8 Attachments: TIKA-995-1.3-1.patch, TIKA-995-unit.patch XHTMLContentHandler.startElement() uses lazyHead() for the body element because it's defined in the AUTO Set. As a consequence, attributes of the body element are not passed to downstream content handlers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1307) Jenkins Java7 job requires a profile in order to build 'tika-java7' module.
[ https://issues.apache.org/jira/browse/TIKA-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1307: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Jenkins Java7 job requires a profile in order to build 'tika-java7' module. --- Key: TIKA-1307 URL: https://issues.apache.org/jira/browse/TIKA-1307 Project: Tika Issue Type: Bug Components: packaging Affects Versions: 1.5 Reporter: Lewis John McGibbney Fix For: 1.8 N.B. Can someone please create a *build* tag in the Admin area, then assign it to this issue? This issue was flagged up by Hong-Thai during the DISCUSS nightly builds thread recently: http://www.mail-archive.com/dev%40tika.apache.org/msg07963.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse
[ https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1366: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse Key: TIKA-1366 URL: https://issues.apache.org/jira/browse/TIKA-1366 Project: Tika Issue Type: Improvement Components: server Reporter: Sergey Beryozkin Priority: Minor Fix For: 1.8 Some of Tika Server services will benefit from optionally supporting JAX-RS 2.0 AsyncResponse -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1108) Represent individual slides in pptx
[ https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1108: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Represent individual slides in pptx --- Key: TIKA-1108 URL: https://issues.apache.org/jira/browse/TIKA-1108 Project: Tika Issue Type: Improvement Components: parser Reporter: Daniel Bonniot de Ruisselet Fix For: 1.8 When parsing ppt, tika produces for each slide: <div class="slide"> However, for pptx these seem to be missing; all the text is directly under <body>. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries
[ https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1079: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Word document hits AIOOBE in SummaryExtractor.parseSummaries Key: TIKA-1079 URL: https://issues.apache.org/jira/browse/TIKA-1079 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.8 Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc I'm not yet sure if this is a corrupted document (though, MS Word opens it just fine) or a bug in POI ... but I hit this exc when running it through TikaCLI: {noformat} java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.poi.hpsf.CodePageString.init(CodePageString.java:161) at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158) at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163) at org.apache.poi.hpsf.Property.init(Property.java:164) at org.apache.poi.hpsf.Section.init(Section.java:277) at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451) at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:246) at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78) at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-539: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Encoding detection is too biased by encoding in meta tag Key: TIKA-539 URL: https://issues.apache.org/jira/browse/TIKA-539 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 0.8, 0.9, 0.10 Reporter: Reinhard Schwab Assignee: Ken Krugler Fix For: 1.8 Attachments: TIKA-539.patch, TIKA-539_2.patch If the encoding in the meta tag is wrong, that encoding is still the one detected, even if the right encoding was set in the metadata beforehand (which can come from the HTTP response header). Test code to reproduce: static String content = "<html><head>\n" + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />" + "</head><body>Über den Wolken\n</body></html>"; /** * @param args * @throws IOException * @throws TikaException * @throws SAXException */ public static void main(String[] args) throws IOException, SAXException, TikaException { Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, "text/html"); metadata.set(Metadata.CONTENT_ENCODING, "UTF-8"); System.out.println(metadata.get(Metadata.CONTENT_ENCODING)); InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8")); AutoDetectParser parser = new AutoDetectParser(); BodyContentHandler h = new BodyContentHandler(1); parser.parse(in, h, metadata, new ParseContext()); System.out.print(h.toString()); System.out.println(metadata.get(Metadata.CONTENT_ENCODING)); } -- This message was sent by Atlassian JIRA (v6.3.4#6332)
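To make the consequence of trusting the wrong charset concrete, here is a minimal sketch that uses only the JDK, not Tika. It takes the same sample text as the report, encodes it as UTF-8 (the encoding the bytes really have) and decodes it as ISO-8859-1 (the encoding the meta tag wrongly claims). The class and method names are hypothetical and chosen only for this illustration; the sketch does not reproduce Tika's detection logic.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows what happens to UTF-8 bytes when they are
// decoded with the charset named in a wrong meta tag (ISO-8859-1).
class WrongCharsetDemo {

    // Encode as UTF-8, then decode with the (wrong) ISO-8859-1 charset.
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encode and decode with the same charset: the text survives intact.
    static String roundTripUtf8(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00DC is the capital U with diaeresis from the report's sample text.
        String original = "\u00DCber den Wolken";
        String garbled = decodeUtf8BytesAsLatin1(original);
        // The two UTF-8 bytes of the first letter become two separate characters.
        System.out.println(original.length() + " -> " + garbled.length());
        System.out.println(roundTripUtf8(original).equals(original));
    }
}
```

The single two-byte character at the start turns into two unrelated characters, which is the kind of corruption a caller sees when the meta tag overrides a correct encoding supplied through the metadata.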
[jira] [Updated] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1423: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Build a parser to extract data from GRIB formats Key: TIKA-1423 URL: https://issues.apache.org/jira/browse/TIKA-1423 Project: Tika Issue Type: New Feature Components: metadata, mime, parser Affects Versions: 1.6 Reporter: Vineet Ghatge Assignee: Vineet Ghatge Priority: Critical Labels: features, newbie Fix For: 1.8 Attachments: GribParser.java, NLDAS_FORA0125_H.A20130112.1200.002.grb, fileName.html, gdas1.forecmwf.2014062612.grib2 The Arctic dataset contains a MIME format called GRIB - General Regularly-distributed Information in Binary form http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format used in meteorology to store historical and forecast weather data. There are 2 different types of the format: GRIB 0 and GRIB 2. The focus will be on GRIB 2, which is the most prevalent. Each GRIB record intended for either transmission or storage contains a single parameter with values located at an array of grid points, or represented as a set of spectral coefficients, for a single level (or layer), encoded as a continuous bit stream. Logical divisions of the record are designated as sections, each of which provides control information and/or data. A GRIB record consists of six sections, two of which are optional: (0) Indicator Section (1) Product Definition Section (PDS) (2) Grid Description Section (GDS) optional (3) Bit Map Section (BMS) optional (4) Binary Data Section (BDS) (5) '7777' (ASCII Characters) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
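The section layout above suggests a cheap first check before any real parsing. The following is a sketch under stated assumptions, not the attached GribParser.java: it assumes, based on the published GRIB layout rather than on anything in this issue, that the Indicator Section starts with the four ASCII characters GRIB, that the edition number sits in the eighth octet, and that a record ends with the four ASCII characters 7777. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Tika.
class GribIndicator {

    // Returns the edition number stored in octet 8 of the Indicator
    // Section, or -1 if the bytes do not start with the ASCII magic "GRIB".
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        // Octets are unsigned, Java bytes are signed: mask to 0..255.
        return header[7] & 0xFF;
    }

    // Checks whether the record ends with the ASCII end marker "7777".
    static boolean hasEndMarker(byte[] record) {
        if (record == null || record.length < 4) {
            return false;
        }
        String tail = new String(record, record.length - 4, 4, StandardCharsets.US_ASCII);
        return "7777".equals(tail);
    }

    public static void main(String[] args) {
        byte[] sample = {'G', 'R', 'I', 'B', 0, 0, 0, 2, '7', '7', '7', '7'};
        System.out.println("edition = " + edition(sample));
        System.out.println("end marker present = " + hasEndMarker(sample));
    }
}
```

A check like this only identifies a candidate record; the Product Definition, Grid Description, Bit Map and Binary Data sections still need a real decoder.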
[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature
[ https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1379: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 error in Tika().detect for xml files with xades signature - Key: TIKA-1379 URL: https://issues.apache.org/jira/browse/TIKA-1379 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.4 Reporter: Alessandro De Angelis Fix For: 1.8 we tried to get the mime type of an xml file with xades signature embedded. the result is text/html and not the expected text/xml or application/xml. here is an example of the xml file: VERBALI ad_cod=D69017 batch_id=0 cds_cod=D69 data_app=2013-09-23 VERBALE Id=1 tipologia=Verbale esame VERB_NUM00094853 0003 2/VERB_NUM DATA_APP2013-09-23/DATA_APP DATA_ESA2013-09-23/DATA_ESA AD_CODD69017/AD_COD ADFILOSOFIA DELLA SCIENZA/AD CDS_CODD69/CDS_COD CDSTEATRO E ARTI VISIVE/CDS TIPO_ESA/TIPO_ESA MAT1233456/MAT NOMEPAOLINO/NOME COGNOMEPAPERINO/COGNOME VOTO23.0/VOTO VOTODECOD23/VOTODECOD CAUSALE/CAUSALE TIPO_MODULO/TIPO_MODULO IMG_PATH/IMG_PATH AA_SES_ID2012/AA_SES_ID AD_CFU6.0/AD_CFU NOTA/NOTA ATENEO9/ATENEO ATENEO_DESجامعة البندقية - TEST/ATENEO_DES TIPO_DOCUMENTOVerbale_3/TIPO_DOCUMENTO TITOLARE_PROCEDIMENTOQUI QUO QUA/TITOLARE_PROCEDIMENTO AD_STU_CODD69017/AD_STU_COD AD_STUFILOSOFIA DELLA SCIENZA/AD_STU CDS_STU_CODD69/CDS_STU_COD CDS_STUTEATRO E ARTI VISIVE/CDS_STU DOCENTEQUI QUO QUA/DOCENTE DATA_DOCUMENTO26-09-2013 09:55:53 CEST(+0200)/DATA_DOCUMENTO SOFTWARE_DI_CREAZIONE NOME3/NOME VERSIONE11.09.03/VERSIONE /SOFTWARE_DI_CREAZIONE /VERBALEds:Signature xmlns:ds=http://www.w3.org/2000/09/xmldsig#; Id=sig08744308748201048377 ds:SignedInfo ds:CanonicalizationMethod Algorithm=http://www.w3.org/2006/12/xml-c14n11;/ds:CanonicalizationMethod ds:SignatureMethod Algorithm=http://www.w3.org/2001/04/xmldsig-more#rsa-sha256;/ds:SignatureMethod ds:Reference URI= ds:Transforms ds:Transform Algorithm=http://www.w3.org/2002/06/xmldsig-filter2; 
dsig-xpath:XPath xmlns:dsig-xpath=http://www.w3.org/2002/06/xmldsig-filter2; Filter=subtract/descendant::ds:Signature/dsig-xpath:XPath /ds:Transform ds:Transform Algorithm=http://www.w3.org/TR/1999/REC-xslt-19991116; xsl:stylesheet xmlns:kion=http://www.kion.it/webesse3/multilingua; xmlns:xsl=http://www.w3.org/1999/XSL/Transform; exclude-result-prefixes=kion version=1.0 kion:ml module=FirmaDigitale target=kion/kion:ml xsl:output method=xml/xsl:output xsl:variable name=mostra_ad_figlie select=1/xsl:variable xsl:variable name=verbale_root select=/VERBALI/VERBALE/xsl:variable xsl:variable name=sostituzione_root select=/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO/xsl:variable xsl:variable name=RAGG_ROOT select=/VERBALI/VERBALE/RAGGRUPPAMENTO/xsl:variable xsl:variable name=COMM_ROOT select=/VERBALI/VERBALE/COMMISSIONE/xsl:variable xsl:template match=/ html head meta content=text/html;charset=UTF-8 http-equiv=Content-Type/meta xsl:choose xsl:when test=$sostituzione_root titleDichiarazione conformità Verbale Esame/title /xsl:when xsl:otherwise titleVerbalizzazione esame/title /xsl:otherwise /xsl:choose style type=text/css td {font-family: Arial; font-size:10pt;} div {font-family: Arial; font-size:10pt;} pre {font-family: Arial; font-size:10pt;} /style /head body table xsl:choose xsl:when test=$sostituzione_root trtd align=center colspan=2bigstrongxsl:value-of select=$verbale_root/ATENEO_DES/xsl:value-of/strong/bigbr/br/td/tr trtd align=center colspan=2bigstrongDICHIARAZIONE DI CONFORMITÀ/strong/bigbr/br/td/tr
[jira] [Updated] (TIKA-1435) Update rome dependency to 1.5
[ https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1435: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Update rome dependency to 1.5 - Key: TIKA-1435 URL: https://issues.apache.org/jira/browse/TIKA-1435 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Johannes Mockenhaupt Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.8 Attachments: netcdf-deps-changes.diff Rome 1.5 has been released to Sonatype (https://github.com/rometools/rome/issues/183). Though the website (http://rometools.github.io/rome/) is blissfully ignorant of that. The update is mostly maintenance, adopting slf4j and generics as well as moving the namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1456) Visual Sentiment API parser
[ https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1456: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Visual Sentiment API parser --- Key: TIKA-1456 URL: https://issues.apache.org/jira/browse/TIKA-1456 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Integrate the Visual Sentibank API as a parser for images. We can use Aperture from CMU, it's released under the MIT license: https://github.com/d8w/aperture -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM
[ https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1301: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Establish TikaServer on Apache hosted VM Key: TIKA-1301 URL: https://issues.apache.org/jira/browse/TIKA-1301 Project: Tika Issue Type: Bug Components: server Reporter: Lewis John McGibbney Fix For: 1.8 Over in Any23, Infra recently provisioned us with a nice shiny new VM to run our service on: http://any23.org . I would like to do the same for Tika. I have some scripts on the Any23 VM which will pull stable nightly tika-server snapshots and deploy them to the VM. This is really nice for both devs and users alike. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document
[ https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-988: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 We don't extract a placeholder for a Word document embedded in an Excel document Key: TIKA-988 URL: https://issues.apache.org/jira/browse/TIKA-988 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Fix For: 1.8 Attachments: bug31373.xls In TIKA-956 we fixed the Word parser so that at the point where an embedded document appears, we output a <div class="embedded" id="_XXX"/> tag. It would be nice to do this for documents embedded in Excel too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1387) Add forbidden-apis checker to TIKA build
[ https://issues.apache.org/jira/browse/TIKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1387: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Add forbidden-apis checker to TIKA build Key: TIKA-1387 URL: https://issues.apache.org/jira/browse/TIKA-1387 Project: Tika Issue Type: Improvement Components: general Reporter: Uwe Schindler Assignee: Tyler Palsulich Fix For: 1.8 Attachments: TIKA-1387.palsulich.080614.patch, TIKA-1387.patch, TIKA-1387.patch, TIKA-1387.patch Lucene and many other projects already use the forbidden-apis checker to prevent use of some broken classes/signatures from the JDK. These are especially things using default character sets or default locales. The forbidden-apis checker can also be used to explicitly disallow specific methods if they have security issues (e.g., creating XML parsers without disabling external entity support). The attached patch adds the forbidden-apis checker to the tika-parent pom file with default configuration. Running it fails with many errors in TIKA core already: {noformat} [INFO] --- forbiddenapis:1.6.1:check (default) @ tika-core --- [INFO] Scanning for classes to check... [INFO] Reading bundled API signatures: jdk-unsafe [INFO] Reading bundled API signatures: jdk-deprecated [INFO] Loading classes to check... [INFO] Scanning for API signatures and dependencies... 
[ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.language.LanguageProfilerBuilder (LanguageProfilerBuilder.java:407) [ERROR] Forbidden method invocation: java.lang.String#toUpperCase() [Uses default locale] [ERROR] in org.apache.tika.io.FilenameUtils (FilenameUtils.java:68) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:257) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:395) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:416) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:438) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:532) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:550) [ERROR] Forbidden method invocation: java.lang.String#init(byte[]) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:588) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:656) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:782) [ERROR] Forbidden method invocation: java.lang.String#getBytes() [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:851) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:957) [ERROR] 
Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.io.IOUtils (IOUtils.java:1064) [ERROR] Forbidden method invocation: java.io.OutputStreamWriter#init(java.io.OutputStream) [Uses default charset] [ERROR] in org.apache.tika.sax.WriteOutContentHandler (WriteOutContentHandler.java:93) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser (ExternalParser.java:234) [ERROR] Forbidden method invocation: java.io.InputStreamReader#init(java.io.InputStream) [Uses default charset] [ERROR] in org.apache.tika.parser.external.ExternalParser$3 (ExternalParser.java:294) [ERROR] Forbidden method invocation: java.util.Calendar#getInstance(java.util.Locale) [Uses default locale or time zone] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:83) [ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale] [ERROR] in org.apache.tika.utils.DateUtils (DateUtils.java:91) [ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses default locale]
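For readers unfamiliar with why the checker flags these calls, here is a minimal sketch of the usual replacements, using only the JDK. It is not taken from the attached patch; the class and method names are hypothetical. Each flagged call has an overload that takes an explicit charset or locale, so the result no longer depends on the machine's defaults.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustration only: explicit-charset and explicit-locale variants of
// the calls reported above (String#getBytes(), String#<init>(byte[]),
// String#toLowerCase()).
class ExplicitCharsetAndLocale {

    // Instead of s.getBytes(): always encode as UTF-8.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Instead of new String(bytes): always decode as UTF-8.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Instead of s.toLowerCase(): use a fixed, language-neutral locale.
    static String normalizeKey(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // With a Turkish locale, the capital I lowercases to a dotless i,
        // which is why the no-argument toLowerCase() is risky for keys.
        String turkish = "TITLE".toLowerCase(Locale.forLanguageTag("tr"));
        System.out.println(turkish.equals("title"));
        System.out.println(normalizeKey("TITLE").equals("title"));
        System.out.println(fromUtf8(toUtf8("\u00DCber")).equals("\u00DCber"));
    }
}
```

Whether UTF-8 and Locale.ROOT are the right choices for every reported call site in tika-core is a separate question; some of them may need the charset of the data being read instead.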
[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
[ https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-987: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted Key: TIKA-987 URL: https://issues.apache.org/jira/browse/TIKA-987 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.8 Attachments: picture.doc, picture_3.doc I have two Word docs, both containing the same drawing, but one has text added. In the first case (picture.doc) the extraction is correct: it contains only an embedded image.wmf, and the image is correct when viewed. In the second case (picture_3.doc) the picture is extracted as image (no extension) and is 0 bytes, and an invalid character (mapped to the Unicode replacement character) is inserted before the image: {noformat} <title/> </head> <body><p>�<img src="embedded:image1" alt="image1"/></p> <p/> <p/> <p>vehicle </p> {noformat} (The text vehicle is extracted correctly, though.) I dug a bit, and the second doc has an embedded {SHAPE * MERGEFORMAT} field, on which we invoke WordExtractor.handleSpecialCharacterRuns, and somehow it extracts the 0-byte no-extension image as well as the invalid character. The first doc has no such field (at least not one that's handled by handleSpecialCharacterRuns...). Otherwise I'm not sure how to fix this; something may be going wrong in how POI parses the Pictures from PictureSource.
[jira] [Updated] (TIKA-1416) Refactor Translator Exception Handling
[ https://issues.apache.org/jira/browse/TIKA-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1416: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Refactor Translator Exception Handling -- Key: TIKA-1416 URL: https://issues.apache.org/jira/browse/TIKA-1416 Project: Tika Issue Type: Bug Components: translation Reporter: Tyler Palsulich Fix For: 1.8 `Translator.translate()` currently throws `Exception`. We should make it more specific. The only real limitation comes from MicrosoftTranslator -- the library it uses throws `Exception`, but that shouldn't mean Tika does too.
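One way to narrow a `throws Exception` signature when the underlying library cannot be changed is to catch the broad exception at the boundary and rethrow something specific. The sketch below is only an illustration under stated assumptions: `TranslationException`, `BroadThrowingClient`, and `NarrowingTranslator` are hypothetical names invented here, and none of them is the actual Tika or MicrosoftTranslator API.

```java
import java.io.IOException;

// Hypothetical checked exception; not an actual Tika class.
class TranslationException extends Exception {
    TranslationException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Stand-in for a third-party client whose API declares a bare Exception.
interface BroadThrowingClient {
    String execute(String text, String from, String to) throws Exception;
}

// Hypothetical wrapper that exposes a narrower throws clause.
final class NarrowingTranslator {
    private final BroadThrowingClient client;

    NarrowingTranslator(BroadThrowingClient client) {
        this.client = client;
    }

    String translate(String text, String from, String to)
            throws TranslationException, IOException {
        try {
            return client.execute(text, from, to);
        } catch (IOException e) {
            throw e; // I/O failures pass through unchanged
        } catch (RuntimeException e) {
            throw e; // unchecked exceptions are not wrapped
        } catch (Exception e) {
            // Any other checked exception is wrapped, keeping the cause.
            throw new TranslationException(
                    "Translation failed: " + from + " -> " + to, e);
        }
    }
}
```

Callers can then handle `IOException` and the specific exception separately, without a catch-all `catch (Exception e)` block.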
[jira] [Updated] (TIKA-776) ExifTool Embedder
[ https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-776: --- Fix Version/s: (was: 1.7) 1.8 - push to 1.8 ExifTool Embedder - Key: TIKA-776 URL: https://issues.apache.org/jira/browse/TIKA-776 Project: Tika Issue Type: New Feature Components: metadata Affects Versions: 1.0 Environment: ExifTool is required (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: embed, exiftool, patch Fix For: 1.8 Attachments: tika-parsers-exiftool-embed-patch.txt This patch adds an ExifTool ExternalEmbedder that builds on the work in issues TIKA-774 and TIKA-775. In tika-parsers, an ExiftoolExternalEmbedder is added; it extends ExternalEmbedder to programmatically create an Embedder that calls the ExifTool command line to embed Tika metadata into a file stream. An ExiftoolExternalEmbedderTest unit test is also added; it embeds several IPTC and XMP fields, then parses the resulting file stream to verify the operation.
[jira] [Updated] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1367: Fix Version/s: (was: 1.7) 1.8 - push to 1.8 Tika documentation should list tika-parsers parser dependencies --- Key: TIKA-1367 URL: https://issues.apache.org/jira/browse/TIKA-1367 Project: Tika Issue Type: Improvement Components: documentation Reporter: Sergey Beryozkin Fix For: 1.8 The tika-parsers module has many strong transitive parser dependencies. Maven users of tika-parsers have to exclude all the transitive dependencies manually. Documenting the list of existing transitive dependencies and keeping it up to date will help developers exclude the libraries not needed for a given project.