[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077561#comment-14077561 ]

Vilmos Papp commented on TIKA-1369:
---
Hi,

I've sent a pull request on GitHub to fix this: https://github.com/chrismattmann/tika/pull/1. I hope I sent it to the proper person; if not, where should I send it?

Regards,
Vilmos

Date parsing and thread safety in ImageMetadataExtractor
---
Key: TIKA-1369
URL: https://issues.apache.org/jira/browse/TIKA-1369
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Environment: OS X 10.9.4 Java 7_60
Reporter: John Gibson
Priority: Critical

The {{ImageMetadataExtractor}} uses a static instance of {{SimpleDateFormat}}. This is not thread safe.

{code:title=ImageMetadataExtractor.java}
static class ExifHandler implements DirectoryHandler {
    private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    ...
    public void handleDateTags(Directory directory, Metadata metadata)
            throws MetadataException {
        // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
        Date original = null;
        if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
            original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
            // Unless we have GPS time we don't know the time zone so date must be set
            // as ISO 8601 datetime without timezone suffix (no Z or +/-)
            if (original != null) {
                String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
                metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
            }
        }
        ...
{code}

This is not the first time that SimpleDateFormat (SDF) has caused problems: see TIKA-495 and TIKA-864. In the discussion there, the idea of using alternative thread-safe (and faster) formatters from either Joda-Time or Commons Lang was dismissed because they would add too many dependencies. Given that Tika already has a fairly large laundry list of dependencies to parse content, adding one more JAR to make sure things don't break is probably a good idea.

In addition, because no timezone or locale is specified by either Tika's formatter or the call to com.drew.metadata.Directory, it can wreak havoc during randomized testing. Given that the timezone is unknown, why not just default it to UTC and let the caller guess the timezone? As it stands I have to reparse all of the dates into UTC to get stable behavior across timezones.

--
This message was sent by Atlassian JIRA (v6.2#6252)
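The suggestion in the issue (pin the formatter to UTC and stop sharing one instance) could look roughly like the following. This is only a minimal sketch under stated assumptions: the class and method names (StableDateFormatting, formatNoTimeZone) are hypothetical and not part of Tika, and it simply builds a new SimpleDateFormat on every call rather than using Joda-Time or Commons Lang.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class StableDateFormatting {

    // Hypothetical helper, not Tika code: a new formatter is built per call,
    // so no SimpleDateFormat instance is ever shared between threads.
    static String formatNoTimeZone(Date date) {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        // Explicit zone: the output no longer depends on the JVM default.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // Change the default zone between calls; the result stays the same.
        TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
        System.out.println(formatNoTimeZone(new Date(0L)));
        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Tokyo"));
        System.out.println(formatNoTimeZone(new Date(0L)));
    }
}
```

Creating a formatter per call trades a little allocation for simplicity; whether that cost matters for image metadata extraction is not something this thread establishes.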
Re: [VOTE] Apache Tika 1.6 release candidate #1
On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
> This is not an issue that should block the release, I was careful not to vote with a minus one. I've become a bit impatient, but no one really blocks me from completing this pure documentation effort myself, I was hoping that someone would do it first :-).

Given that this is a documentation / website enhancement, I don't see any reason why we couldn't post the details for 1.6 (and even perhaps 1.5!) to the site in a few weeks' time, irrespective of when the 1.6 release goes out :)

Cheers
Nick
RE: [VOTE] Apache Tika 1.6 release candidate #1
On Mon, 28 Jul 2014, Allison, Timothy B. wrote:
> There was one regression: http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>     at java.lang.String.checkBounds(String.java:371)
>     at java.lang.String.<init>(String.java:415)
>     at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>     at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)

Any chance you could raise a POI bug for this? We're probably going to do the next POI beta release within a week, so if you hurry it might even get fixed in that... :)

Nick
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077578#comment-14077578 ]

Nick Burch commented on TIKA-1369:
---
Please send the pull request to the main GitHub repo - https://github.com/apache/tika/ - or post a patch here.

Please see the Contributing to Apache Tika page - http://tika.apache.org/contribute.html - for more on the various supported ways to build / test / contribute enhancements and fixes!
[GitHub] tika pull request: TIKA-1369 Resolve thread safety issue in ImageM...
GitHub user vilmospapp opened a pull request:

    https://github.com/apache/tika/pull/15

    TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor

    Hi,

    This fix tries to resolve TIKA-1369 by handling thread safety with ThreadLocal, which avoids adding other library dependencies. I have run the test cases and it seems correct to me, though I haven't found any other occurrence of ThreadLocal in Tika's source, so perhaps it's against your general patterns.

    Regards,
    Vilmos

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vilmospapp/tika TIKA-1369

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/15.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15

commit 3a9575fc56a6463b4378b14820e9079352bb1848
Author: Vilmos Papp papp.gyorgy.vil...@gmail.com
Date: 2014-07-23T09:18:50Z

    TIKA-1369 Make SimpleDateFormat usage thread safe
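For readers unfamiliar with the ThreadLocal approach the pull request describes, a minimal sketch follows. It is not the code from the pull request: the class name ThreadLocalDateFormat and the helper formatConcurrently are hypothetical, and the anonymous-subclass style is used only because the thread dates from the Java 6/7 era.

```java
import java.text.SimpleDateFormat;
import java.util.Collections;
import java.util.Date;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadLocalDateFormat {

    // One SimpleDateFormat per thread, created lazily on first use.
    private static final ThreadLocal<SimpleDateFormat> DATE_UNSPECIFIED_TZ =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
                }
            };

    static String format(Date date) {
        return DATE_UNSPECIFIED_TZ.get().format(date);
    }

    // Formats the same instant from several threads and returns the distinct results.
    static Set<String> formatConcurrently(final Date date, int threads, final int iterations) {
        final Set<String> results =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < iterations; i++) {
                        results.add(format(date));
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return results;
    }

    public static void main(String[] args) {
        // Every thread formats the same instant, so one distinct string is expected.
        System.out.println(formatConcurrently(new Date(0L), 8, 1000).size());
    }
}
```

Note that this sketch, like the formatter quoted in the issue, still uses the JVM default time zone; it addresses only the thread-safety half of TIKA-1369.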
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077584#comment-14077584 ]

ASF GitHub Bot commented on TIKA-1369:
---
GitHub user vilmospapp opened a pull request: https://github.com/apache/tika/pull/15 TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor
[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor
[ https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077587#comment-14077587 ]

Vilmos Papp commented on TIKA-1369:
---
Hi Nick,

Thanks for the quick answer. I prefer pull requests over attachments of patches.

Cheers,
Vilmos
Re: [VOTE] Apache Tika 1.6 release candidate #1
Hi

On 29/07/14 13:14, Nick Burch wrote:
> On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
>> This is not an issue that should block the release, I was careful not to vote with a minus one. I've become a bit impatient, but no one really blocks me from completing this pure documentation effort myself, I was hoping that someone would do it first :-).
>
> Given that this is a documentation / website enhancement, I don't see any reason why we couldn't post the details for 1.6 (and even perhaps 1.5!) to the site in a few weeks' time, irrespective of when the 1.6 release goes out :)

Yes, you are right,

Cheers, Sergey

> Cheers
> Nick
[jira] [Commented] (TIKA-1316) Old Site Code in Trunk
[ https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077647#comment-14077647 ]

Hudson commented on TIKA-1316:
---
SUCCESS: Integrated in tika-trunk-jdk1.6 #119 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/119/])
Remove unused src directory for TIKA-1316. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1614043)
* /tika/trunk/src

Old Site Code in Trunk
---
Key: TIKA-1316
URL: https://issues.apache.org/jira/browse/TIKA-1316
Project: Tika
Issue Type: Improvement
Components: general
Reporter: Tyler Palsulich
Assignee: Chris A. Mattmann
Priority: Trivial
Labels: easyfix
Fix For: 1.6
Original Estimate: 1h
Remaining Estimate: 1h

The {tika trunk}/src/site directory seems to be old and unused. It does not correspond to the site currently on tika.apache.org (http://svn.apache.org/repos/asf/tika/site/).
[jira] [Created] (TIKA-1378) MicrosoftTranslator setClient and setId NPE
Chris A. Mattmann created TIKA-1378:
---
Summary: MicrosoftTranslator setClient and setId NPE
Key: TIKA-1378
URL: https://issues.apache.org/jira/browse/TIKA-1378
Project: Tika
Issue Type: Bug
Components: translation
Environment: Discovered while using https://github.com/chrismattmann/tika-python and https://github.com/chrismattmann/etllib on DARPA XDATA.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.6

I introduced a bug in MicrosoftTranslator when I was checking for isAvailable in the #setClient and #setId methods; it produces an NPE when both aren't set. The Translator still works when auto-configured, just not when explicitly configured. I'll add a patch and unit test (thanks to [~tpalsulich] for the idea on the unit test).
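To illustrate the kind of failure described here, the sketch below shows a null-safe availability check on a class that can be configured one setter at a time. It is an assumption-laden illustration only: ExplicitConfigTranslator, its fields and its setters are hypothetical and do not reproduce the real MicrosoftTranslator code or the attached patch.

```java
public class ExplicitConfigTranslator {

    private String clientId;     // stays null until setId is called
    private String clientSecret; // stays null until setSecret is called

    public void setId(String id) {
        this.clientId = id;
    }

    public void setSecret(String secret) {
        this.clientSecret = secret;
    }

    // Null-safe availability check: each field is compared against null
    // before any method is invoked on it, so a half-configured instance
    // reports "not available" instead of throwing a NullPointerException.
    public boolean isAvailable() {
        return clientId != null && !clientId.isEmpty()
                && clientSecret != null && !clientSecret.isEmpty();
    }

    public static void main(String[] args) {
        ExplicitConfigTranslator translator = new ExplicitConfigTranslator();
        translator.setId("example-id");            // only one of the two values is set
        System.out.println(translator.isAvailable());
        translator.setSecret("example-secret");    // now both values are set
        System.out.println(translator.isAvailable());
    }
}
```

A unit test for this shape of bug would typically call each setter on a fresh instance and check that neither call throws.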
[jira] [Updated] (TIKA-1378) MicrosoftTranslator setClient and setId NPE
[ https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1378:
---
Attachment: TIKA-1378.Mattmann.072914.patch.txt

- added tests to expose NPE
- went ahead and cleaned up the MicrosoftTranslatorTest code
- removed System.err.println
- explicitly create MicrosoftTranslator instead of through the Tika facade
Review Request 24051: MicrosoftTranslator setClient and setId NPE
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/
---

Review request for tika.

Bugs: TIKA-1378
    https://issues.apache.org/jira/browse/TIKA-1378

Repository: tika

Description
---

I introduced a bug into MicrosoftTranslator that creates an NPE when explicitly configuring the translator via the setClientId and setSecret methods. Creating the translator and configuring implicitly with properties still works. This patch fixes the issue and exposes it via a test.

Diffs
---

  ./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java 1614159
  ./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java 1614159

Diff: https://reviews.apache.org/r/24051/diff/

Testing
---

Tested on DARPA XDATA and via https://github.com/chrismattmann/etllib and https://github.com/chrismattmann/tika-python. Also added unit test:

---
 T E S T S
---
Running org.apache.tika.language.translate.CachedTranslatorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
Running org.apache.tika.language.translate.GoogleTranslatorTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec
Running org.apache.tika.language.translate.MicrosoftTranslatorTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 8.556s
[INFO] Finished at: Tue Jul 29 09:05:20 EDT 2014
[INFO] Final Memory: 24M/194M
[INFO]
[chipotle:~/src/tika-translate] mattmann%

Thanks,

Chris Mattmann
[jira] [Commented] (TIKA-1378) MicrosoftTranslator setClient and setId NPE
[ https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077690#comment-14077690 ]

Chris A. Mattmann commented on TIKA-1378:
---
https://reviews.apache.org/r/24051/
Review Request 24052: Adds basic style support.
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24052/
---

Review request for tika.

Bugs: TIKA-1063
    https://issues.apache.org/jira/browse/TIKA-1063

Repository: tika

Description
---

Note, I have no idea how to add binary files to the diff (if at all possible). The testStyles.odt is supposed to go into the tika-parsers/src/test/resources/test-documents/ directory.

Diffs
---

  trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java 1614327
  trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java 1614327
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java 1614327

Diff: https://reviews.apache.org/r/24052/diff/

Testing
---

ODFParserTest.testODTStyles() added.

File Attachments
---

testStyles.odt
  https://reviews.apache.org/media/uploaded/files/2014/07/29/406503ff-2aef-4609-9955-d3a728402bd5__testStyles.odt

Thanks,

Axel Dörfler
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077873#comment-14077873 ]

Andrés Aguilar-Umaña commented on TIKA-1373:
---
In what version is this going to be released?

AutoDetectParser extracts no text when SourceCodeParser is selected
---
Key: TIKA-1373
URL: https://issues.apache.org/jira/browse/TIKA-1373
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

When using the AutoDetectParser in Java code, and the SourceCodeParser is selected (e.g. for Java files), the handler gets no text. I have this test program:

{code}
import java.io.ByteArrayInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Content type hint that leads to the SourceCodeParser being selected
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}

It returns (using the SourceCodeParser):

{code}
Text extracted:
{code}

But when I use this code:

{code}
// Same imports as in the first program
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
// Content type hint that leads to the text parser being selected
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}

The Text Parser is used and I get:

{code}
Text extracted: public class HelloWorld {}
{code}

I have also tested this command:

{code}
java -jar tika-app-1.5.jar -t D:\text.java
(no text)
{code}
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077885#comment-14077885 ]

Hong-Thai Nguyen commented on TIKA-1373:
---
Normally it will be in the next official 1.6 release, but you can try this release candidate: http://people.apache.org/~mattmann/apache-tika-1.6/rc1/
[jira] [Created] (TIKA-1379) error in Tika().detect for xml files with xades signature
Alessandro De Angelis created TIKA-1379:
---
Summary: error in Tika().detect for xml files with xades signature
Key: TIKA-1379
URL: https://issues.apache.org/jira/browse/TIKA-1379
Project: Tika
Issue Type: Bug
Affects Versions: 1.4
Reporter: Alessandro De Angelis

We tried to get the MIME type of an XML file with an embedded XAdES signature. The result is text/html and not the expected text/xml or application/xml. Here is an example of the XML file:

<VERBALI ad_cod="D69017" batch_id="0" cds_cod="D69" data_app="2013-09-23">
<VERBALE Id="1" tipologia="Verbale esame">
<VERB_NUM>00094853 0003 2</VERB_NUM>
<DATA_APP>2013-09-23</DATA_APP>
<DATA_ESA>2013-09-23</DATA_ESA>
<AD_COD>D69017</AD_COD>
<AD>FILOSOFIA DELLA SCIENZA</AD>
<CDS_COD>D69</CDS_COD>
<CDS>TEATRO E ARTI VISIVE</CDS>
<TIPO_ESA></TIPO_ESA>
<MAT>1233456</MAT>
<NOME>PAOLINO</NOME>
<COGNOME>PAPERINO</COGNOME>
<VOTO>23.0</VOTO>
<VOTODECOD>23</VOTODECOD>
<CAUSALE></CAUSALE>
<TIPO_MODULO></TIPO_MODULO>
<IMG_PATH></IMG_PATH>
<AA_SES_ID>2012</AA_SES_ID>
<AD_CFU>6.0</AD_CFU>
<NOTA></NOTA>
<ATENEO>9</ATENEO>
<ATENEO_DES>جامعة البندقية - TEST</ATENEO_DES>
<TIPO_DOCUMENTO>Verbale_3</TIPO_DOCUMENTO>
<TITOLARE_PROCEDIMENTO>QUI QUO QUA</TITOLARE_PROCEDIMENTO>
<AD_STU_COD>D69017</AD_STU_COD>
<AD_STU>FILOSOFIA DELLA SCIENZA</AD_STU>
<CDS_STU_COD>D69</CDS_STU_COD>
<CDS_STU>TEATRO E ARTI VISIVE</CDS_STU>
<DOCENTE>QUI QUO QUA</DOCENTE>
<DATA_DOCUMENTO>26-09-2013 09:55:53 CEST(+0200)</DATA_DOCUMENTO>
<SOFTWARE_DI_CREAZIONE>
<NOME>3</NOME>
<VERSIONE>11.09.03</VERSIONE>
</SOFTWARE_DI_CREAZIONE>
</VERBALE><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="sig08744308748201048377">
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
<ds:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
<ds:Reference URI="">
<ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2002/06/xmldsig-filter2">
<dsig-xpath:XPath xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" Filter="subtract">/descendant::ds:Signature</dsig-xpath:XPath>
</ds:Transform>
<ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
<xsl:stylesheet xmlns:kion="http://www.kion.it/webesse3/multilingua" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="kion" version="1.0">
<kion:ml module="FirmaDigitale" target="kion"></kion:ml>
<xsl:output method="xml"></xsl:output>
<xsl:variable name="mostra_ad_figlie" select="1"></xsl:variable>
<xsl:variable name="verbale_root" select="/VERBALI/VERBALE"></xsl:variable>
<xsl:variable name="sostituzione_root" select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO"></xsl:variable>
<xsl:variable name="RAGG_ROOT" select="/VERBALI/VERBALE/RAGGRUPPAMENTO"></xsl:variable>
<xsl:variable name="COMM_ROOT" select="/VERBALI/VERBALE/COMMISSIONE"></xsl:variable>
<xsl:template match="/">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type"></meta>
<xsl:choose>
<xsl:when test="$sostituzione_root">
<title>Dichiarazione conformità Verbale Esame</title>
</xsl:when>
<xsl:otherwise>
<title>Verbalizzazione esame</title>
</xsl:otherwise>
</xsl:choose>
<style type="text/css"> td {font-family: Arial; font-size:10pt;} div {font-family: Arial; font-size:10pt;} pre {font-family: Arial; font-size:10pt;} </style>
</head>
<body>
<table>
<xsl:choose>
<xsl:when test="$sostituzione_root">
<tr><td align="center" colspan="2"><big><strong><xsl:value-of select="$verbale_root/ATENEO_DES"></xsl:value-of></strong></big><br></br></td></tr>
<tr><td align="center" colspan="2"><big><strong>DICHIARAZIONE DI CONFORMITÀ</strong></big><br></br></td></tr>
<tr><td align="left" colspan="2"><strong>Il sottoscritto <xsl:value-of select="$verbale_root/TITOLARE_PROCEDIMENTO"></xsl:value-of>, docente di <xsl:value-of select="$verbale_root/AD"></xsl:value-of></strong><br></br>
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078028#comment-14078028 ]

Andrés Aguilar-Umaña commented on TIKA-1373:
---
Great! Thank you!
Re: Review Request 24051: MicrosoftTranslator setClient and setId NPE
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/#review49024
---

./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
https://reviews.apache.org/r/24051/#comment85858

Should add a test for Default Translator. Separate issue.

./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
https://reviews.apache.org/r/24051/#comment85857

Add in a check right here that translator.isAvailable() is false?

- Tyler Palsulich

On July 29, 2014, 1:09 p.m., Chris Mattmann wrote:

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/
---

(Updated July 29, 2014, 1:09 p.m.)

Review request for tika.

Bugs: TIKA-1378
https://issues.apache.org/jira/browse/TIKA-1378

Repository: tika

Description
---
I introduced a bug into MicrosoftTranslator that creates an NPE when explicitly configuring the translator via the setClientId and setSecret methods. Creating the translator and configuring implicitly with properties still works. This patch fixes the issue and exposes it via a test.

Diffs
---
./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java 1614159
./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java 1614159

Diff: https://reviews.apache.org/r/24051/diff/

Testing
---
Tested on DARPA XDATA and via https://github.com/chrismattmann/etllib and https://github.com/chrismattmann/tika-python.

Also added unit test:

---
 T E S T S
---
Running org.apache.tika.language.translate.CachedTranslatorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
Running org.apache.tika.language.translate.GoogleTranslatorTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec
Running org.apache.tika.language.translate.MicrosoftTranslatorTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 8.556s
[INFO] Finished at: Tue Jul 29 09:05:20 EDT 2014
[INFO] Final Memory: 24M/194M
[INFO]
[chipotle:~/src/tika-translate] mattmann%

Thanks,
Chris Mattmann
Re: Review Request 24051: MicrosoftTranslator setClient and setId NPE
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/#review49025
---

Ship it!

- Tyler Palsulich
[jira] [Commented] (TIKA-1378) MicrosoftTranslator setClient and setId NPE
[ https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078633#comment-14078633 ]

Hudson commented on TIKA-1378:
---

SUCCESS: Integrated in tika-trunk-jdk1.7 #129 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/129/])
- TIKA-1378: MicrosoftTranslator setClient and setId NPE (thanks to tpalsulich for the review!) (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1614488)
* /tika/trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
* /tika/trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java

MicrosoftTranslator setClient and setId NPE
---
Key: TIKA-1378
URL: https://issues.apache.org/jira/browse/TIKA-1378
Project: Tika
Issue Type: Bug
Components: translation
Environment: Discovered while using https://github.com/chrismattmann/tika-python and https://github.com/chrismattmann/etllib on DARPA XDATA.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.6
Attachments: TIKA-1378.Mattmann.072914.patch.txt

I introduced a bug in MicrosoftTranslator when I was checking for isAvailable in the #setClient and #setId methods that produces an NPE when both aren't set. The Translator still works when auto-configured, just not when explicitly configured. I'll add a patch and unit test (thanks to [~tpalsulich] for the idea on the unit test).
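
For readers unfamiliar with this class of bug, the failure mode described in TIKA-1378 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual MicrosoftTranslator source or the committed patch: the class name TranslatorCredentialsSketch, the PLACEHOLDER constant, and the unsafeCheck/safeCheck helpers are all hypothetical. The idea is that an availability check run from inside a setter can dereference the other credential while it is still null, and that a null-safe check avoids this.

```java
// Hypothetical sketch of the bug pattern; NOT the real Tika MicrosoftTranslator code.
class TranslatorCredentialsSketch {
    // Assumed marker meaning "credential not configured" (invented for this example).
    private static final String PLACEHOLDER = "dummy-value";

    private String clientId;      // stays null until set explicitly
    private String clientSecret;  // stays null until set explicitly
    private boolean available;

    // Buggy variant: calls equals() on values that may still be null.
    static boolean unsafeCheck(String id, String secret) {
        return !id.equals(PLACEHOLDER) && !secret.equals(PLACEHOLDER);
    }

    // Null-safe variant: null checks first, and equals() is called on the constant.
    static boolean safeCheck(String id, String secret) {
        return id != null && secret != null
                && !PLACEHOLDER.equals(id) && !PLACEHOLDER.equals(secret);
    }

    public void setClientId(String id) {
        this.clientId = id;
        // The secret may not have been set yet, so the check must tolerate null.
        this.available = safeCheck(clientId, clientSecret);
    }

    public void setClientSecret(String secret) {
        this.clientSecret = secret;
        this.available = safeCheck(clientId, clientSecret);
    }

    public boolean isAvailable() {
        return available;
    }

    public static void main(String[] args) {
        TranslatorCredentialsSketch t = new TranslatorCredentialsSketch();
        t.setClientId("my-id");               // only one credential set: must not throw
        System.out.println(t.isAvailable());  // false
        t.setClientSecret("my-secret");
        System.out.println(t.isAvailable());  // true
        try {
            unsafeCheck("my-id", null);       // mimics a setter running the buggy check
        } catch (NullPointerException e) {
            System.out.println("unsafe check threw NPE");
        }
    }
}
```

The first two lines of main also mirror the reviewer's suggestion of asserting that isAvailable() is false while the translator is only partially configured.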