[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread Vilmos Papp (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077561#comment-14077561
 ] 

Vilmos Papp commented on TIKA-1369:
---

Hi,

I've sent a pull request on GitHub to fix this: 
https://github.com/chrismattmann/tika/pull/1. I hope I sent it to the proper 
person; if not, where should I send it?

Regards,
Vilmos

 Date parsing and thread safety in ImageMetadataExtractor
 

 Key: TIKA-1369
 URL: https://issues.apache.org/jira/browse/TIKA-1369
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: OS X 10.9.4 Java 7_60
Reporter: John Gibson
Priority: Critical

 The {{ImageMetadataExtractor}} uses a static instance of 
 {{SimpleDateFormat}}, which is not thread safe.
 {code:title=ImageMetadataExtractor.java}
 static class ExifHandler implements DirectoryHandler {
     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
     ...
     public void handleDateTags(Directory directory, Metadata metadata)
             throws MetadataException {
         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
         Date original = null;
         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
             // Unless we have GPS time we don't know the time zone so date must be set
             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
             if (original != null) {
                 // Same time zone as Metadata Extractor uses
                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
             }
         }
 ...
 {code}
 This is not the first time that SDF has caused problems: see TIKA-495 and 
 TIKA-864. In the discussion there, the idea of using alternative thread-safe 
 (and faster) formatters from either Joda-Time or Commons Lang was dismissed 
 because they would add too many dependencies. Given that Tika already has a 
 fairly large laundry list of dependencies to parse content, adding one more 
 JAR to make sure things don't break is probably a good idea.
 In addition, because neither Tika's formatter nor the call to 
 com.drew.metadata.Directory specifies a timezone or locale, this can wreak 
 havoc during randomized testing. Given that the timezone is unknown, why not 
 just default it to UTC and let the caller guess the timezone? As it stands I 
 have to reparse all of the dates into UTC to get stable behavior across 
 timezones.
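
 The UTC suggestion above can be illustrated with a minimal sketch. It assumes 
 a hypothetical helper class (UtcDateFormatSketch and formatUtc are 
 illustrative names, not part of Tika) and creates a formatter per call, so no 
 instance is shared between threads and the output no longer depends on the 
 JVM's default timezone or locale:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class UtcDateFormatSketch {
    // Hypothetical helper, not Tika code: builds a fresh formatter on every call,
    // pinned to UTC and Locale.ROOT so the result is independent of JVM defaults.
    static String formatUtc(Date date) {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // The epoch formatted in UTC, regardless of the machine's timezone.
        System.out.println(formatUtc(new Date(0L))); // prints 1970-01-01T00:00:00
    }
}
```

 Creating a formatter per call trades some allocation for simplicity; whether 
 that cost matters for Tika's workloads is not established in this thread.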



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch

On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
> This is not an issue that should block the release, I was careful not to 
> vote with a minus one. I've become a bit impatient, but no one really 
> blocks me from completing this pure documentation effort myself, I was 
> hoping that someone would do it first :-).


Given that this is a documentation / website enhancement, I don't see any 
reason why we couldn't post the details for 1.6 (and even perhaps 1.5!) to 
the site in a few weeks time, irrespective of when the 1.6 release goes 
out :)


Cheers
Nick


RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch

On Mon, 28 Jul 2014, Allison, Timothy B. wrote:

> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>     at java.lang.String.checkBounds(String.java:371)
>     at java.lang.String.<init>(String.java:415)
>     at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>     at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)


Any chance you could raise a POI bug for this? We're probably going to do 
the next POI beta release within a week, so if you hurry it might even get 
fixed in that... :)


Nick
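
For readers following the stack trace above: the negative index suggests that 
a negative or overflowed offset or length reached the String constructor, 
possibly from a corrupt size field in the embedded object. The following is 
only a sketch of a defensive bounds check under that assumption. The class and 
method names (BoundedStringReadSketch, readLatin1) are hypothetical and this 
is not POI's code:

```java
import java.nio.charset.StandardCharsets;

final class BoundedStringReadSketch {
    // Hypothetical helper (not POI's API): decodes `length` Latin-1 bytes starting
    // at `offset`, rejecting bounds that a corrupt size field could produce.
    static String readLatin1(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || length > data.length - offset) {
            throw new IllegalArgumentException(
                    "Invalid string bounds: offset=" + offset + ", length=" + length
                            + ", available=" + data.length);
        }
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readLatin1(data, 0, 5));
        try {
            // Reuses the negative value reported in the trace as a bad length.
            readLatin1(data, 0, -369073454);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Validating before decoding turns an opaque StringIndexOutOfBoundsException 
into an error that names the offending values.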


[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077578#comment-14077578
 ] 

Nick Burch commented on TIKA-1369:
--

Please send the pull request to the main github repo - 
https://github.com/apache/tika/ - or post a patch here

Please see the Contributing to Apache Tika page - 
http://tika.apache.org/contribute.html - for more on the various supported ways 
to build / test / contribute enhancements and fixes!



[GitHub] tika pull request: TIKA-1369 Resolve thread safety issue in ImageM...

2014-07-29 Thread vilmospapp
GitHub user vilmospapp opened a pull request:

https://github.com/apache/tika/pull/15

TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor 

Hi,

This fix tries to resolve TIKA-1369 by handling thread safety with a 
ThreadLocal, which avoids adding other library dependencies.

I have run the test cases and the change seems correct to me. However, I 
haven't found any other use of ThreadLocal in Tika's source, so perhaps it 
goes against your general patterns.

Regards,
Vilmos

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vilmospapp/tika TIKA-1369

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/15.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15


commit 3a9575fc56a6463b4378b14820e9079352bb1848
Author: Vilmos Papp papp.gyorgy.vil...@gmail.com
Date:   2014-07-23T09:18:50Z

TIKA-1369 Make SimpleDateFormat usage thread safe
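
For context, the ThreadLocal approach named in the pull request can be 
sketched roughly as follows. This is an illustration only, not the contents of 
the actual patch. The class name ThreadLocalDateFormatSketch and the format 
helper are assumptions, and the pattern string is the ISO 8601 form without a 
timezone suffix that the issue describes:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

final class ThreadLocalDateFormatSketch {
    // Each thread lazily receives its own SimpleDateFormat instance,
    // so no formatter is ever shared between threads.
    private static final ThreadLocal<SimpleDateFormat> DATE_UNSPECIFIED_TZ =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
                }
            };

    // Hypothetical helper: formats with the calling thread's own formatter.
    static String format(Date date) {
        return DATE_UNSPECIFIED_TZ.get().format(date);
    }

    public static void main(String[] args) {
        // Output depends on the JVM's default timezone, as in the original code.
        System.out.println(format(new Date()));
    }
}
```

This keeps the JDK-only dependency footprint, but it does not by itself 
address the default timezone and locale concern raised in the issue.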




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077584#comment-14077584
 ] 

ASF GitHub Bot commented on TIKA-1369:
--

GitHub user vilmospapp opened a pull request:

https://github.com/apache/tika/pull/15

TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor 



[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread Vilmos Papp (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077587#comment-14077587
 ] 

Vilmos Papp commented on TIKA-1369:
---

Hi Nick,

Thanks for the quick answer. I prefer pull requests over patch attachments.

Cheers,
Vilmos



Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Sergey Beryozkin

Hi
On 29/07/14 13:14, Nick Burch wrote:
> On Mon, 28 Jul 2014, Sergey Beryozkin wrote:
>> This is not an issue that should block the release, I was careful not
>> to vote with a minus one. I've become a bit impatient, but no one
>> really blocks me from completing this pure documentation effort
>> myself, I was hoping that someone would do it first :-).
>
> Given that this is a documentation / website enhancement, I don't see
> any reason why we couldn't post the details for 1.6 (and even perhaps
> 1.5!) to the site in a few weeks time, irrespective of when the 1.6
> release goes out :)

Yes, you are right,

Cheers, Sergey

> Cheers
> Nick





[jira] [Commented] (TIKA-1316) Old Site Code in Trunk

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077647#comment-14077647
 ] 

Hudson commented on TIKA-1316:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #119 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/119/])
Remove unused src directory for TIKA-1316. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1614043)
* /tika/trunk/src


 Old Site Code in Trunk
 --

 Key: TIKA-1316
 URL: https://issues.apache.org/jira/browse/TIKA-1316
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Tyler Palsulich
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: easyfix
 Fix For: 1.6

   Original Estimate: 1h
  Remaining Estimate: 1h

 The \{tika trunk\}/src/site directory seems to be old and unused. It does not 
 correspond to the site currently on tika.apache.org 
 (http://svn.apache.org/repos/asf/tika/site/).





[jira] [Created] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created TIKA-1378:
---

 Summary: MicrosoftTranslator setClient and setId NPE
 Key: TIKA-1378
 URL: https://issues.apache.org/jira/browse/TIKA-1378
 Project: Tika
  Issue Type: Bug
  Components: translation
 Environment: Discovered while using 
https://github.com/chrismattmann/tika-python and 
https://github.com/chrismattmann/etllib on DARPA XDATA.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6


I introduced a bug in MicrosoftTranslator when I was checking for isAvailable 
in the #setClient and #setId methods: it produces an NPE when both aren't 
set. The Translator still works when auto-configured, just not when explicitly 
configured.

I'll add a patch and unit test (thanks to [~tpalsulich] for the idea for the 
unit test).
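
As an illustration of the kind of failure described above, here is a sketch 
of setters that recompute availability with null checks first, so calling one 
setter before the other cannot throw. The class, field and method names below 
(TranslatorConfigSketch, clientId, clientSecret, setSecret) are hypothetical 
and do not reproduce MicrosoftTranslator's actual code or the attached patch:

```java
final class TranslatorConfigSketch {
    // Hypothetical fields: both stay null until explicitly configured.
    private String clientId;
    private String clientSecret;
    private boolean available;

    public void setId(String id) {
        this.clientId = id;
        updateAvailability();
    }

    public void setSecret(String secret) {
        this.clientSecret = secret;
        updateAvailability();
    }

    // The null checks short-circuit before any method is called on a field,
    // so a half-configured instance is simply reported as unavailable.
    private void updateAvailability() {
        available = clientId != null && !clientId.isEmpty()
                && clientSecret != null && !clientSecret.isEmpty();
    }

    public boolean isAvailable() {
        return available;
    }

    public static void main(String[] args) {
        TranslatorConfigSketch config = new TranslatorConfigSketch();
        config.setId("example-id");         // only one value set: no exception
        System.out.println(config.isAvailable());
        config.setSecret("example-secret"); // both set: now available
        System.out.println(config.isAvailable());
    }
}
```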





[jira] [Updated] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1378:


Attachment: TIKA-1378.Mattmann.072914.patch.txt

- added tests to expose NPE
- went ahead and cleaned up the MicrosoftTranslatorTest code 
  - removed System.err.println
  - explicitly create MicrosoftTranslator instead of through the Tika facade



Review Request 24051: MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/
---

Review request for tika.


Bugs: TIKA-1378
https://issues.apache.org/jira/browse/TIKA-1378


Repository: tika


Description
---

I introduced a bug into MicrosoftTranslator that creates an NPE when explicitly 
configuring the translator via the setClientId and setSecret methods. Creating 
the translator and configuring implicitly with properties still works. This 
patch fixes the issue and exposes it via a test.


Diffs
-

  
./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
 1614159 
  
./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
 1614159 

Diff: https://reviews.apache.org/r/24051/diff/


Testing
---

Tested on DARPA XDATA and via https://github.com/chrismattmann/etllib and 
https://github.com/chrismattmann/tika-python.
Also added unit test:

---
 T E S T S
---
Running org.apache.tika.language.translate.CachedTranslatorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
Running org.apache.tika.language.translate.GoogleTranslatorTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec
Running org.apache.tika.language.translate.MicrosoftTranslatorTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 8.556s
[INFO] Finished at: Tue Jul 29 09:05:20 EDT 2014
[INFO] Final Memory: 24M/194M
[INFO] 
[chipotle:~/src/tika-translate] mattmann% 


Thanks,

Chris Mattmann



[jira] [Commented] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077690#comment-14077690
 ] 

Chris A. Mattmann commented on TIKA-1378:
-

https://reviews.apache.org/r/24051/



Review Request 24052: Adds basic style support.

2014-07-29 Thread Axel Dörfler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24052/
---

Review request for tika.


Bugs: TIKA-1063
https://issues.apache.org/jira/browse/TIKA-1063


Repository: tika


Description
---

Note, I have no idea how to add binary files to the diff (if at all possible). 
The testStyles.odt is supposed to go into the 
tika-parsers/src/test/resources/test-documents/ directory.


Diffs
-

  
trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
 1614327 
  
trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
 1614327 
  
trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java 
1614327 

Diff: https://reviews.apache.org/r/24052/diff/


Testing
---

ODFParserTest.testODTStyles() added.


File Attachments


testStyles.odt
  
https://reviews.apache.org/media/uploaded/files/2014/07/29/406503ff-2aef-4609-9955-d3a728402bd5__testStyles.odt


Thanks,

Axel Dörfler



[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077873#comment-14077873
 ] 

Andrés Aguilar-Umaña commented on TIKA-1373:


In what version is this going to be released?

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in Java code, and the SourceCodeParser is 
 selected (e.g. for Java files), the handler gets no text.
 I have this test program:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 It returns (using the SourceCodeParser): 
 {code}  Text extracted: {code}
 But when I use this code:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 The Text Parser is used and I get:
 {code}  Text extracted: public class HelloWorld {} {code}
 I have also tested this command: 
 {code}
  java -jar tika-app-1.5.jar -t D:\text.java
   (no text)
 {code}





[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077885#comment-14077885
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Normally it will be in the next official release, 1.6, but you can try this 
release candidate: http://people.apache.org/~mattmann/apache-tika-1.6/rc1/



[jira] [Created] (TIKA-1379) error in Tika().detect for xml files with xades signature

2014-07-29 Thread Alessandro De Angelis (JIRA)
Alessandro De Angelis created TIKA-1379:
---

 Summary: error in Tika().detect for xml files with xades signature
 Key: TIKA-1379
 URL: https://issues.apache.org/jira/browse/TIKA-1379
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Alessandro De Angelis


We tried to get the MIME type of an XML file with an embedded XAdES signature. 
The result is text/html and not the expected text/xml or application/xml.

Here is an example of the XML file:

<VERBALI ad_cod="D69017" batch_id="0" cds_cod="D69" data_app="2013-09-23">
<VERBALE Id="1" tipologia="Verbale esame">
<VERB_NUM>00094853 0003 2</VERB_NUM>
<DATA_APP>2013-09-23</DATA_APP>
<DATA_ESA>2013-09-23</DATA_ESA>
<AD_COD>D69017</AD_COD>
<AD>FILOSOFIA DELLA SCIENZA</AD>
<CDS_COD>D69</CDS_COD>
<CDS>TEATRO E ARTI VISIVE</CDS>
<TIPO_ESA></TIPO_ESA>
<MAT>1233456</MAT>
<NOME>PAOLINO</NOME>
<COGNOME>PAPERINO</COGNOME>
<VOTO>23.0</VOTO>
<VOTODECOD>23</VOTODECOD>
<CAUSALE></CAUSALE>
<TIPO_MODULO></TIPO_MODULO>
<IMG_PATH></IMG_PATH>
<AA_SES_ID>2012</AA_SES_ID>
<AD_CFU>6.0</AD_CFU>
<NOTA></NOTA>
<ATENEO>9</ATENEO>
<ATENEO_DES>جامعة البندقية - TEST</ATENEO_DES>
<TIPO_DOCUMENTO>Verbale_3</TIPO_DOCUMENTO>
<TITOLARE_PROCEDIMENTO>QUI QUO QUA</TITOLARE_PROCEDIMENTO>
<AD_STU_COD>D69017</AD_STU_COD>
<AD_STU>FILOSOFIA DELLA SCIENZA</AD_STU>
<CDS_STU_COD>D69</CDS_STU_COD>
<CDS_STU>TEATRO E ARTI VISIVE</CDS_STU>
<DOCENTE>QUI QUO QUA</DOCENTE>
<DATA_DOCUMENTO>26-09-2013 09:55:53 CEST(+0200)</DATA_DOCUMENTO>
<SOFTWARE_DI_CREAZIONE>
<NOME>3</NOME>
<VERSIONE>11.09.03</VERSIONE>
</SOFTWARE_DI_CREAZIONE>
</VERBALE><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="sig08744308748201048377">
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
<ds:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
<ds:Reference URI="">
<ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2002/06/xmldsig-filter2">
<dsig-xpath:XPath xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" Filter="subtract">/descendant::ds:Signature</dsig-xpath:XPath>
</ds:Transform>
<ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
<xsl:stylesheet xmlns:kion="http://www.kion.it/webesse3/multilingua" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="kion" version="1.0">
<kion:ml module="FirmaDigitale" target="kion"></kion:ml>
<xsl:output method="xml"></xsl:output>

<xsl:variable name="mostra_ad_figlie" select="1"></xsl:variable>
<xsl:variable name="verbale_root" select="/VERBALI/VERBALE"></xsl:variable>
<xsl:variable name="sostituzione_root" select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO"></xsl:variable>
<xsl:variable name="RAGG_ROOT" select="/VERBALI/VERBALE/RAGGRUPPAMENTO"></xsl:variable>
<xsl:variable name="COMM_ROOT" select="/VERBALI/VERBALE/COMMISSIONE"></xsl:variable>

<xsl:template match="/">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type"></meta>
<xsl:choose>
<xsl:when test="$sostituzione_root">
<title>Dichiarazione conformità Verbale Esame</title>
</xsl:when>
<xsl:otherwise>
<title>Verbalizzazione esame</title>
</xsl:otherwise>
</xsl:choose>
<style type="text/css">
 td  {font-family: Arial; font-size:10pt;} 
 div {font-family: Arial; font-size:10pt;}
 pre {font-family: Arial; font-size:10pt;} 
</style>
</head>
<body>
<table>
<xsl:choose>
<xsl:when test="$sostituzione_root">
<tr><td align="center" colspan="2"><big><strong><xsl:value-of select="$verbale_root/ATENEO_DES"></xsl:value-of></strong></big><br></br></td></tr>
<tr><td align="center" colspan="2"><big><strong>DICHIARAZIONE DI CONFORMITÀ</strong></big><br></br></td></tr>
<tr><td align="left" colspan="2"><strong>Il sottoscritto <xsl:value-of select="$verbale_root/TITOLARE_PROCEDIMENTO"></xsl:value-of>, docente di <xsl:value-of select="$verbale_root/AD"></xsl:value-of></strong><br></br>

[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078028#comment-14078028
 ] 

Andrés Aguilar-Umaña commented on TIKA-1373:


Great! thank you!

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in Java code and the SourceCodeParser is 
 selected (e.g. for Java files), the handler gets no text.
 I have this test program:
 {code}
 // Input: a minimal Java source file held in memory
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 // Write limit of 50 characters
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 // Content type hint that leads to the SourceCodeParser being selected
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 It returns (using the SourceCodeParser): 
 {code}  Text extracted: {code}
 But when I use this code:
 {code}
 // Same input, but declared as plain text
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 The Text Parser is used and I get:
 {code}  Text extracted: public class HelloWorld {} {code}
 I have also tested this command: 
 {code}
  java -jar tika-app-1.5.jar -t D:\text.java
   (no text)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Review Request 24051: MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Tyler Palsulich

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/#review49024
---



./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
https://reviews.apache.org/r/24051/#comment85858

Should add a test for Default Translator. Separate issue.



./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
https://reviews.apache.org/r/24051/#comment85857

Add in a check right here that translator.isAvailable() is false?


- Tyler Palsulich


On July 29, 2014, 1:09 p.m., Chris Mattmann wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/24051/
 ---
 
 (Updated July 29, 2014, 1:09 p.m.)
 
 
 Review request for tika.
 
 
 Bugs: TIKA-1378
 https://issues.apache.org/jira/browse/TIKA-1378
 
 
 Repository: tika
 
 
 Description
 ---
 
 I introduced a bug into MicrosoftTranslator that creates an NPE when 
 explicitly configuring the translator via the setClientId and setSecret 
 methods. Creating the translator and configuring implicitly with properties 
 still works. This patch fixes the issue and exposes it via a test.
 
 
 Diffs
 -
 
   
 ./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
  1614159 
   
 ./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
  1614159 
 
 Diff: https://reviews.apache.org/r/24051/diff/
 
 
 Testing
 ---
 
 Tested on DARPA XDATA and via https://github.com/chrismattmann/etllib and 
 https://github.com/chrismattmann/tika-python.
 Also added unit test:
 
 ---
  T E S T S
 ---
 Running org.apache.tika.language.translate.CachedTranslatorTest
 Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
 Running org.apache.tika.language.translate.GoogleTranslatorTest
 Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec
 Running org.apache.tika.language.translate.MicrosoftTranslatorTest
 Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
 
 Results :
 
 Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
 
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 8.556s
 [INFO] Finished at: Tue Jul 29 09:05:20 EDT 2014
 [INFO] Final Memory: 24M/194M
 [INFO] 
 
 [chipotle:~/src/tika-translate] mattmann% 
 
 
 Thanks,
 
 Chris Mattmann
 




Re: Review Request 24051: MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Tyler Palsulich

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/#review49025
---

Ship it!


- Tyler Palsulich






[jira] [Commented] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078633#comment-14078633
 ] 

Hudson commented on TIKA-1378:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #129 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/129/])
- TIKA-1378: MicrosoftTranslator setClient and setId NPE (thanks to tpalsulich 
for the review!) (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1614488)
* 
/tika/trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
* 
/tika/trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java


 MicrosoftTranslator setClient and setId NPE
 ---

 Key: TIKA-1378
 URL: https://issues.apache.org/jira/browse/TIKA-1378
 Project: Tika
  Issue Type: Bug
  Components: translation
 Environment: Discovered while using 
 https://github.com/chrismattmann/tika-python and 
 https://github.com/chrismattmann/etllib on DARPA XDATA.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6

 Attachments: TIKA-1378.Mattmann.072914.patch.txt


 I introduced a bug in MicrosoftTranslator when I was checking for isAvailable 
 in the #setClient and #setId methods that produces an NPE when both aren't 
 set. The Translator still works when auto configured, just not when 
 explicitly configured.
 I'll add a patch and unit test. (thanks to [~tpalsulich] for the idea on the 
 unit test).
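
 The failure mode described above can be sketched in isolation. This is not 
 the MicrosoftTranslator source or the attached patch; SetterNpeSketch, 
 SketchTranslator, setIdBuggy and isConfigured are hypothetical names, and the 
 sketch only assumes that each setter recomputes availability from two 
 credential fields, one of which may still be null when the first setter runs.

```java
public class SetterNpeSketch {

    /** Hypothetical stand-in for a translator that needs two credentials. */
    public static class SketchTranslator {
        private String clientId;      // null until explicitly configured
        private String clientSecret;  // null until explicitly configured
        private boolean available;

        // Buggy variant: recomputes availability by dereferencing both fields,
        // so it throws if the other credential has not been set yet.
        public void setIdBuggy(String id) {
            this.clientId = id;
            this.available = !clientId.isEmpty() && !clientSecret.isEmpty();
        }

        // Null-safe variant: a missing credential means "not available".
        public void setId(String id) {
            this.clientId = id;
            this.available = isConfigured(clientId) && isConfigured(clientSecret);
        }

        public void setSecret(String secret) {
            this.clientSecret = secret;
            this.available = isConfigured(clientId) && isConfigured(clientSecret);
        }

        public boolean isAvailable() {
            return available;
        }

        private static boolean isConfigured(String value) {
            return value != null && !value.isEmpty();
        }
    }

    public static void main(String[] args) {
        SketchTranslator buggy = new SketchTranslator();
        try {
            buggy.setIdBuggy("my-id"); // the secret is still null here
        } catch (NullPointerException e) {
            System.out.println("buggy setter threw NullPointerException");
        }

        SketchTranslator fixed = new SketchTranslator();
        fixed.setId("my-id");
        System.out.println("available after id only: " + fixed.isAvailable());
        fixed.setSecret("my-secret");
        System.out.println("available after both: " + fixed.isAvailable());
    }
}
```

 A unit test along these lines would call the setters one at a time on a 
 fresh instance and check that isAvailable() stays false until both values 
 are present, which is also the check suggested in the review.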



--
This message was sent by Atlassian JIRA
(v6.2#6252)