[jira] [Commented] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078633#comment-14078633
 ] 

Hudson commented on TIKA-1378:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #129 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/129/])
- TIKA-1378: MicrosoftTranslator setClient and setId NPE (thanks to tpalsulich 
for the review!) (mattmann: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1614488)
* 
/tika/trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
* 
/tika/trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java


> MicrosoftTranslator setClient and setId NPE
> ---
>
> Key: TIKA-1378
> URL: https://issues.apache.org/jira/browse/TIKA-1378
> Project: Tika
>  Issue Type: Bug
>  Components: translation
> Environment: Discovered while using 
> https://github.com/chrismattmann/tika-python and 
> https://github.com/chrismattmann/etllib on DARPA XDATA.
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.6
>
> Attachments: TIKA-1378.Mattmann.072914.patch.txt
>
>
> I introduced a bug in MicrosoftTranslator when I was checking for isAvailable 
> in the #setClient and #setId methods that produces an NPE when both aren't 
> set. The Translator still works when auto configured, just not when 
> explicitly configured.
> I'll add a patch and unit test. (thanks to [~tpalsulich] for the idea on the 
> unit test).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Review Request 24051: MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Tyler Palsulich

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/#review49025
---

Ship it!


- Tyler Palsulich


On July 29, 2014, 1:09 p.m., Chris Mattmann wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24051/
> ---
> 
> (Updated July 29, 2014, 1:09 p.m.)
> 
> 
> Review request for tika.
> 
> 
> Bugs: TIKA-1378
> https://issues.apache.org/jira/browse/TIKA-1378
> 
> 
> Repository: tika
> 
> 
> Description
> ---
> 
> I introduced a bug into MicrosoftTranslator that creates an NPE when 
> explicitly configuring the translator via the setClientId and setSecret 
> methods. Creating the translator and configuring implicitly with properties 
> still works. This patch fixes the issue and exposes it via a test.
> 
> 
> Diffs
> -
> 
>   
> ./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
>  1614159 
>   
> ./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
>  1614159 
> 
> Diff: https://reviews.apache.org/r/24051/diff/
> 
> 
> Testing
> ---
> 
> Tested on DARPA XDATA and via https://github.com/chrismattmann/etllib and 
> https://github.com/chrismattmann/tika-python.
> Also added unit test:
> 
> ---
>  T E S T S
> ---
> Running org.apache.tika.language.translate.CachedTranslatorTest
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
> Running org.apache.tika.language.translate.GoogleTranslatorTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec
> Running org.apache.tika.language.translate.MicrosoftTranslatorTest
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
> 
> Results :
> 
> Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
> 
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time: 8.556s
> [INFO] Finished at: Tue Jul 29 09:05:20 EDT 2014
> [INFO] Final Memory: 24M/194M
> [INFO] 
> 
> [chipotle:~/src/tika-translate] mattmann% 
> 
> 
> Thanks,
> 
> Chris Mattmann
> 
>



Re: Review Request 24051: MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Tyler Palsulich

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/#review49024
---



./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java


Should add a test for Default Translator. Separate issue.



./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java


Add in a check right here that translator.isAvailable() is false?


- Tyler Palsulich





[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078028#comment-14078028
 ] 

Andrés Aguilar-Umaña commented on TIKA-1373:


Great! Thank you!

> AutoDetectParser extracts no text when SourceCodeParser is selected
> ---
>
> Key: TIKA-1373
> URL: https://issues.apache.org/jira/browse/TIKA-1373
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Andrés Aguilar-Umaña
>
> When using the AutoDetectParser in Java code, and the SourceCodeParser is 
> selected (e.g. for Java files), the handler gets no text.
> I have this test program:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> It returns (using the SourceCodeParser): 
> {code} > Text extracted: {code}
> But when I use this code:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/plain");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> The Text Parser is used and I get:
> {code} > Text extracted: public class HelloWorld {} {code}
> I have also tested this command: 
> {code}
> > java -jar tika-app-1.5.jar -t D:\text.java
>   (no text)
> {code}





[jira] [Created] (TIKA-1379) error in Tika().detect for xml files with xades signature

2014-07-29 Thread Alessandro De Angelis (JIRA)
Alessandro De Angelis created TIKA-1379:
---

 Summary: error in Tika().detect for xml files with xades signature
 Key: TIKA-1379
 URL: https://issues.apache.org/jira/browse/TIKA-1379
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Alessandro De Angelis


We tried to get the MIME type of an XML file with an embedded XAdES signature. The 
result is "text/html" and not the expected "text/xml" or "application/xml".

Here is an example of the XML file:



00094853 0003 2
2013-09-23
2013-09-23
D69017
FILOSOFIA DELLA SCIENZA
D69
TEATRO E ARTI VISIVE

1233456
PAOLINO
PAPERINO
23.0
23

2012
6.0

9
جامعة البندقية - TEST
Verbale_3
QUI QUO QUA
D69017
FILOSOFIA DELLA SCIENZA
D69
TEATRO E ARTI VISIVE
QUI QUO QUA
26-09-2013 09:55:53 CEST(+0200)

3
11.09.03

http://www.w3.org/2000/09/xmldsig#" Id="sig08744308748201048377">

http://www.w3.org/2006/12/xml-c14n11">
http://www.w3.org/2001/04/xmldsig-more#rsa-sha256">

http://www.w3.org/2002/06/xmldsig-filter2">
http://www.w3.org/2002/06/xmldsig-filter2" Filter="subtract">/descendant::ds:Signature

http://www.w3.org/TR/1999/REC-xslt-19991116">
http://www.kion.it/webesse3/multilingua" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="kion" version="1.0">

Dichiarazione conformità Verbale Esame

Verbalizzazione esame

 td  {font-family: Arial; font-size:10pt;} 
 div {font-family: Arial; font-size:10pt;}
 pre {font-family: Arial; font-size:10pt;} 

DICHIARAZIONE DI CONFORMITÀ
Il sottoscritto , docente di 

PREMESSO CHE

DICHIARA

- 
(**)
- 
che il verbale in calce, firmato digitalmente dal sottoscritto, sostituisce a 
tutti gli effetti di legge quello precedentemente firmato, indicato nella linea 
precedente e conservato a norma
- 
A maggior tutela del firmatario viene riportata la versione originale e gli 
estremi dell'ultima versione firmata

[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077885#comment-14077885
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Normally this will be in the next official 1.6 release, but you can try it with 
this release candidate: http://people.apache.org/~mattmann/apache-tika-1.6/rc1/



[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077873#comment-14077873
 ] 

Andrés Aguilar-Umaña commented on TIKA-1373:


In what version is this going to be released?



Review Request 24052: Adds basic style support.

2014-07-29 Thread Axel Dörfler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24052/
---

Review request for tika.


Bugs: TIKA-1063
https://issues.apache.org/jira/browse/TIKA-1063


Repository: tika


Description
---

Note, I have no idea how to add binary files to the diff (if at all possible). 
The testStyles.odt is supposed to go into the 
"tika-parsers/src/test/resources/test-documents/" directory.


Diffs
-

  
trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
 1614327 
  
trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java
 1614327 
  
trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java 
1614327 

Diff: https://reviews.apache.org/r/24052/diff/


Testing
---

ODFParserTest.testODTStyles() added.


File Attachments


testStyles.odt
  
https://reviews.apache.org/media/uploaded/files/2014/07/29/406503ff-2aef-4609-9955-d3a728402bd5__testStyles.odt


Thanks,

Axel Dörfler



[jira] [Commented] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077690#comment-14077690
 ] 

Chris A. Mattmann commented on TIKA-1378:
-

https://reviews.apache.org/r/24051/



Review Request 24051: MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24051/
---

Review request for tika.


Bugs: TIKA-1378
https://issues.apache.org/jira/browse/TIKA-1378


Repository: tika


Description
---

I introduced a bug into MicrosoftTranslator that creates an NPE when explicitly 
configuring the translator via the setClientId and setSecret methods. Creating 
the translator and configuring implicitly with properties still works. This 
patch fixes the issue and exposes it via a test.


Diffs
-

  
./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java
 1614159 
  
./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java
 1614159 

Diff: https://reviews.apache.org/r/24051/diff/


Testing
---

Tested on DARPA XDATA and via https://github.com/chrismattmann/etllib and 
https://github.com/chrismattmann/tika-python.
Also added unit test:

---
 T E S T S
---
Running org.apache.tika.language.translate.CachedTranslatorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.221 sec
Running org.apache.tika.language.translate.GoogleTranslatorTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec
Running org.apache.tika.language.translate.MicrosoftTranslatorTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 8.556s
[INFO] Finished at: Tue Jul 29 09:05:20 EDT 2014
[INFO] Final Memory: 24M/194M
[INFO] 
[chipotle:~/src/tika-translate] mattmann% 


Thanks,

Chris Mattmann



[jira] [Updated] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1378:


Attachment: TIKA-1378.Mattmann.072914.patch.txt

- added tests to expose NPE
- went ahead and cleaned up the MicrosoftTranslatorTest code 
  - removed System.err.println
  - explicitly create MicrosoftTranslator instead of through the Tika facade



[jira] [Created] (TIKA-1378) MicrosoftTranslator setClient and setId NPE

2014-07-29 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created TIKA-1378:
---

 Summary: MicrosoftTranslator setClient and setId NPE
 Key: TIKA-1378
 URL: https://issues.apache.org/jira/browse/TIKA-1378
 Project: Tika
  Issue Type: Bug
  Components: translation
 Environment: Discovered while using 
https://github.com/chrismattmann/tika-python and 
https://github.com/chrismattmann/etllib on DARPA XDATA.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6


I introduced a bug in MicrosoftTranslator when I was checking for isAvailable 
in the #setClient and #setId methods that produces an NPE when both aren't 
set. The Translator still works when auto configured, just not when explicitly 
configured.

I'll add a patch and unit test. (thanks to [~tpalsulich] for the idea on the 
unit test).
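
The failure mode described above can be sketched roughly as follows. This is only an illustration, not the actual Tika patch: the class ExplicitConfigSketch, its fields, and the setId/setSecret names are hypothetical stand-ins for MicrosoftTranslator, and it assumes that availability is recomputed in each setter and must tolerate the other value still being null.

```java
/**
 * Hypothetical stand-in for a translator configured through two setters.
 * This is NOT the Tika class; names and behavior are assumptions made
 * purely for illustration.
 */
final class ExplicitConfigSketch {
    private String clientId;     // stays null until setId is called
    private String clientSecret; // stays null until setSecret is called
    private boolean available;

    void setId(String id) {
        this.clientId = id;
        refreshAvailability();
    }

    void setSecret(String secret) {
        this.clientSecret = secret;
        refreshAvailability();
    }

    boolean isAvailable() {
        return available;
    }

    // Null-safe check: a value that has not been set yet just means
    // "not available" instead of triggering a NullPointerException.
    private void refreshAvailability() {
        available = isSet(clientId) && isSet(clientSecret);
    }

    private static boolean isSet(String value) {
        return value != null && !value.isEmpty();
    }

    public static void main(String[] args) {
        ExplicitConfigSketch translator = new ExplicitConfigSketch();
        translator.setId("my-id");          // only one of the two values is set
        System.out.println(translator.isAvailable());
        translator.setSecret("my-secret");  // now both values are set
        System.out.println(translator.isAvailable());
    }
}
```

A unit test along the lines suggested in the review would call only one setter and assert that isAvailable() is still false, which is the case that previously threw.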





[jira] [Commented] (TIKA-1316) Old Site Code in Trunk

2014-07-29 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077647#comment-14077647
 ] 

Hudson commented on TIKA-1316:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #119 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/119/])
Remove unused src directory for TIKA-1316. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1614043)
* /tika/trunk/src


> Old Site Code in Trunk
> --
>
> Key: TIKA-1316
> URL: https://issues.apache.org/jira/browse/TIKA-1316
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.6
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The \{tika trunk\}/src/site directory seems to be old and unused. It does not 
> correspond to the site currently on tika.apache.org 
> (http://svn.apache.org/repos/asf/tika/site/).





Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Sergey Beryozkin

Hi
On 29/07/14 13:14, Nick Burch wrote:

On Mon, 28 Jul 2014, Sergey Beryozkin wrote:

This is not an issue that should block the release, I was careful not
to vote with a minus one. I've become a bit impatient, but no one
really blocks me from completing this pure documentation effort
myself, I was hoping that someone would do it first :-).


Given that this is a documentation / website enhancement, I don't see
any reason why we couldn't post the details for 1.6 (and even perhaps
1.5!) to the site in a few weeks time, irrespective of when the 1.6
release goes out :)

Yes, you are right,

Cheers, Sergey


Cheers
Nick





[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread Vilmos Papp (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077587#comment-14077587
 ] 

Vilmos Papp commented on TIKA-1369:
---

Hi Nick,

Thanks for the quick answer. I prefer pull requests over patch attachments.

Cheers,
Vilmos

> Date parsing and thread safety in ImageMetadataExtractor
> 
>
> Key: TIKA-1369
> URL: https://issues.apache.org/jira/browse/TIKA-1369
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
> Environment: OS X 10.9.4 Java 7_60
>Reporter: John Gibson
>Priority: Critical
>
> The {{ImageMetadataExtractor}} uses a static instance of 
> {{SimpleDateFormat}}.  This is not thread safe.
> {code:title=ImageMetadataExtractor.java}
> static class ExifHandler implements DirectoryHandler {
>     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
>             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
>     ...
>     public void handleDateTags(Directory directory, Metadata metadata)
>             throws MetadataException {
>         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
>         Date original = null;
>         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
>             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
>             // Unless we have GPS time we don't know the time zone so date must be set
>             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
>             if (original != null) {
>                 // Same time zone as Metadata Extractor uses
>                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
>                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
>                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
>             }
>         }
>     ...
> {code}
> This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. 
> In the discussion there, the idea of using alternative thread-safe (and 
> faster) formatters from either Joda-Time or Commons Lang was dismissed 
> because they would add too many dependencies. Given that Tika already has a 
> fairly large laundry list of dependencies to parse content, adding one more 
> JAR to make sure things don't break is probably a good idea.
> In addition, because no timezone or locale is specified by either Tika's 
> formatter or the call to com.drew.metadata.Directory, it can wreak havoc 
> during randomized testing. Given that the timezone is unknown, why not just 
> default it to UTC and let the caller guess the timezone? As it stands I have 
> to reparse all of the dates into UTC to get stable behavior across timezones.
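
The UTC default proposed in the description could look something like the sketch below. It is not Tika code: the class UtcDateSketch and the method formatUtc are hypothetical names, and the sketch assumes that creating a new formatter per call is acceptable, which also sidesteps the shared-instance problem. It uses only java.text and java.util, so no extra dependency is involved.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper; the names are assumptions, not part of Tika. */
final class UtcDateSketch {

    // Formats a date as ISO 8601 without a zone suffix, always in UTC and
    // with a fixed locale, so the output does not depend on JVM defaults.
    static String formatUtc(Date date) {
        // A fresh formatter per call: nothing is shared between threads.
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // The epoch is midnight on 1970-01-01 in UTC.
        System.out.println(formatUtc(new Date(0L))); // prints 1970-01-01T00:00:00
    }
}
```

With the zone and locale pinned like this, the same input should give the same string regardless of the machine's default timezone, which is the stability the randomized tests need.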





[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077584#comment-14077584
 ] 

ASF GitHub Bot commented on TIKA-1369:
--

GitHub user vilmospapp opened a pull request:

https://github.com/apache/tika/pull/15

TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor 

Hi,

This fix tries to resolve TIKA-1369 by handling thread safety with a 
ThreadLocal, avoiding additional library dependencies.

I have run the test cases and it seems correct to me, though I haven't 
found any other occurrence of ThreadLocal in Tika's source, so perhaps it's 
against your general patterns.

Regards,
Vilmos

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vilmospapp/tika TIKA-1369

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/15.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15


commit 3a9575fc56a6463b4378b14820e9079352bb1848
Author: Vilmos Papp 
Date:   2014-07-23T09:18:50Z

TIKA-1369 Make SimpleDateFormat usage thread safe
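
The ThreadLocal approach mentioned in the pull request can be sketched as below. This is an illustration of the general technique only, not the contents of the pull request: the class ThreadLocalFormatSketch and its methods are hypothetical, and the sketch leaves the formatter's time zone at the JVM default.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/** Hypothetical illustration of a per-thread SimpleDateFormat. */
final class ThreadLocalFormatSketch {

    // Each thread lazily gets its own SimpleDateFormat instance, so the
    // formatter's mutable internal state is never shared across threads.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
                }
            };

    // Returns the formatter that belongs to the calling thread.
    static SimpleDateFormat current() {
        return FORMAT.get();
    }

    static String format(Date date) {
        return current().format(date);
    }

    public static void main(String[] args) {
        // Output depends on the JVM's default time zone.
        System.out.println(format(new Date()));
    }
}
```

The trade-off is one formatter per thread that stays alive as long as the thread does, which may matter in containers with long-lived thread pools.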






[GitHub] tika pull request: TIKA-1369 Resolve thread safety issue in ImageM...

2014-07-29 Thread vilmospapp
GitHub user vilmospapp opened a pull request:

https://github.com/apache/tika/pull/15

TIKA-1369 Resolve thread safety issue in ImageMetadataExtractor 

Hi,

This fix tries to resolve TIKA-1369 by handling thread safety with a 
ThreadLocal, avoiding additional library dependencies.

I have run the test cases and it seems correct to me, though I haven't 
found any other occurrence of ThreadLocal in Tika's source, so perhaps it's 
against your general patterns.

Regards,
Vilmos

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vilmospapp/tika TIKA-1369

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/15.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15


commit 3a9575fc56a6463b4378b14820e9079352bb1848
Author: Vilmos Papp 
Date:   2014-07-23T09:18:50Z

TIKA-1369 Make SimpleDateFormat usage thread safe




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077578#comment-14077578
 ] 

Nick Burch commented on TIKA-1369:
--

Please send the pull request to the main github repo - 
https://github.com/apache/tika/ - or post a patch here

Please see the Contributing to Apache Tika page - 
http://tika.apache.org/contribute.html - for more on the various supported ways 
to build / test / contribute enhancements and fixes!

> Date parsing and thread safety in ImageMetadataExtractor
> 
>
> Key: TIKA-1369
> URL: https://issues.apache.org/jira/browse/TIKA-1369
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
> Environment: OS X 10.9.4 Java 7_60
>Reporter: John Gibson
>Priority: Critical
>
> The {{ImageMetadataExtractor}} uses a static instance of 
> {{SimpleDateFormat}}.  This is not thread safe.
> {code:title=ImageMetadataExtractor.java}
> static class ExifHandler implements DirectoryHandler {
>     private static final SimpleDateFormat DATE_UNSPECIFIED_TZ =
>             new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
>     ...
>     public void handleDateTags(Directory directory, Metadata metadata)
>             throws MetadataException {
>         // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
>         Date original = null;
>         if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
>             original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
>             // Unless we have GPS time we don't know the time zone so date must be set
>             // as ISO 8601 datetime without timezone suffix (no Z or +/-)
>             if (original != null) {
>                 // Same time zone as Metadata Extractor uses
>                 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original);
>                 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
>                 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
>             }
>         }
>     ...
> {code}
> This is not the first time that SDF has caused problems: TIKA-495, TIKA-864. 
> In the discussion there, the idea of using alternative thread-safe (and 
> faster) formatters from either Joda-Time or Commons Lang was dismissed 
> because they would add too many dependencies. Given that Tika already has a 
> fairly large laundry list of dependencies to parse content, adding one more 
> JAR to make sure things don't break is probably a good idea.
> In addition, because no timezone or locale is specified by either Tika's 
> formatter or the call to com.drew.metadata.Directory, it can wreak havoc 
> during randomized testing. Given that the timezone is unknown, why not just 
> default it to UTC and let the caller guess the timezone? As it stands, I have 
> to reparse all of the dates into UTC to get stable behavior across timezones.
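
The UTC suggestion in the issue could look roughly like the sketch below. It
is only an illustration under stated assumptions: the class and the formatUtc
helper are hypothetical, the "yyyy-MM-dd'T'HH:mm:ss" pattern is assumed from
the description, and it formats a Date only, without addressing how
com.drew.metadata.Directory interprets the raw EXIF value.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class UtcDateFormatSketch {
    // Hypothetical helper, not part of Tika's API.
    static String formatUtc(Date date) {
        // A new formatter per call is never shared between threads. Pinning
        // the locale and time zone makes the output independent of the JVM
        // defaults, which is what randomized testing varies.
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    public static void main(String[] args) {
        // The epoch formats the same way regardless of the default time zone.
        System.out.println(formatUtc(new Date(0L))); // prints 1970-01-01T00:00:00
    }
}
```

Creating a formatter on every call costs some allocation, but it keeps the
example free of shared state. A per-thread formatter would avoid that cost.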





RE: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch

On Mon, 28 Jul 2014, Allison, Timothy B. wrote:

> There was one regression:
> http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx
>
> Stacktrace:
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
>     at java.lang.String.checkBounds(String.java:371)
>     at java.lang.String.<init>(String.java:415)
>     at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
>     at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)


Any chance you could raise a POI bug for this? We're probably going to do 
the next POI beta release within a week, so if you hurry it might even get 
fixed in that... :)


Nick


Re: [VOTE] Apache Tika 1.6 release candidate #1

2014-07-29 Thread Nick Burch

On Mon, 28 Jul 2014, Sergey Beryozkin wrote:

> This is not an issue that should block the release; I was careful not to 
> vote with a minus one. I've become a bit impatient, but no one is really 
> blocking me from completing this pure documentation effort myself; I was 
> hoping that someone would do it first :-).


Given that this is a documentation / website enhancement, I don't see any 
reason why we couldn't post the details for 1.6 (and perhaps even 1.5!) to 
the site in a few weeks' time, irrespective of when the 1.6 release goes 
out :)


Cheers
Nick


[jira] [Commented] (TIKA-1369) Date parsing and thread safety in ImageMetadataExtractor

2014-07-29 Thread Vilmos Papp (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077561#comment-14077561
 ] 

Vilmos Papp commented on TIKA-1369:
---

Hi,

I've sent a pull request on GitHub to fix this: 
https://github.com/chrismattmann/tika/pull/1. I hope I sent it to the proper 
person; if not, where should I send it?

Regards,
Vilmos



