[jira] [Created] (TIKA-1553) Let's add an evil parser to be used in testing parser drivers

2015-02-20 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1553:
-

 Summary: Let's add an evil parser to be used in testing parser 
drivers
 Key: TIKA-1553
 URL: https://issues.apache.org/jira/browse/TIKA-1553
 Project: Tika
  Issue Type: Test
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor


As part of TIKA-1302 and as part of making Tika more robust generally, it would 
be useful to have an evil parser that will throw exceptions/errors and hang for 
lengths of time.  

This will allow us to test timeouts and handling of exceptions and errors in 
tika-server and in tika-batch.  

We could also use this for tests with ForkParser.
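
A minimal sketch of what a parser driver has to survive, assuming a plain `ExecutorService` stands in for the driver and a `Callable` stands in for the evil parser. The class `EvilTaskDriver` and the method `runGuarded` are hypothetical names for illustration and are not part of Tika:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EvilTaskDriver {

    /** Runs a task with a deadline and reports how it ended instead of propagating the failure. */
    public static String runGuarded(Callable<String> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(task);
        try {
            return "OK: " + future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the hung task
            return "TIMEOUT";
        } catch (ExecutionException e) {
            // getCause() is whatever the task threw, Errors included
            return "FAILED: " + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runGuarded(() -> "parsed", 1000));
        System.out.println(runGuarded(() -> { throw new NullPointerException("evil"); }, 1000));
        System.out.println(runGuarded(() -> { throw new OutOfMemoryError("fake oom"); }, 1000));
        System.out.println(runGuarded(() -> { Thread.sleep(5000); return "late"; }, 100));
    }
}
```

The four calls in `main` cover the behaviours the issue asks for: a clean run, an exception, an error, and a hang that outlasts the timeout.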



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1553) Let's add an evil parser to be used in testing parser drivers

2015-02-20 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1553.
---
Resolution: Fixed

r1661129



[jira] [Commented] (TIKA-1553) Let's add an evil parser to be used in testing parser drivers

2015-02-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328991#comment-14328991
 ] 

Hudson commented on TIKA-1553:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #499 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/499/])
TIKA-1553: add an EvilParser for testing purposes (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661129)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/evil
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/evil/EvilParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/evil/EvilParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/META-INF
* /tika/trunk/tika-parsers/src/test/resources/META-INF/services
* /tika/trunk/tika-parsers/src/test/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/test/resources/org
* /tika/trunk/tika-parsers/src/test/resources/org/apache
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/mime
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/mime/custom-mimetypes.xml
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/fake_oom.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/heavy_hang.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/nothing_bad.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/null_pointer.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/null_pointer_no_msg.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/real_oom.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/sleep.evil




[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329523#comment-14329523
 ] 

Uwe Schindler commented on TIKA-1557:
-

I would not make this a special option only for tesseract. As said on 
TIKA-1555, it would be better to have a general way to blacklist some parsers 
through TikaConfig.

Currently you have to maintain the whole list of parsers (or parse META-INF 
yourself) and pass the full list to TikaConfig / AutodetectParser / 
CompositeParser. I would like to have an option in TIKA config to blacklist 
parsers. Ideally this should also work for subclasses, so one could disable all 
ForkParser subclasses by adding ForkParser to the blacklist.

 Create TesseractOCR Option to Never Run
 ---

 Key: TIKA-1557
 URL: https://issues.apache.org/jira/browse/TIKA-1557
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor
 Fix For: 1.8


 As brought up in TIKA-1555, TesseractOCRParser should have an option to never 
 be run. So, we can add an {{enabled}} option to the Config.





[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329547#comment-14329547
 ] 

Luis Filipe Nassif commented on TIKA-1557:
--

I think the same problem that happens with TesseractOCRParser can occur with 
any ExternalParser, like StringsParser or ffmpeg. Maybe it will be better to 
add this option to ExternalParser?



[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread David Pilato (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329509#comment-14329509
 ] 

David Pilato commented on TIKA-1557:


Thanks! I'd not qualify it as a bug though. :)

 Create TesseractOCR Option to Never Run
 ---

 Key: TIKA-1557
 URL: https://issues.apache.org/jira/browse/TIKA-1557
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor
 Fix For: 1.8


 As brought up in TIKA-1555, TesseractOCRParser should have an option to never 
 be run. So, we can add an {{enabled}} option to the Config.





[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329523#comment-14329523
 ] 

Uwe Schindler edited comment on TIKA-1557 at 2/20/15 9:05 PM:
--

I would not make this a special option only for tesseract. As said on 
TIKA-1555, it would be better to have a general way to blacklist some parsers 
through TikaConfig.

Currently you have to maintain the whole list of parsers (or parse META-INF 
yourself) and pass the full list to TikaConfig / AutodetectParser / 
CompositeParser. I would like to have an option in TIKA config to blacklist 
parsers. Ideally this should also work for subclasses, so one could disable all 
ExternalParser subclasses by adding ExternalParser to blacklist.


was (Author: thetaphi):
I would not make this a special option only for tesseract. As said on 
TIKA-1555, it would be better to have a general way to blacklist some parsers 
through TikaConfig.

Currently you have to maintain the whole list of parsers (or parse META-INF 
yourself) and pass the full list to TikaConfig / AutodetectParser / 
CompositeParser. I would like to have an option in TIKA config to blacklist 
parsers. Ideally this should also work for subclasses, so one could disable all 
ForkParser subclasses by adding ForkParser to blacklist.



[jira] [Created] (TIKA-1558) Create a Parser Blacklist

2015-02-20 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1558:
-

 Summary: Create a Parser Blacklist
 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich


As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
disable Parsers without pulling their dependencies out. In some cases (e.g. 
disable all ExternalParsers), there may not be an easy way to exclude the 
dependencies via Maven.

So, an initial design would be to include another file like 
{{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new 
method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
{{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that 
are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.
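
The "assignable to" rule in this design can be sketched with `Class#isAssignableFrom`. This is a hedged illustration only: `BlacklistFilter` and `filter` are hypothetical names, JDK collection classes stand in for parsers, and the real `ServiceLoader` change may be structured differently:

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;

public class BlacklistFilter {

    /** Keeps only the providers that are not assignable to any blacklisted class. */
    public static List<Class<?>> filter(List<Class<?>> providers, List<Class<?>> blacklist) {
        List<Class<?>> kept = new ArrayList<>();
        for (Class<?> provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                // true when provider is the banned class itself or one of its subclasses
                if (banned.isAssignableFrom(provider)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-ins: AbstractList plays the role of a blacklisted base parser
        List<Class<?>> providers = Arrays.asList(ArrayList.class, LinkedList.class, HashSet.class);
        List<Class<?>> blacklist = Arrays.asList(AbstractList.class);
        System.out.println(filter(providers, blacklist)); // only HashSet survives
    }
}
```

Blacklisting a base class this way removes every subclass with a single entry, which is the behaviour asked for in TIKA-1557.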





[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329523#comment-14329523
 ] 

Uwe Schindler edited comment on TIKA-1557 at 2/20/15 8:42 PM:
--

I would not make this a special option only for tesseract. As said on 
TIKA-1555, it would be better to have a general way to blacklist some parsers 
through TikaConfig.

Currently you have to maintain the whole list of parsers (or parse META-INF 
yourself) and pass the full list to TikaConfig / AutodetectParser / 
CompositeParser. I would like to have an option in TIKA config to blacklist 
parsers. Ideally this should also work for subclasses, so one could disable all 
ForkParser subclasses by adding ForkParser to blacklist.


was (Author: thetaphi):
I would not make this a special option only for tesseract. As said on 
TIKA-1555, it would be better to have a general way to blacklist some parsers 
through TikaConfig.

Currently you have to maintain the whole list of parsers (or parse META-INF 
yourself) and pass the full list to TikaConfig / AutodetectParser / 
CompositeParser. I would like to have an option in TIKA config to blacklist 
parsers. Ideally this should work alos for subclasses, so one could disable all 
ForkParser subclasses by adding ForkParser to blacklist.



[jira] [Updated] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1557:
--
Issue Type: New Feature  (was: Bug)



[jira] [Closed] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1557.
-
   Resolution: Won't Fix
Fix Version/s: (was: 1.8)

Closing this as Won't Fix for a clean record. I'll open a new issue regarding 
a Parser blacklist.

 Create TesseractOCR Option to Never Run
 ---

 Key: TIKA-1557
 URL: https://issues.apache.org/jira/browse/TIKA-1557
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor
 Attachments: TIKA-1557.palsulich.patch


 As brought up in TIKA-1555, TesseractOCRParser should have an option to never 
 be run. So, we can add an {{enabled}} option to the Config.





[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330010#comment-14330010
 ] 

Hudson commented on TIKA-1558:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #501 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/501/])
TIKA-1558. Enable blacklisting of Parsers and other services with a 
servicename.blacklist META-INF file. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661284)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParser.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserSubclass.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserTest.java
* /tika/trunk/tika-core/src/test/resources/META-INF
* /tika/trunk/tika-core/src/test/resources/META-INF/services
* /tika/trunk/tika-core/src/test/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-core/src/test/resources/META-INF/services/org.apache.tika.parser.Parser.blacklist
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/mime/custom-mimetypes.xml
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist2_file.blacklist2
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist_file.blacklist




[jira] [Closed] (TIKA-1187) java.lang.OutOfMemoryError: Java heap space

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1187.
-
Resolution: Cannot Reproduce

 java.lang.OutOfMemoryError: Java heap space
 ---

 Key: TIKA-1187
 URL: https://issues.apache.org/jira/browse/TIKA-1187
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.3
 Environment: Ubuntu 
Reporter: GURFAN
Priority: Critical
   Original Estimate: 612h
  Remaining Estimate: 612h

 Hi,
 While parsing content, we get the exception below in the parse method.
 The file we are parsing is 1 MB.
 TIKA JAR:  tika-core-1.3.jar
 File size: 1 MB.
 Parser parser = new AutoDetectParser();
 parser.parse(is, handler, metaData, new ParseContext());
 java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2734)
   at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
   at java.util.ArrayList.add(ArrayList.java:351)
   at org.apache.fontbox.ttf.GlyfCompositeDescript.<init>(GlyfCompositeDescript.java:60)
   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:63)
   at org.apache.fontbox.ttf.GlyphTable.initData(GlyphTable.java:71)
   at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:163)
   at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:61)
   at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:90)
   at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26)
   at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:66)
   at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26)
   at org.apache.tika.parser.font.TrueTypeParser.parse(TrueTypeParser.java:65)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at com.impetus.vajra.parser.tika.TikaParser.processContent(TikaParser.java:96)
   at com.impetus.vajra.storm.helper.TextAnalyserBoltHelper.execute(TextAnalyserBoltHelper.java:283)
   at com.impetus.vajra.storm.TextAnalyserBolt.execute(TextAnalyserBolt.java:182)
   at backtype.storm.daemon.executor$fn__4050$tuple_action_fn__4052.invoke(executor.clj:566)
   at backtype.storm.daemon.executor$mk_task_receiver$fn__3976.invoke(executor.clj:345)
   at backtype.storm.disruptor$clojure_handler$reify__1606.onEvent(disruptor.clj:43)
   at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:84)
   at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:58)
   at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62)
   at backtype.storm.daemon.executor$fn__4050$fn__4059$fn__4106.invoke(executor.clj:658)
   at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
   at clojure.lang.AFn.run(AFn.java:24)
   at java.lang.Thread.run(Thread.java:662)





[jira] [Closed] (TIKA-1250) Process loops infintely processing a CHM file

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1250.
-
Resolution: Cannot Reproduce

We can't reproduce this without the file. And, there were some significant CHM 
parsing updates. So, I'm closing this off.

 Process loops infintely processing a CHM file
 -

 Key: TIKA-1250
 URL: https://issues.apache.org/jira/browse/TIKA-1250
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
 Environment: Java 7 on Linux
Reporter: Gary Murphy
Priority: Critical

 Parsing process loops infinitely on certain CHM files.  This is NOT the same 
 as TIKA-1152





[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

2015-02-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330021#comment-14330021
 ] 

Tyler Palsulich commented on TIKA-1194:
---

[~tssk], were you ever able to create a safe version of the file? Do you 
still have it? It's been a while since this issue was opened.

 Missing text from MS Word (DOC) file
 

 Key: TIKA-1194
 URL: https://issues.apache.org/jira/browse/TIKA-1194
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Tomas Safarik
Priority: Critical

 Hello,
 we noticed that filtered text from some MS Word DOC files is missing one line 
 (in table cell) in the original document.
 - If you add or remove one character anywhere before the problematic 
 line/cell then the filtered text is correct. If you get the text back to 
 original the filtering problem is back.
 - If the file is resaved as DOCX filtering works fine.
 I will provide sample document. And please let me know if more information is 
 needed.
 Regards,
 Tomas





[jira] [Closed] (TIKA-1239) Using Spring and Tika together. Need to extract the content and metadata.

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1239.
-
Resolution: Cannot Reproduce

 Using Spring and Tika together. Need to extract the content and metadata. 
 --

 Key: TIKA-1239
 URL: https://issues.apache.org/jira/browse/TIKA-1239
 Project: Tika
  Issue Type: Task
  Components: general, metadata, parser
Reporter: sudheshna iyer
Priority: Critical

 I need to use Spring with Tika. Is it thread safe to use the following 
 objects injected from the bean context? I am injecting parseContext, handler 
 and parser into my class TikaImpl. 
 
 <bean name="parseContext" class="org.apache.tika.parser.ParseContext"></bean>
 <bean name="parser" class="org.apache.tika.parser.AutoDetectParser"></bean>
 <bean name="handler" class="org.xml.sax.helpers.DefaultHandler"></bean>

 <bean id="tikaService" class="com.intech.tika.TikaImpl">
   <property name="parseContext" ref="parseContext"></property>
   <property name="parser" ref="parser"></property>
   <property name="handler" ref="handler"></property>
   <property name="resourcesize"><value>10485760</value></property>
 </bean>
 ===
 In my class I have three methods: one to retrieve metadata, one to retrieve 
 content, and one to retrieve both.
 1. To retrieve metadata, I am using: 
 parser.parse(stream, handler,
   metadata, parseContext);
 2. To retrieve the content, I am using: 
 Tika tika = new Tika();
 tika.setMaxStringLength(resourcesize);
 String content = tika.parseToString(stream);
 3. To retrieve both, I am using: 
 BodyContentHandler bodyContentHandler = new BodyContentHandler(resourcesize);
 Metadata metadata = new Metadata();
 parser.parse(TikaInputStream.get(stream), bodyContentHandler, metadata, 
 parseContext);
 Question: 
 Is my approach thread safe? I introduced three methods on the assumption that 
 getting only the metadata with the first method is faster than the third. 
 I would appreciate your suggestions. Thank you in advance.





[jira] [Resolved] (TIKA-1558) Create a Parser Blacklist

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1558.
---
   Resolution: Fixed
Fix Version/s: 1.8
 Assignee: Tyler Palsulich

Above strategy added in r1661284. You can now blacklist Parsers by adding names 
to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same 
format as the normal services file. If a class is blacklisted, all of its 
subclasses are automatically blacklisted.
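
As a rough sketch of that file format, the reader below assumes the conventions of standard Java provider-configuration files: one fully qualified class name per line, `#` starting a comment, blank lines ignored. `ServiceFileLines` and `classNames` are hypothetical names, and Tika's own `ServiceLoader` may parse the file differently:

```java
import java.util.ArrayList;
import java.util.List;

public class ServiceFileLines {

    /** Extracts class names from services-style text, dropping comments and blank lines. */
    public static List<String> classNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String raw : fileContent.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip the trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Example content for META-INF/services/org.apache.tika.parser.Parser.blacklist
        String blacklist =
                "# parsers to disable\n"
              + "org.apache.tika.parser.external.ExternalParser\n"
              + "\n"
              + "org.apache.tika.parser.ocr.TesseractOCRParser  # no OCR here\n";
        System.out.println(classNames(blacklist));
    }
}
```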



[jira] [Commented] (TIKA-1437) encoding issue in AutoDetectReader

2015-02-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330017#comment-14330017
 ] 

Tyler Palsulich commented on TIKA-1437:
---

[~Lukeliush], can you make a couple updates to make this easier to test? First, 
come up with a small (few line) file with this problem. That way, we can be 
sure we can legally include the file within Tika. Also, can you reformat your 
testing script as a Tika JUnit TestCase? You can see an example 
[here|https://github.com/apache/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java].

The file you have might just be corrupted -- giving different results. And, as 
Tim mentioned, no detector will be perfect, so different detectors will give 
different results. But, the above changes will help us narrow it down. Thanks!

 encoding issue in AutoDetectReader
 --

 Key: TIKA-1437
 URL: https://issues.apache.org/jira/browse/TIKA-1437
 Project: Tika
  Issue Type: Bug
  Components: detector, parser
Affects Versions: 1.6
 Environment: Windows 8
Reporter: Luke sh
Priority: Critical
 Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv, 
 e9.jpg, ef.jpg


 We are having an encoding problem with Tika's AutoDetectReader. We use 
 AutoDetectReader to read a stream and extract string values by calling 
 AutoDetectReader#readLine(). The problem occurs in UniversalEncodingDetector, 
 which AutoDetectReader calls when reading the input stream passed as one of 
 the arguments to our TSVParser’s parse method. 
 We use AutoDetectReader in our parser because we believed it could 
 automatically detect the correct encoding of the input stream passed to it, 
 but we are seeing several garbled characters in the converted output files 
 from our parser. We found that UniversalEncodingDetector returns UTF-8, so 
 AutoDetectReader reads the stream as UTF-8, which is incorrect; the correct 
 encoding is ISO-8859-1.
 I am attaching screenshots (e9.jpg and ef.jpg) of the character differences 
 we see between the input TSV file and the converted output file; please read 
 the description for details.
 The problem is that AutoDetectReader decodes and reads the characters with 
 the wrong encoding. 
 We were able to work around this problem with CharsetDetector, which for the 
 moment seems to produce a valid encoding that lets us read the TSV file 
 properly.
 However, this means we cannot use AutoDetectReader; we have to create our own 
 TSVAutoDetectReader that uses CharsetDetector in its detect method. 
 AutoDetectReader is not flexible enough for us to extend: many of its methods 
 are private, so we cannot manually set the encoding or override the existing 
 encoding detection.
 In addition, I am not confident about CharsetDetector either, as I am seeing 
 CharsetDetector and AutoDetectReader produce different encodings for 
 different TSV files. For now, we can live with CharsetDetector, as it solves 
 the current encoding problem.
 Finally, I am attaching my test program (EncodingProblem.java), which reads 
 an input directory of TSV files and displays the encodings produced for each 
 file by AutoDetectReader, UniversalEncodingDetector (which is called by 
 AutoDetectReader) and CharsetDetector, so you can see that they produce 
 different encodings for some TSV files.
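
 The kind of garbling described here can be reproduced with the JDK alone. The sketch below is an assumption-laden illustration, not the reporter's test program: it takes the ISO-8859-1 bytes of a short sample word and decodes them once with the right charset and once as UTF-8. `WrongCharsetDemo` and its methods are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {

    /** Decodes the bytes with the charset the file was actually written in. */
    public static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the same bytes as UTF-8, as a wrong detection result would. */
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf", then byte 0xE9 (e-acute in ISO-8859-1), then "!"
        byte[] sample = {'c', 'a', 'f', (byte) 0xE9, '!'};

        String right = decodeLatin1(sample);
        String wrong = decodeUtf8(sample);

        // 0xE9 alone is not valid UTF-8, so the decoder substitutes U+FFFD
        System.out.println("ISO-8859-1 code point at index 3: U+"
                + Integer.toHexString(right.charAt(3)).toUpperCase());
        System.out.println("UTF-8 code point at index 3:      U+"
                + Integer.toHexString(wrong.charAt(3)).toUpperCase());
    }
}
```

 A single high byte such as 0xE9 is valid ISO-8859-1 but malformed UTF-8, so text that happens to contain few such bytes gives a detector little evidence either way.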





[jira] [Created] (TIKA-1554) Improve EMF file detection

2015-02-20 Thread Luis Filipe Nassif (JIRA)
Luis Filipe Nassif created TIKA-1554:


 Summary: Improve EMF file detection
 Key: TIKA-1554
 URL: https://issues.apache.org/jira/browse/TIKA-1554
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif


I am getting many files being incorrectly detected as application/x-emf. I 
think the current magic is too common. According to MS documentation 
(https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved to:
{code}
<mime-type type="application/x-emf">
  <acronym>EMF</acronym>
  <_comment>Extended Metafile</_comment>
  <glob pattern="*.emf"/>
  <magic priority="50">
    <match value="0x0100" type="string" offset="0">
      <match value=" EMF" type="string" offset="40"/>
    </match>
  </magic>
</mime-type>
{code}
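
The nested match above can be read as "bytes at offset 0 AND the signature at offset 40". A hedged Java sketch of that check follows; it assumes the `0x0100` value is interpreted as the two raw bytes 0x01 0x00, and `EmfMagic` and `looksLikeEmf` are hypothetical names rather than Tika code:

```java
public class EmfMagic {

    /** Returns true when the buffer starts with 0x01 0x00 and has " EMF" at offset 40. */
    public static boolean looksLikeEmf(byte[] head) {
        if (head.length < 44) {
            return false; // too short to contain the signature at offset 40
        }
        boolean startsWithHeaderBytes = head[0] == 0x01 && head[1] == 0x00;
        boolean hasSignature = head[40] == ' '
                && head[41] == 'E'
                && head[42] == 'M'
                && head[43] == 'F';
        // Both conditions must hold, mirroring the nested <match> elements
        return startsWithHeaderBytes && hasSignature;
    }

    public static void main(String[] args) {
        byte[] fake = new byte[44];
        fake[0] = 0x01;
        fake[40] = ' ';
        fake[41] = 'E';
        fake[42] = 'M';
        fake[43] = 'F';
        System.out.println(looksLikeEmf(fake));
        System.out.println(looksLikeEmf(new byte[44]));
    }
}
```

Requiring the signature at offset 40 in addition to the leading bytes is what makes the magic less likely to match unrelated files.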





[jira] [Created] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread David Pilato (JIRA)
David Pilato created TIKA-1555:
--

 Summary: posix_spawn is not a supported process launch mechanism 
on this platform
 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato


On some systems, posix_spawn may not be a supported process launch mechanism.

We run randomized tests that simulate different Locales, so I sometimes hit 
this issue:

{code}
java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
    at java.lang.UNIXProcess$1.run(UNIXProcess.java:104)
    at java.lang.UNIXProcess$1.run(UNIXProcess.java:93)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:91)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
    at java.lang.Runtime.exec(Runtime.java:617)
    at java.lang.Runtime.exec(Runtime.java:485)
    at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
    at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
    at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
    at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
    at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
    at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:506)
{code}

It sounds like it's related to this: 
http://java.thedizzyheights.com/2014/07/java-error-posix_spawn-is-not-a-supported-process-launch-mechanism-on-this-platform-when-trying-to-spawn-a-process/

Though I have a hard time reproducing it!

BTW, I wonder if we could add a setting that makes 
{{TesseractOCRParser#hasTesseract}} return {{false}} even when tesseract is 
available.

For example, let's say my machine hosts multiple applications and for one of 
them I don't want any OCR on my documents.

Hope this helps.
Let me know if you need more details.





[jira] [Commented] (TIKA-1554) Improve EMF file detection

2015-02-20 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329138#comment-14329138
 ] 

Nick Burch commented on TIKA-1554:
--

Do you have any small files which incorrectly trigger it now? One of those 
would be good for a unit test for this!

 Improve EMF file detection
 --

 Key: TIKA-1554
 URL: https://issues.apache.org/jira/browse/TIKA-1554
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif

 I am getting many files being incorrectly detected as application/x-emf. I 
 think the current magic is too common. According to MS documentation 
 (https://msdn.microsoft.com/en-us/library/cc230635.aspx and 
 https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved 
 to:
 {code}
 <mime-type type="application/x-emf">
   <acronym>EMF</acronym>
   <_comment>Extended Metafile</_comment>
   <glob pattern="*.emf"/>
   <magic priority="50">
     <match value="0x0100" type="string" offset="0">
       <match value=" EMF" type="string" offset="40"/>
     </match>
   </magic>
 </mime-type>
 {code}





[jira] [Updated] (TIKA-1554) Improve EMF file detection

2015-02-20 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-1554:
-
Attachment: nonEmf.dat

Yes, I have attached a very simple one, consisting only of the current 4-byte 
magic.

 Improve EMF file detection
 --

 Key: TIKA-1554
 URL: https://issues.apache.org/jira/browse/TIKA-1554
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
 Attachments: nonEmf.dat




[jira] [Commented] (TIKA-1460) Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'

2015-02-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330031#comment-14330031
 ] 

Tyler Palsulich commented on TIKA-1460:
---

Hi [~onyas]. The dialog isn't in a very intuitive spot. It's under More > 
Attach files. I found a PostScript version of the file under 
{{/usr/share/fonts/cmap/}}, but not a PDF. I'm also curious whether a newer 
version of Tika would solve your problem.

 Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'
 --

 Key: TIKA-1460
 URL: https://issues.apache.org/jira/browse/TIKA-1460
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: win7,myeclipse8.5
Reporter: onyas
Priority: Critical

 For some reason I could not upload the file, so here is the info.
 I checked every version in the directory 
 \org\apache\pdfbox\resources\cmap and could not find the 'Adobe-GBK1-UCS2' file.
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d640af
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 Caused by: java.lang.IllegalArgumentException: Position 66048 past the end of the file
    at org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:50)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:420)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:397)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:356)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:202)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 21 more
 The main code is:
 Parser parser = new AutoDetectParser();
 ContentHandler handler = new BodyContentHandler(getNum());
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();
 InputStream stream = null;
 StringBuffer content = new StringBuffer();
 try {
     stream = new FileInputStream(file);
     if (stream != null) {
         parser.parse(stream, handler, metadata, context);
         content = content.append(handler);

         if (StringUtils.isNotBlank(content.toString())) {
             hasContent = true;
             handler = null;
             metadata = null;
             context = null;
         }
     }
 The exception is thrown at this line: parser.parse(stream, handler, 
 metadata, context);





[jira] [Resolved] (TIKA-1521) Handle password protected 7zip files

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1521.
---
Resolution: Fixed

Thanks for finding a workaround, Tim! Closing this now that Jenkins is happy.

 Handle password protected 7zip files
 

 Key: TIKA-1521
 URL: https://issues.apache.org/jira/browse/TIKA-1521
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Nick Burch
 Fix For: 1.8


 While working on TIKA-1028, I noticed that while Commons Compress doesn't 
 currently handle decrypting password protected zip files, it does handle 
 password protected 7zip files.
 We should therefore add logic into the package parser to spot password 
 protected 7zip files, and fetch the password for them from a PasswordProvider 
 if given.





[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329276#comment-14329276
 ] 

Uwe Schindler commented on TIKA-1555:
-

Also, this issue in the JDK is already fixed in Java 7u80 and 8u40 (to be 
released in the next 2 months): https://bugs.openjdk.java.net/browse/JDK-8047340
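
For context, the locale sensitivity behind this error can be reproduced without spawning a process. The sketch below assumes, in line with my reading of the linked JDK issue, that the launch mechanism name is case-converted using the default locale before it is looked up. The enum and method names are illustrative stand-ins, not the JDK's internal code.

```java
import java.util.Locale;

/** Illustrative stand-in for a locale-sensitive enum lookup; not JDK code. */
final class TurkishCaseSketch {
    enum LaunchMechanism { FORK, POSIX_SPAWN, VFORK }

    /** Upper-cases the name with the given locale and tries to resolve it. */
    static boolean resolves(String name, Locale locale) {
        try {
            LaunchMechanism.valueOf(name.toUpperCase(locale));
            return true;
        } catch (IllegalArgumentException e) {
            // In a Turkish locale, 'i' upper-cases to dotted capital I (U+0130),
            // so the converted name no longer equals POSIX_SPAWN.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolves("posix_spawn", Locale.ROOT));
        System.out.println(resolves("posix_spawn", Locale.forLanguageTag("tr-TR")));
    }
}
```

Using a fixed locale such as Locale.ROOT for this kind of identifier conversion avoids the problem regardless of the user's default locale.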

 posix_spawn is not a supported process launch mechanism on this platform
 

 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato
  Labels: ocr, parser



[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329282#comment-14329282
 ] 

Uwe Schindler commented on TIKA-1555:
-

@UweSays: https://twitter.com/UweSays/status/501425093613207552

 posix_spawn is not a supported process launch mechanism on this platform
 

 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato
  Labels: ocr, parser



[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread David Pilato (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329300#comment-14329300
 ] 

David Pilato commented on TIKA-1555:


Thank you Uwe. I don't understand why I was not able to find the other issue!
I'm pretty sure I searched for it before opening this one...

I guess we can close this one as a duplicate then?

Thanks!

 posix_spawn is not a supported process launch mechanism on this platform
 

 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato
  Labels: ocr, parser



[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329272#comment-14329272
 ] 

Uwe Schindler commented on TIKA-1555:
-

This is a duplicate of TIKA-1526.

 posix_spawn is not a supported process launch mechanism on this platform
 

 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato
  Labels: ocr, parser



[jira] [Closed] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1555.
-
Resolution: Duplicate
  Assignee: Tyler Palsulich

 posix_spawn is not a supported process launch mechanism on this platform
 

 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato
Assignee: Tyler Palsulich
  Labels: ocr, parser



[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329350#comment-14329350
 ] 

Uwe Schindler commented on TIKA-1526:
-

I was not able to test this, because I have no MacOSX computer, and FreeBSD is 
only available as a Jenkins server.

Maybe [~dadoonet] can try the same with the elasticsearch-mapper-attachments 
module.

 ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so 
 Turkish Tika users can still use non-external parsers
 

 Key: TIKA-1526
 URL: https://issues.apache.org/jira/browse/TIKA-1526
 Project: Tika
  Issue Type: Wish
Reporter: Hoss Man

 The JDK has numerous pain points regarding the Turkish locale, posix_spawn 
 lowercasing being one of them...
 https://bugs.openjdk.java.net/browse/JDK-8047340
 https://bugs.openjdk.java.net/browse/JDK-8055301
 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is 
 enabled & configured by default in Tika, and uses ExternalParser.check to see 
 if tesseract is available -- but because of the JDK bug, this means that Tika 
 fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like 
 so...
 {noformat}
   [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
   [junit4]   at java.security.AccessController.doPrivileged(Native Method)
   [junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
   [junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
   [junit4]   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:620)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:485)
   [junit4]   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
   [junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
   [junit4]   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
   [junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
   [junit4]   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
   [junit4]   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
   [junit4]   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   [junit4]   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 {noformat}
 ...unless they go out of their way to whitelist only the parsers they 
 need/want so TesseractOCRParser (and any other ExternalParsers) will never 
 even be check()ed.
 It would be nice if Tika's ExternalParser class added a similar 
 hack/workaround to what was done in SOLR-6387 to trap these types of errors. 
 In Solr we just propagate a better error explaining why Java hates the 
 Turkish language...
 {code}
 } catch (Error err) {
   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") 
       || err.getMessage().contains("UNIXProcess"))) {
     log.warn("Error forking command due to JVM locale bug (see "
         + "https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
     return "(error executing: " + cmd + ")";
   }
 }
 {code}
 ...but with Tika, it might be better for all ExternalParsers to just opt 
 out as if they don't recognize the filetype when they detect this type of 
 error from the check method (or perhaps it would be better if 
 AutoDetectParser handled this? ... I'm not really sure how it would best fit 
 into Tika's architecture).
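
One way to read the suggestion above: a check-style method could treat this specific Error as "tool not available" instead of letting it propagate. The following is a minimal sketch under that assumption. ExternalCheckSketch, isLocaleLaunchBug, and isAvailable are hypothetical names and do not reflect Tika's ExternalParser API.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical availability check that traps the locale-related launch Error. */
final class ExternalCheckSketch {
    /** Mirrors the message test used in the Solr snippet quoted above. */
    static boolean isLocaleLaunchBug(Throwable t) {
        String msg = t.getMessage();
        return msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    /** Returns true only if the command starts and exits with status 0. */
    static boolean isAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (out.read(buffer) != -1) {
                    // Discard output so the child process cannot block on a full pipe.
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            // Typically means the program could not be found or started.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleLaunchBug(err)) {
                // Opt out quietly: report the external tool as unavailable.
                return false;
            }
            throw err; // Unrelated errors are not swallowed.
        }
    }
}
```

Only errors whose message matches the known bug are trapped, so genuinely fatal errors still surface.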





[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329344#comment-14329344
 ] 

Uwe Schindler commented on TIKA-1555:
-

Hi David,
can you try to compile Tika from a current trunk checkout and test it with ES? 
If this fixes the issue with the Turkish locale, could you report on TIKA-1526? 
For me it's hard to reproduce with Windows or Linux. I analyzed the issue, 
reported the bug to Oracle, and fixed Solr 5.0, but I did no thorough testing 
on the Tika issue.

 posix_spawn is not a supported process launch mechanism on this platform
 

 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato
Assignee: Tyler Palsulich
  Labels: ocr, parser



[jira] [Resolved] (TIKA-1323) Improve exception reporting in JAX-RS server

2015-02-20 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1323.
---
Resolution: Fixed

r1661193

Commandline option -includeStack will enable this behavior.  I centralized and 
made exception handling more uniform for parsing for /tika, /unpack and /rmeta. 
 /meta is still slightly different for backwards compatibility.
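
As an illustration of what an opt-in stack trace in an error response could look like, here is a small sketch. The errorBody helper and its includeStack parameter are hypothetical and do not reproduce tika-server's actual code.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical helper: renders a stack trace for a response body when enabled. */
final class ErrorBodySketch {
    static String errorBody(Throwable t, boolean includeStack) {
        if (!includeStack) {
            // Default behaviour in this sketch: no stack trace is exposed.
            return "";
        }
        StringWriter buffer = new StringWriter();
        try (PrintWriter writer = new PrintWriter(buffer)) {
            t.printStackTrace(writer);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(errorBody(new IllegalStateException("boom"), true));
    }
}
```

Keeping the trace behind a flag lets a client record per-document failures without exposing internals by default.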

 Improve exception reporting in JAX-RS server
 

 Key: TIKA-1323
 URL: https://issues.apache.org/jira/browse/TIKA-1323
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Tim Allison
Priority: Minor

 I'd like to use tika-server for TIKA-1302.  As part of that, I'd like to 
 record exception stacktraces per document.  I see two options: transmit the 
 info back to the client (assuming a doc didn't bring the server down :) ) 
 along with the current error code or log the document id and stacktrace via 
 the server.  Given my current design thoughts, I'd prefer the first option.
 Any objections or recommendations?





[jira] [Resolved] (TIKA-1556) Clean up whitespace in tika-server

2015-02-20 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1556.
---
Resolution: Fixed

r1661200.

 Clean up whitespace in tika-server
 --

 Key: TIKA-1556
 URL: https://issues.apache.org/jira/browse/TIKA-1556
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.8


 We have 2- and 4-space indents in different parts of tika-server's code.  
 Let's make this consistent with the rest of Tika.





[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329364#comment-14329364
 ] 

Uwe Schindler commented on TIKA-1555:
-

bq. BTW I wonder if we could add a setting which can return false for 
TesseractOCRParser#hasTesseract even if we have tesseract available.

You can remove / add custom parsers through the TikaConfig. But I agree, it's 
hard to maintain, because you have to provide a static list. I would really 
like to have a separate TikaConfig option to explicitly disable some parsers, 
so I can use the default SPI lookup, but blacklist parsers. We would like to 
do the same in Solr, too.
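
A rough sketch of the blacklist idea, assuming nothing about the real TikaConfig API: the Parser interface and the two parser classes below are hypothetical stand-ins, and a real option would more likely match fully qualified class names than simple names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, NOT the Tika API: keep the dynamically discovered
// parsers, but drop the ones named in a blacklist.
class ParserBlacklistSketch {

    // Stand-ins for real parser types, for illustration only.
    interface Parser { }
    static class PdfParser implements Parser { }
    static class OcrParser implements Parser { }

    static List<Parser> withoutBlacklisted(List<Parser> discovered, Set<String> blacklist) {
        List<Parser> kept = new ArrayList<>();
        for (Parser parser : discovered) {
            // Simple class names keep the sketch short.
            if (!blacklist.contains(parser.getClass().getSimpleName())) {
                kept.add(parser);
            }
        }
        return kept;
    }
}
```

The point of this shape is that the list of available parsers still comes from discovery, so only the exclusions have to be maintained by hand.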

 posix_spawn is not a supported process launch mechanism on this platform
 

 Key: TIKA-1555
 URL: https://issues.apache.org/jira/browse/TIKA-1555
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
 Environment: MacOS X 10.10.2
Reporter: David Pilato
Assignee: Tyler Palsulich
  Labels: ocr, parser

 It could happen on some systems that posix_spawn is not a supported process 
 launch mechanism.
 We are doing random testing which simulates different kinds of Locale, so I 
 sometimes hit this issue:
 {code}
 java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
   at java.lang.UNIXProcess$1.run(UNIXProcess.java:104)
   at java.lang.UNIXProcess$1.run(UNIXProcess.java:93)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:91)
   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
   at java.lang.Runtime.exec(Runtime.java:617)
   at java.lang.Runtime.exec(Runtime.java:485)
   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.Tika.parseToString(Tika.java:506)
 {code}
 It sounds like it's related to this: 
 http://java.thedizzyheights.com/2014/07/java-error-posix_spawn-is-not-a-supported-process-launch-mechanism-on-this-platform-when-trying-to-spawn-a-process/
 Though I have a hard time reproducing it!
 BTW I wonder if we could add a setting which can return {{false}} for 
 {{TesseractOCRParser#hasTesseract}} even if we have tesseract available.
 For example, let's say that my machine is shared by multiple applications and 
 for one of them I don't want any OCR on my documents.
 Hope this helps.
 Let me know if you need more details.





[jira] [Created] (TIKA-1556) Clean up whitespace in tika-server

2015-02-20 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1556:
-

 Summary: Clean up whitespace in tika-server
 Key: TIKA-1556
 URL: https://issues.apache.org/jira/browse/TIKA-1556
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.8


We have 2- and 4-space indents in different parts of tika-server's code.  Let's 
make this consistent with the rest of Tika.





[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server

2015-02-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329426#comment-14329426
 ] 

Tim Allison commented on TIKA-1323:
---

Now running with this option on TIKA-1301's server: 162.209.99.130, port 9998.



[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server

2015-02-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329432#comment-14329432
 ] 

Hudson commented on TIKA-1323:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #500 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/500/])
TIKA-1323: allow tika-server to return stack traces from parse exceptions for 
easier analysis of parser exceptions via tika-server. (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661193)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-server/pom.xml
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/MetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/RecursiveMetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaExceptionMapper.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseException.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseExceptionMapper.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/MetadataResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceOffTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java
* /tika/trunk/tika-server/src/test/resources/META-INF
* /tika/trunk/tika-server/src/test/resources/META-INF/services
* /tika/trunk/tika-server/src/test/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-server/src/test/resources/evil
* /tika/trunk/tika-server/src/test/resources/evil/null_pointer.evil
* /tika/trunk/tika-server/src/test/resources/mime
* /tika/trunk/tika-server/src/test/resources/mime/custom-mimetypes.xml




[jira] [Commented] (TIKA-1556) Clean up whitespace in tika-server

2015-02-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329433#comment-14329433
 ] 

Hudson commented on TIKA-1556:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #500 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/500/])
TIKA-1556 clean up whitespace in tika-server (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661200)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/CSVMessageBodyWriter.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/DetectorResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/HTMLHelper.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/MetadataListMessageBodyWriter.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/MetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/RecursiveMetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/RichTextContentHandler.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TarWriter.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TextMessageBodyWriter.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaDetectors.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaLoggingFilter.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaMimeTypes.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaParsers.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseExceptionMapper.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaVersion.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaWelcome.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/XMPMessageBodyWriter.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/ZipWriter.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/MetadataResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceOffTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaDetectorsTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaParsersTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaVersionTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaWelcomeTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java




[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329438#comment-14329438
 ] 

Tyler Palsulich commented on TIKA-1555:
---

You can also disable OCR by setting the Tesseract path to "" in the 
[TesseractOCRConfig|https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java].



[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread David Pilato (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329446#comment-14329446
 ] 

David Pilato commented on TIKA-1555:


I read the code, and it looks to me like that is the default value. The 
executable name is then appended to this path.



[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread David Pilato (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329452#comment-14329452
 ] 

David Pilato commented on TIKA-1555:


Well, I could try, but so far I have not managed to reproduce it 100% of the 
time. I need to think about it and understand what is wrong with my test config.
Sadly, when I hit the issue I could not see the Locale and the other settings 
in use. 

But I will for sure.



[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329470#comment-14329470
 ] 

Tyler Palsulich commented on TIKA-1555:
---

My mistake. Please see [this 
test|https://github.com/apache/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java#L56].
 Try setting the path to gibberish.



[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329474#comment-14329474
 ] 

Uwe Schindler commented on TIKA-1555:
-

bq. You can also disable OCR by setting the Tesseract path to "" in the 
TesseractOCRConfig.

This did not work. If this disabled the fork I would be happy. But it just 
disables the parser as a side effect, because it tries to fork an invalid 
process path which is built from the empty string and some suffix.
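
The side effect described above can be sketched with plain JDK calls. This is an illustration, not the ExternalParser code: the class ForkCheckSketch and the method canLaunch are hypothetical. A probe like this still goes through ProcessBuilder.start(), so a bad path only makes the launch fail; it does not stop the launch from being attempted.

```java
import java.io.IOException;

// Hypothetical probe, NOT the Tika implementation: reports whether an
// external command can be started at all.
class ForkCheckSketch {

    static boolean canLaunch(String... command) {
        try {
            // The launch attempt happens here, whatever the path looks like.
            Process process = new ProcessBuilder(command).start();
            process.destroy();
            return true;
        } catch (IOException e) {
            // A missing or invalid executable surfaces as an IOException.
            return false;
        }
    }
}
```

Note that a java.lang.Error such as the posix_spawn one in this issue is not an IOException, so it would escape a check written this way.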



[jira] [Created] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-20 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1557:
-

 Summary: Create TesseractOCR Option to Never Run
 Key: TIKA-1557
 URL: https://issues.apache.org/jira/browse/TIKA-1557
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor
 Fix For: 1.8


As brought up in TIKA-1555, TesseractOCRParser should have an option to never 
be run. So, we can add an {{enabled}} option to the Config.
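
As a sketch of what such an option could look like (hypothetical names throughout; this is not the TesseractOCRConfig API), the availability check consults the flag first and never reaches the process probe when OCR is disabled:

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, NOT the Tika implementation of TIKA-1557.
class OcrEnabledSketch {

    // Stand-in for a config object carrying an "enabled" option.
    static class Config {
        private boolean enabled = true;

        boolean isEnabled() { return enabled; }

        void setEnabled(boolean enabled) { this.enabled = enabled; }
    }

    // The probe stands for whatever check would fork the external process.
    static boolean hasOcr(Config config, BooleanSupplier probe) {
        if (!config.isEnabled()) {
            // Short-circuit: no process is started when OCR is switched off.
            return false;
        }
        return probe.getAsBoolean();
    }

    public static void main(String[] args) {
        Config config = new Config();
        config.setEnabled(false);
        // The probe throws if it is ever called, which it should not be.
        System.out.println(hasOcr(config, () -> {
            throw new IllegalStateException("probe must not run");
        }));
    }
}
```

Checking the flag before the probe is what separates this from the invalid-path workaround, which still attempts the fork.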





[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform

2015-02-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329486#comment-14329486
 ] 

Tyler Palsulich commented on TIKA-1555:
---

[~thetaphi], that's true. Please see TIKA-1557.
