workaround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

Uwe Schindler (JIRA) Thu, 22 Jan 2015 15:27:57 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288444#comment-14288444
 ]


Uwe Schindler commented on TIKA-1526:
-------------------------------------

Hi Tylor: The problem is explained above. To replicate it you have to be 
careful: the original error happens *exactly once*. All later attempts to 
execute and fork a process in the same JVM will also fail, but with a 
NoClassDefFoundError on the UNIXProcess class instead. Unfortunately I am 
very tired at the moment, it is past midnight.

The main problem is that all other ExternalParserTests will/may fail afterwards 
in the same JVM if the Turkish locale is used.
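
The "exactly once" behaviour is how the JVM treats any class whose static 
initializer throws, so it can be shown without forking a process at all. A 
minimal sketch, assuming a hypothetical class Fragile as a stand-in for 
UNIXProcess (the names here are illustrative only and come from neither the 
JDK nor Tika):

```java
class InitFailureDemo {

    // Hypothetical stand-in for a JDK class whose static initializer fails.
    static class Fragile {
        static {
            // "if (true)" lets the compiler accept an initializer that always throws.
            if (true) {
                throw new Error("simulated failure in static initializer");
            }
        }

        static void touch() {
            // Calling any static method forces class initialization.
        }
    }

    // Tries to use Fragile and reports the simple name of whatever was thrown.
    static String attempt() {
        try {
            Fragile.touch();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The first use surfaces the original Error from the initializer.
        System.out.println("first attempt:  " + attempt());
        // Every later use in the same JVM sees NoClassDefFoundError instead.
        System.out.println("second attempt: " + attempt());
    }
}
```

This is why a reproduction needs a fresh JVM for every run: once the class is 
marked as failed, the original message is not shown again.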

The commit will fix the issue we see in Solr, but the original issue may still 
survive if you really try to use ExternalParser for other tests. For which 
other parsers is it currently used? Only for Tesseract, or for others as well? 
In Solr we have the problem because the TesseractParser fails during 
initialization (when it determines which MIME types it is responsible for), 
and that's the fatal problem. I have no idea about other parsers; if they just 
fail while parsing, I don't care. The big problem is the Tesseract parser, 
which fails in the Turkish locale and blocks other parsers from executing, 
because the call to getSupportedTypes() fails [and that's the horrible thing 
in this bug].
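
To illustrate the kind of guard that keeps an availability check from taking 
down everything else, here is a minimal sketch. It does not use Tika's API: 
SafeCheck, isLocaleForkBug and canRun are hypothetical names, and the message 
test is borrowed from the Solr snippet quoted in the issue below. The 
assumption is that a parser would report no supported types when canRun 
returns false.

```java
import java.io.IOException;
import java.io.InputStream;

class SafeCheck {

    // True if the Error looks like the JVM locale bug (same test as the Solr snippet).
    static boolean isLocaleForkBug(Error err) {
        String msg = err.getMessage();
        return msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Hypothetical availability check: true only if the command runs and exits with 0.
    static boolean canRun(String... command) {
        try {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                while (in.read() != -1) {
                    // discard
                }
            }
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleForkBug(err)) {
                return false; // treat the external tool as unavailable
            }
            throw err; // anything else is not ours to swallow
        }
    }
}
```

Because the later NoClassDefFoundError is also an Error whose message names 
UNIXProcess, the same test covers both the first failure and the follow-up 
failures in the same JVM.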

So basically, to reproduce: choose exactly one test that you know fails and 
try it with and without the patch. Don't run other tests that may spawn 
processes in the same JVM.

> ExternalParser should trap/ignore/workaround JDK-8047340 & JDK-8055301 so 
> Turkish Tika users can still use non-external parsers
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1526
>                 URL: https://issues.apache.org/jira/browse/TIKA-1526
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Hoss Man
>
> The JDK has numerous pain points regarding the Turkish locale, "posix_spawn" 
> lowercasing being one of them...
> https://bugs.openjdk.java.net/browse/JDK-8047340
> https://bugs.openjdk.java.net/browse/JDK-8055301
> As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is 
> enabled & configured by default in Tika, and uses ExternalParser.check to see 
> if tesseract is available -- but because of the JDK bug, this means that Tika 
> fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like 
> so...
> {noformat}
>   [junit4]    > Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
>   [junit4]    >       at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
>   [junit4]    >       at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
>   [junit4]    >       at java.security.AccessController.doPrivileged(Native Method)
>   [junit4]    >       at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
>   [junit4]    >       at java.lang.ProcessImpl.start(ProcessImpl.java:130)
>   [junit4]    >       at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   [junit4]    >       at java.lang.Runtime.exec(Runtime.java:620)
>   [junit4]    >       at java.lang.Runtime.exec(Runtime.java:485)
>   [junit4]    >       at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
>   [junit4]    >       at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
>   [junit4]    >       at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
>   [junit4]    >       at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
>   [junit4]    >       at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
>   [junit4]    >       at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
>   [junit4]    >       at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
>   [junit4]    >       at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
>   [junit4]    >       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   [junit4]    >       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> {noformat}
> ...unless they go out of their way to whitelist only the parsers they 
> need/want, so that TesseractOCRParser (and any other ExternalParsers) will 
> never even be check()ed.
> It would be nice if Tika's ExternalParser class added a similar 
> hack/workaround to what was done in SOLR-6387 to trap these types of errors. 
> In Solr we just propagate a better error explaining why Java hates the 
> Turkish language...
> {code}
> } catch (Error err) {
>   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
>     log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
>     return "(error executing: " + cmd + ")";
>   }
> }
> {code}
> ...but with Tika, it might be better for all ExternalParsers to just "opt 
> out" as if they don't recognize the file type when they detect this type of 
> error from the check method (or perhaps it would be better if 
> AutoDetectParser handled this? ... I'm not really sure how it would best fit 
> into Tika's architecture)
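
For readers unfamiliar with the Turkish locale problem mentioned in the quoted 
issue: in that locale the letter "i" upper-cases to a dotted capital "İ" 
(U+0130), so a locale-sensitive case conversion of "posix_spawn" no longer 
matches the ASCII name. The sketch below shows that general effect only. The 
Mechanism enum and lookup method are hypothetical, and whether the JDK code 
fails in exactly this way is an assumption, not something this thread states.

```java
import java.util.Locale;

class TurkishCaseDemo {

    // Hypothetical enum standing in for a set of launch mechanism names.
    enum Mechanism { FORK, POSIX_SPAWN, VFORK }

    // Looks up a mechanism by upper-casing its name with the given locale.
    static Mechanism lookup(String name, Locale locale) {
        return Mechanism.valueOf(name.toUpperCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Locale-neutral conversion yields the plain ASCII name.
        System.out.println("posix_spawn".toUpperCase(Locale.ROOT));
        // Turkish conversion turns 'i' into U+0130 (dotted capital I).
        System.out.println("posix_spawn".toUpperCase(turkish));

        System.out.println(lookup("posix_spawn", Locale.ROOT));
        try {
            lookup("posix_spawn", turkish);
        } catch (IllegalArgumentException e) {
            // The Turkish spelling does not match any enum constant.
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

Passing Locale.ROOT (or Locale.ENGLISH) explicitly is the usual way to keep 
such conversions independent of the user's default locale.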



