[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

Ken Krugler (JIRA) Sat, 19 May 2018 16:17:33 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481791#comment-16481791 ]


Ken Krugler commented on TIKA-2643:
-----------------------------------

Looking at the crash log, I see the following duplicate jars with different 
versions:

asm 3.2
asm 5.0.4

commons-codec 1.10
commons-codec 1.4

commons-compress 1.15
commons-compress 1.4.1

commons-httpclient 3.0.1
commons-httpclient 3.1

commons-io 2.4
commons-io 2.5

curator-client 2.6.0
curator-client 2.7.1

curator-framework 2.6.0
curator-framework 2.7.1

curator-recipes 2.6.0
curator-recipes 2.7.1

gson 2.2.4
gson 2.8.1

guava 11.0.2
guava 14.0.1

hamcrest-core 1.1
hamcrest-core 1.3

httpclient 4.2.5
httpclient 4.5.3

httpcore 4.2.5
httpcore 4.4.6

jackson-annotations 2.2.2
jackson-annotations 2.2.3

jackson-core 2.2.2
jackson-core 2.2.3
jackson-core 2.8.9

jackson-databind 2.2.2
jackson-databind 2.2.3

jackson-jaxrs 1.8.8
jackson-jaxrs 1.9.2

jackson-xc 1.8.8
jackson-xc 1.9.2

And so on...
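
A list like the one above can be produced mechanically. The following is a minimal sketch, not something taken from the issue: the class name DuplicateJarFinder and the file-name pattern are assumptions, and the pattern only handles jars named in the usual artifact-version.jar form. It groups the jar names found on java.class.path by artifact and reports every artifact that appears with more than one version. Wildcard classpath entries (such as lib/*) are not expanded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports artifacts present with more than one version.
public class DuplicateJarFinder {

    // Assumed naming convention: <artifact>-<version>.jar, where the version
    // starts with a digit (e.g. commons-codec-1.10.jar).
    private static final Pattern JAR_NAME =
            Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    // Maps artifact name to its set of versions, keeping only artifacts
    // that occur with at least two distinct versions.
    public static Map<String, Set<String>> findConflicts(List<String> jarFileNames) {
        Map<String, Set<String>> versionsByArtifact = new TreeMap<>();
        for (String name : jarFileNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                versionsByArtifact
                        .computeIfAbsent(m.group(1), k -> new TreeSet<>())
                        .add(m.group(2));
            }
        }
        versionsByArtifact.values().removeIf(versions -> versions.size() < 2);
        return versionsByArtifact;
    }

    public static void main(String[] args) {
        // Inspect the classpath of the running JVM, using file names only.
        String classpath = System.getProperty("java.class.path");
        List<String> names = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            names.add(new File(entry).getName());
        }
        findConflicts(names).forEach(
                (artifact, versions) -> System.out.println(artifact + " " + versions));
    }
}
```

Running it inside the task JVM, rather than on the client, is what would show the classpath the job actually sees.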

I'm hoping these duplicates aren't present in a generic Cloudera installation. 
If they aren't, you need to manage your Hadoop job jar to ensure it doesn't 
introduce duplicates. That in turn can lead to issues when CDH ships an older 
version of a jar than what you need for your own code or Tika.

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, hs_err_pid32104.log, 
> testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to 
> process a pdf file from Cloudera Hadoop 5.8 (observed on 5.4 too). It can 
> process some other pdf files on the same cluster. I am attaching the file and 
> the syslog as well as stdout logs. Interesting that the same file can be 
> processed fine over a Hortonworks cluster. 
> This issue is a blocker for us to make our feature based on Tika available to 
> Cloudera cluster, a major flavor of Hadoop, so your timely attention would be 
> very much appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
