[ 
https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2113:
--------------------------------
    Fix Version/s:     (was: 0.10.0)
                   0.11.0

> [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the 
> classpath setting HDFS logic
> -----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2113
>                 URL: https://issues.apache.org/jira/browse/ARROW-2113
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>         Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 
> 5.13.1
>            Reporter: Michal Danko
>            Priority: Major
>             Fix For: 0.11.0
>
>
> Steps to replicate the issue:
> mkdir /tmp/test
> cd /tmp/test
> mkdir jars
> cd jars
> touch test1.jar
> mkdir -p ../lib/zookeeper
> cd ../lib/zookeeper
> ln -s ../../jars/test1.jar ./test1.jar
> ln -s test1.jar test.jar
> mkdir -p ../hadoop/lib
> cd ../hadoop/lib
> ln -s ../../../lib/zookeeper/test.jar ./test.jar
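> The symlink chain above can be sanity-checked outside of pyarrow. A minimal
> Python sketch (recreating the same layout under a temporary directory rather
> than /tmp/test) shows that the deeply nested relative symlinks do resolve to
> the one real jar, so the file itself is reachable:

```python
import os
import tempfile

# Recreate the reporter's layout: jars/test1.jar, plus relative symlinks
#   lib/zookeeper/test1.jar -> ../../jars/test1.jar
#   lib/zookeeper/test.jar  -> test1.jar
#   lib/hadoop/lib/test.jar -> ../../../lib/zookeeper/test.jar
root = os.path.realpath(tempfile.mkdtemp())
os.makedirs(os.path.join(root, "jars"))
open(os.path.join(root, "jars", "test1.jar"), "w").close()
os.makedirs(os.path.join(root, "lib", "zookeeper"))
os.symlink("../../jars/test1.jar",
           os.path.join(root, "lib", "zookeeper", "test1.jar"))
os.symlink("test1.jar", os.path.join(root, "lib", "zookeeper", "test.jar"))
os.makedirs(os.path.join(root, "lib", "hadoop", "lib"))
os.symlink("../../../lib/zookeeper/test.jar",
           os.path.join(root, "lib", "hadoop", "lib", "test.jar"))

deep = os.path.join(root, "lib", "hadoop", "lib", "test.jar")
# The whole chain resolves to the single real file.
assert os.path.realpath(deep) == os.path.join(root, "jars", "test1.jar")
assert os.path.exists(deep)
```

> So the jar is readable through either path; only pyarrow's classpath handling
> treats the two paths differently.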
> (This part depends on your configuration; you need these values for
> pyarrow.hdfs to work:)
> (path to libjvm:)
> (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)
> (path to libhdfs:)
> (export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)
> export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
> python
>  import pyarrow.hdfs as hdfs;
>  fs = hdfs.connect(user="hdfs")
>  
> Ends with error:
> ------------
>  loadFileSystems error:
>  (unable to get root cause for java.lang.NoClassDefFoundError)
>  (unable to get stack trace for java.lang.NoClassDefFoundError)
>  hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, 
> kerbTicketCachePath=(NULL), userName=pa) error:
>  (unable to get root cause for java.lang.NoClassDefFoundError)
>  (unable to get stack trace for java.lang.NoClassDefFoundError)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
>     kerb_ticket=kerb_ticket, driver=driver)
>   File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__
>     self._connect(host, port, user, kerb_ticket, driver)
>   File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
>   File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
> pyarrow.lib.ArrowIOError: HDFS connection failed
>  -------------
>  
> export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
>  python
>  import pyarrow.hdfs as hdfs;
>  fs = hdfs.connect(user="hdfs")
>  
> Works properly.
>  
> I can't find a reason why the first CLASSPATH doesn't work while the second
> one does, because both are paths to the same .jar, just with an extra symlink
> in the chain. To me, it looks like pyarrow's classpath check has a problem
> with symlinks defined through many ../../.. segments.
> I would expect pyarrow to work with any valid path to the .jar.
> Please note that the paths are not made up at random; they are copied from
> the Cloudera distribution of Hadoop (the original file was zookeeper.jar).
> Because of this issue, our customer currently can't use the pyarrow library
> for Oozie workflows.
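> For what it's worth, here is a pattern consistent with the issue title (a
> sketch of a plausible heuristic, not the actual pyarrow source): if the
> classpath-setting logic merely tests whether the string "hadoop" occurs in
> CLASSPATH before deciding the classpath is complete, the failing path trips
> the test while the working one does not:

```python
def classpath_looks_complete(classpath):
    # Hypothetical version of the suspect check: a bare substring test.
    # Any jar under a .../hadoop/... directory satisfies it, so the logic
    # would skip building a full classpath (e.g. via `hadoop classpath`)
    # and leave the JVM without the HDFS classes -> NoClassDefFoundError.
    return "hadoop" in classpath

print(classpath_looks_complete("/tmp/test/lib/hadoop/lib/test.jar"))  # True: fooled
print(classpath_looks_complete("/tmp/test/lib/zookeeper/test.jar"))   # False: falls back
```

> This would explain why the same jar works through one path and fails through
> the other: the difference is the literal substring "hadoop", not the symlinks.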



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
