[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-2113:
----------------------------------
    Labels: pull-request-available  (was: )

[Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath-setting HDFS logic
------------------------------------------------------------------------------------------------------

                 Key: ARROW-2113
                 URL: https://issues.apache.org/jira/browse/ARROW-2113
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
         Environment: Linux Red Hat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1
            Reporter: Michal Danko
            Priority: Major
              Labels: pull-request-available
             Fix For: 0.13.0

Steps to reproduce the issue:

    mkdir /tmp/test
    cd /tmp/test
    mkdir jars
    cd jars
    touch test1.jar
    mkdir -p ../lib/zookeeper
    cd ../lib/zookeeper
    ln -s ../../jars/test1.jar ./test1.jar
    ln -s test1.jar test.jar
    mkdir -p ../hadoop/lib
    cd ../hadoop/lib
    ln -s ../../../lib/zookeeper/test.jar ./test.jar

(This part depends on your configuration; you need the following values for pyarrow.hdfs to work.)

Path to libjvm:

    export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

Path to libhdfs:

    export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/

Then:

    export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
    python
    >>> import pyarrow.hdfs as hdfs
    >>> fs = hdfs.connect(user="hdfs")

ends with this error:

    loadFileSystems error:
    (unable to get root cause for java.lang.NoClassDefFoundError)
    (unable to get stack trace for java.lang.NoClassDefFoundError)
    hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error:
    (unable to get root cause for java.lang.NoClassDefFoundError)
    (unable to get stack trace for java.lang.NoClassDefFoundError)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
        kerb_ticket=kerb_ticket, driver=driver)
      File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__
        self._connect(host, port, user, kerb_ticket, driver)
      File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
      File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
    pyarrow.lib.ArrowIOError: HDFS connection failed

whereas

    export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
    python
    >>> import pyarrow.hdfs as hdfs
    >>> fs = hdfs.connect(user="hdfs")

works properly.

I can't find a reason why the first CLASSPATH doesn't work while the second one does, because it is a path to the same .jar, just with an extra symlink in it. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined through many ../../.. components. I would expect pyarrow to work with any way of spelling the path to a .jar.

Please note that the paths are not generated at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow library in Oozie workflows.
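For context on the issue title: the two reproductions above differ in exactly one observable way, namely whether the string "hadoop" occurs anywhere in CLASSPATH. That matches a substring test in pyarrow's HDFS connection helper. The following is a minimal Python sketch of that logic (an approximation of what pyarrow 0.8 does in pyarrow/hdfs.py, not the exact source): CLASSPATH is only populated from the output of `hadoop classpath --glob` when the existing value does not already mention "hadoop", so a single jar whose path merely passes through a hadoop/ directory is mistaken for a complete Hadoop classpath.

    import os
    import subprocess

    def _maybe_set_hadoop_classpath():
        # The suspect check: a plain substring test. A CLASSPATH such as
        # "/tmp/test/lib/hadoop/lib/test.jar" contains "hadoop" and is
        # therefore assumed to be a complete Hadoop classpath, so the
        # lookup below is skipped -- the JVM never sees the real Hadoop
        # jars, hence the java.lang.NoClassDefFoundError above.
        if 'hadoop' in os.environ.get('CLASSPATH', ''):
            return

        # "/tmp/test/lib/zookeeper/test.jar" does not contain "hadoop",
        # so the full, globbed classpath is fetched from the hadoop CLI
        # and the connection succeeds.
        if 'HADOOP_HOME' in os.environ:
            hadoop_bin = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'hadoop')
        else:
            hadoop_bin = 'hadoop'
        classpath = subprocess.check_output([hadoop_bin, 'classpath', '--glob'])
        os.environ['CLASSPATH'] = classpath.decode('utf-8')

On this reading the extra symlink is a red herring: the symlink chain is never inspected, only the CLASSPATH string itself, which is what the issue title describes.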
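A workaround consistent with the sketch above (an assumption on my part, not a verified fix): export a complete Hadoop classpath before connecting, so that the substring check is satisfied by a CLASSPATH that really does contain the Hadoop jars. For example, from Python:

    import os
    import subprocess

    # Assumed workaround: populate CLASSPATH with the full output of
    # `hadoop classpath --glob` before pyarrow inspects it.
    os.environ['CLASSPATH'] = subprocess.check_output(
        ['hadoop', 'classpath', '--glob']).decode('utf-8')

    import pyarrow.hdfs as hdfs
    fs = hdfs.connect(user='hdfs')

The same effect can be had in the shell with `export CLASSPATH=$(hadoop classpath --glob)` before launching the interpreter.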