[ https://issues.apache.org/jira/browse/ARROW-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167943#comment-17167943 ]
Michael Peleshenko commented on ARROW-5236:
-------------------------------------------

I've been having trouble connecting to HDFS even with the 1.0.0 pyarrow build; I run into the error below when calling:
{code:python}
pa.hdfs.connect(host="host", port=port, user="user", kerb_ticket="kerb_ticket")
{code}
{noformat}
  File "C:\ProgramData\Continuum\Anaconda\envs\pyarrow-test\lib\site-packages\pyarrow\hdfs.py", line 210, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Continuum\Anaconda\envs\pyarrow-test\lib\site-packages\pyarrow\hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 75, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status
OSError: Unable to load libjvm: The specified module could not be found.
{noformat}
I tried the workaround mentioned [here|https://issues.apache.org/jira/browse/ARROW-5236?focusedCommentId=17106888&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17106888] and got it working by copying jvm.dll to %JAVA_HOME%\lib\server\libjvm.so. It seems the logic to find libjvm follows the Linux code path for some reason. Looking into the Arrow internals, I came across this:

https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/cpp/src/arrow/io/hdfs_internal.cc#L176-L180

This looks like the same issue observed in ARROW-1003, except that one was about libhdfs. In my case, libhdfs is found as hdfs.dll as expected, so the Windows logic is definitely followed there:

https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/cpp/src/arrow/io/hdfs_internal.cc#L144-L145

I suspect a similar fix is needed here: changing `__WIN32` to `_WIN32`.
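
For what it's worth, this is the classic misspelled-platform-macro trap: MSVC predefines `_WIN32` (one leading underscore) but not `__WIN32`, so an `#ifdef __WIN32` branch is never taken and the code silently falls through to the POSIX fallback. A minimal sketch of the pattern, assuming a filename-selection helper along the lines of the linked hdfs_internal.cc code (the function name is illustrative, not the actual Arrow symbol):
{code:cpp}
#include <string>

// Illustrative sketch only, not the actual Arrow source: selects the JVM
// shared-library name for the current platform. With the suspected typo
// `#ifdef __WIN32`, MSVC (which defines only _WIN32) never takes the
// Windows branch, so the loader searches for "libjvm.so" and fails with
// the "Unable to load libjvm" error shown above.
static std::string jvm_library_name() {  // hypothetical helper name
#ifdef _WIN32  // correct spelling; the suspected bug is `#ifdef __WIN32`
  return "jvm.dll";
#elif defined(__APPLE__)
  return "libjvm.dylib";
#else
  return "libjvm.so";
#endif
}
{code}
If that is the cause, it would also explain the asymmetry above: the libhdfs lookup uses the correctly spelled `_WIN32`, so hdfs.dll is found while libjvm is not.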

> [Python] hdfs.connect() is trying to load libjvm in windows
> -----------------------------------------------------------
>
>                 Key: ARROW-5236
>                 URL: https://issues.apache.org/jira/browse/ARROW-5236
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: Windows 7 Enterprise, pyarrow 0.13.0
>            Reporter: Kamaraju
>            Priority: Major
>              Labels: hdfs
>
> This issue was originally reported at [https://github.com/apache/arrow/issues/4215]. Raising a Jira as per Wes McKinney's request.
>
> Summary:
> The following script
> {code}
> $ cat expt2.py
> import pyarrow as pa
> fs = pa.hdfs.connect()
> {code}
> tries to load libjvm on Windows 7, which is not expected.
> {noformat}
> $ python ./expt2.py
> Traceback (most recent call last):
>   File "./expt2.py", line 3, in <module>
>     fs = pa.hdfs.connect()
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
>     extra_conf=extra_conf)
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unable to load libjvm
> {noformat}
> There is no libjvm file in a Windows Java installation:
> {noformat}
> $ echo $JAVA_HOME
> C:\Progra~1\Java\jdk1.8.0_141
> $ find $JAVA_HOME -iname '*libjvm*'
> <returns nothing.>
> {noformat}
> I see the libjvm error with both the 0.11.1 and 0.13.0 versions of pyarrow.
>
> Steps to reproduce the issue (with more details):
>
> Create the environment:
> {noformat}
> $ cat scratch_py36_pyarrow.yml
> name: scratch_py36_pyarrow
> channels:
> - defaults
> dependencies:
> - python=3.6.8
> - pyarrow
> {noformat}
> {noformat}
> $ conda env create -f scratch_py36_pyarrow.yml
> {noformat}
> Apply the following patch to lib/site-packages/pyarrow/hdfs.py. I had to do this because the Hadoop installation that comes with the MapR ([https://mapr.com/]) Windows client only has $HADOOP_HOME/bin/hadoop.cmd; there is no file named $HADOOP_HOME/bin/hadoop, so the subsequent subprocess.check_output call fails with FileNotFoundError if this patch is not applied.
> {noformat}
> $ cat ~/x/patch.txt
> 131c131
> < hadoop_bin = '{0}/bin/hadoop'.format(os.environ['HADOOP_HOME'])
> ---
> > hadoop_bin = '{0}/bin/hadoop.cmd'.format(os.environ['HADOOP_HOME'])
> $ patch /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py ~/x/patch.txt
> patching file /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py
> {noformat}
> Activate the environment:
> {noformat}
> $ source activate scratch_py36_pyarrow
> {noformat}
> Sample script:
> {noformat}
> $ cat expt2.py
> import pyarrow as pa
> fs = pa.hdfs.connect()
> {noformat}
> Execute the script:
> {noformat}
> $ python ./expt2.py
> Traceback (most recent call last):
>   File "./expt2.py", line 3, in <module>
>     fs = pa.hdfs.connect()
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
>     extra_conf=extra_conf)
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unable to load libjvm
> {noformat}


--
This message was sent by Atlassian Jira
(v8.3.4#803005)