[ https://issues.apache.org/jira/browse/ARROW-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167943#comment-17167943 ]

Michael Peleshenko commented on ARROW-5236:
-------------------------------------------

I've been having trouble connecting to HDFS even with the 1.0.0 pyarrow build; I still hit the error below when running:

{code:python}
pa.hdfs.connect(host="host", port=port, user="user", kerb_ticket="kerb_ticket")
{code}

{noformat}
  File "C:\ProgramData\Continuum\Anaconda\envs\pyarrow-test\lib\site-packages\pyarrow\hdfs.py", line 210, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Continuum\Anaconda\envs\pyarrow-test\lib\site-packages\pyarrow\hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, extra_conf)
  File "pyarrow\io-hdfs.pxi", line 75, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status
OSError: Unable to load libjvm: The specified module could not be found.
{noformat}

I tried the workaround mentioned [here|https://issues.apache.org/jira/browse/ARROW-5236?focusedCommentId=17106888&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17106888] and got it working by copying jvm.dll to %JAVA_HOME%\lib\server\libjvm.so. It seems the logic that locates libjvm follows the Linux search path on Windows for some reason.
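For reference, here is a minimal Python sketch of that workaround (the jre\bin\server source location assumes a JDK 8 layout; adjust the path for other JDKs):

{code:python}
import os
import shutil

# Workaround sketch: the (mistaken) non-Windows search logic looks for
# %JAVA_HOME%\lib\server\libjvm.so, so place a copy of jvm.dll there
# under that name.
java_home = os.environ["JAVA_HOME"]
src = os.path.join(java_home, "jre", "bin", "server", "jvm.dll")  # JDK 8 layout (assumption)
dst_dir = os.path.join(java_home, "lib", "server")
os.makedirs(dst_dir, exist_ok=True)
shutil.copyfile(src, os.path.join(dst_dir, "libjvm.so"))
{code}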

Looking into the Arrow internals, I came across this:
https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/cpp/src/arrow/io/hdfs_internal.cc#L176-L180

This looks like the same issue observed in ARROW-1003, except that one was for libhdfs. In my case, libhdfs is found as expected (as hdfs.dll), so the Windows logic is definitely followed there:
https://github.com/apache/arrow/blob/b0d623957db820de4f1ff0a5ebd3e888194a48f0/cpp/src/arrow/io/hdfs_internal.cc#L144-L145

I suspect a similar fix is needed here: changing `__WIN32` to `_WIN32` (MSVC predefines `_WIN32` but not `__WIN32`, so an `#ifdef __WIN32` branch is skipped when building with MSVC).
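To illustrate the failure mode, here is a rough Python analogue of that platform branch (the function name and search suffixes are illustrative, not the exact C++ values): with the macro misspelled, Windows falls through to the non-Windows case and the loader searches for libjvm.so, which never exists there.

{code:python}
import os
import sys

def potential_libjvm_paths():
    # Rough analogue of the platform branch in hdfs_internal.cc. The real
    # code selects this branch with a preprocessor macro; misspelling
    # _WIN32 as __WIN32 sends Windows down the 'else' path below, hence
    # "Unable to load libjvm: The specified module could not be found."
    java_home = os.environ.get("JAVA_HOME", "")
    if sys.platform == "win32":  # analogous to a correct #ifdef _WIN32
        suffixes = ["jre/bin/server", "bin/server"]
        file_name = "jvm.dll"
    else:
        suffixes = ["lib/server", "jre/lib/amd64/server"]
        file_name = "libjvm.so"
    return [os.path.join(java_home, s, file_name) for s in suffixes]
{code}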

> [Python] hdfs.connect() is trying to load libjvm in windows
> -----------------------------------------------------------
>
>                 Key: ARROW-5236
>                 URL: https://issues.apache.org/jira/browse/ARROW-5236
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: Windows 7 Enterprise, pyarrow 0.13.0
>            Reporter: Kamaraju
>            Priority: Major
>              Labels: hdfs
>
> This issue was originally reported at 
> [https://github.com/apache/arrow/issues/4215] . Raising a Jira as per Wes 
> McKinney's request.
> Summary:
>  The following script
> {code}
> $ cat expt2.py
> import pyarrow as pa
> fs = pa.hdfs.connect()
> {code}
> tries to load libjvm on Windows 7, which is not expected.
> {noformat}
> $ python ./expt2.py
> Traceback (most recent call last):
>   File "./expt2.py", line 3, in <module>
>     fs = pa.hdfs.connect()
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
>     extra_conf=extra_conf)
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unable to load libjvm
> {noformat}
> There is no libjvm file in a Windows Java installation.
> {noformat}
> $ echo $JAVA_HOME
> C:\Progra~1\Java\jdk1.8.0_141
> $ find $JAVA_HOME -iname '*libjvm*'
> <returns nothing.>
> {noformat}
> I see the libjvm error with both 0.11.1 and 0.13.0 versions of pyarrow.
> Steps to reproduce the issue (with more details):
> Create the environment
> {noformat}
> $ cat scratch_py36_pyarrow.yml
> name: scratch_py36_pyarrow
> channels:
>   - defaults
> dependencies:
>   - python=3.6.8
>   - pyarrow
> {noformat}
> {noformat}
> $ conda env create -f scratch_py36_pyarrow.yml
> {noformat}
> Apply the following patch to lib/site-packages/pyarrow/hdfs.py. I had to do 
> this since the Hadoop installation that comes with the MapR <[https://mapr.com/]> 
> Windows client only has $HADOOP_HOME/bin/hadoop.cmd. There is no file named 
> $HADOOP_HOME/bin/hadoop, so the subsequent subprocess.check_output call 
> fails with FileNotFoundError if this patch is not applied.
> {noformat}
> $ cat ~/x/patch.txt
> 131c131
> <         hadoop_bin = '{0}/bin/hadoop'.format(os.environ['HADOOP_HOME'])
> ---
> >         hadoop_bin = '{0}/bin/hadoop.cmd'.format(os.environ['HADOOP_HOME'])
> $ patch /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py ~/x/patch.txt
> patching file /c/ProgramData/Continuum/Anaconda/envs/scratch_py36_pyarrow/lib/site-packages/pyarrow/hdfs.py
> {noformat}
> Activate the environment
> {noformat}
> $ source activate scratch_py36_pyarrow
> {noformat}
> Sample script
> {noformat}
> $ cat expt2.py
> import pyarrow as pa
> fs = pa.hdfs.connect()
> {noformat}
> Execute the script
> {noformat}
> $ python ./expt2.py
> Traceback (most recent call last):
>   File "./expt2.py", line 3, in <module>
>     fs = pa.hdfs.connect()
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 183, in connect
>     extra_conf=extra_conf)
>   File "C:\ProgramData\Continuum\Anaconda\envs\scratch_py36_pyarrow\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow\io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unable to load libjvm
> {noformat}



