Please take a look at our setup instructions for Windows that we created a 
while back for our KDD Tutorial. You may need to download a newer version 
of winutils from https://github.com/steveloughran/winutils, and don't 
forget to fix the file permissions with winutils chmod.
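The chmod step can be sketched as follows, assuming (as in the attached instructions) that HADOOP_HOME points at the folder containing bin\winutils.exe and that Spark uses /tmp/hive on the current drive as its Hive scratch directory:

```shell
REM Sketch only: HADOOP_HOME and the /tmp/hive location are assumptions
REM taken from this tutorial's setup; adjust them to your environment.
%HADOOP_HOME%\bin\winutils.exe chmod -R 777 /tmp/hive

REM winutils can also list the resulting permissions:
%HADOOP_HOME%\bin\winutils.exe ls /tmp/hive
```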

The attached instructions are from 2017, so you need to adjust your 
versions.





From:   Janardhan <[email protected]>
To:     Niketan Pansare <[email protected]>, [email protected], 
Matthias Boehm <[email protected]>
Date:   03/05/2019 07:54 PM
Subject:        Hadoop is not working in dev environment on windows [since 
2.7.7 update]. Thanks.



Hi,

Since the 2.7.7 update, my Hadoop and winutils setup [no prebuilt
winutils.exe is available] is not working because of file permissions.

As a workaround I have changed hadoop source locally to bypass the access
check.

But is there any way one could run the tests without Hadoop? spark-submit
is working fine for me.

Thanks,
Janardhan




1. Java 
=======
The Java version should be 1.8 or later.

        > java -version

Set JAVA_HOME environment variable and include %JAVA_HOME%\bin in the 
environment variable PATH

        > dir "%JAVA_HOME%"
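For example, JAVA_HOME can be set from the command line (the JDK path below is an assumption; substitute your actual install directory):

```shell
REM Example path only -- point this at your installed JDK.
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_201"

REM setx does not update %JAVA_HOME% in the current session,
REM so spell out the bin path here rather than reusing the variable.
setx PATH "%PATH%;C:\Program Files\Java\jdk1.8.0_201\bin"

REM setx only affects new command prompts; open a fresh one, then:
java -version
```

The System Properties dialog works just as well and avoids setx's value-length limit.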

2. Spark
========
Download and extract Spark from https://spark.apache.org/downloads.html, 

        > tar -xzf spark-2.1.0-bin-hadoop2.7.tgz

and set environment variable SPARK_HOME to point to the extracted directory.
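Setting it from the command line is a one-liner (the path is an example; point it at wherever you extracted the archive):

```shell
REM Example path only -- use your actual extraction directory.
setx SPARK_HOME "C:\spark-2.1.0-bin-hadoop2.7"
```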
        
Next step, install winutils:

- Download winutils.exe from 
http://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/winutils.exe
  
- Place it in c:\winutils\bin
- Set environment variable HADOOP_HOME to point to c:\winutils
- Add c:\winutils\bin to the environment variable PATH.
- Modify the permissions of the /tmp/hive directory that will be used by Spark:

        > winutils.exe chmod 777 /tmp/hive

Finally, check if Spark is correctly installed:

        > %SPARK_HOME%\bin\spark-shell
        > %SPARK_HOME%\bin\pyspark      
        
3. Python and Jupyter
=====================
Download and install Anaconda Python 2.7 from 
https://www.continuum.io/downloads
(includes Jupyter and pip)


4. Libraries used in this tutorial
==================================

4.1 Graphviz
------------

To check if Graphviz is installed on your system,

        > dot --help

If you get an error, 
        
- Download and install 
http://www.graphviz.org/pub/graphviz/stable/windows/graphviz-2.38.msi
- Ensure that the C:\Program Files (x86)\Graphviz2.38\bin folder is 
added to the PATH environment variable.
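This can also be done from the command line (a sketch, assuming the default install directory; note that setx truncates values longer than 1024 characters, so a long PATH is safer to edit via the System Properties dialog):

```shell
REM Assumes the default Graphviz 2.38 install directory.
setx PATH "%PATH%;C:\Program Files (x86)\Graphviz2.38\bin"

REM Open a new command prompt, then verify:
dot -V
```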
        
5. Apache SystemML
==================
cd to the tutorial folder and install this version of Apache SystemML:

        > pip install ./systemml-1.0.0-SNAPSHOT-python.tgz

and start pyspark/Jupyter

        > set PYSPARK_DRIVER_PYTHON=jupyter
        > set PYSPARK_DRIVER_PYTHON_OPTS=notebook
        > %SPARK_HOME%\bin\pyspark --driver-memory 8g
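If the notebook does not come up, a quick way to confirm the pip install succeeded (assuming the package installs under the name systemml, as the archive name suggests):

```shell
REM Prints the installed package metadata, or an error if it is missing.
pip show systemml
```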
