On an on-premise Cloudera Hadoop 5.7.2 cluster I have installed the Anaconda parcel and am trying to set up Jupyter notebook to work with Spark 1.6.

 

I have run into problems when trying to use the package com.databricks:spark-csv_2.10:1.4.0 to read a CSV file and infer its schema from PySpark.
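For context, the usage I am aiming for is roughly the following (a sketch against the Spark 1.6 / spark-csv API; the CSV path is a placeholder, and `sc` is the SparkContext a pyspark shell or notebook already provides):

```python
# Sketch: read a CSV with header and schema inference via spark-csv on Spark 1.6.
# Assumes `sc` (SparkContext) already exists, as in a pyspark shell/notebook.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        # first line holds column names
      .option("inferSchema", "true")   # scan the data to infer column types
      .load("/path/to/data.csv"))      # placeholder path

df.printSchema()
```

This is the call that fails when the spark-csv classes are not on the classpath.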

 

I have placed the jar file spark-csv_2.10-1.4.0.jar in /var/opt/teradata/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jar and set the configuration as follows:

 

export PYSPARK_DRIVER_PYTHON=/var/opt/teradata/cloudera/parcels/Anaconda-4.0.0/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8083"
export PYSPARK_PYTHON=/var/opt/teradata/cloudera/parcels/Anaconda-4.0.0/bin/python
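One thing that sometimes matters when the driver is a notebook rather than the plain shell: Spark 1.x also reads extra submit flags from PYSPARK_SUBMIT_ARGS. As an untested sketch alongside the exports above:

```shell
# Pass the --packages flag to the notebook-launched driver as well.
# The trailing "pyspark-shell" token is required on Spark 1.4 and later.
export PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
```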

 

When I run pyspark from the command line with the --packages option:

 

$ pyspark --packages com.databricks:spark-csv_2.10:1.4.0

 

it throws an error and fails to recognize the added dependency.
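Since the jar is already on the node, one workaround worth trying is to bypass --packages (which resolves the artifact from Maven repositories, often unreachable on on-premise clusters) and pass local jars explicitly. Note that spark-csv 1.4.0 also needs its transitive dependencies on the classpath; the commons-csv and univocity-parsers paths and versions below are illustrative assumptions to adjust to the actual install:

```shell
# Sketch: pass local jars directly instead of resolving via --packages.
# The dependency jar paths/versions are assumptions, not verified locations.
$ pyspark --jars /var/opt/teradata/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jar/spark-csv_2.10-1.4.0.jar,/path/to/commons-csv-1.1.jar,/path/to/univocity-parsers-1.5.1.jar
```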

 

Any ideas on how to resolve this error are much appreciated.

 

Also, if you have experience installing and running Jupyter notebook with Anaconda and Spark, please share.

 

thanks,

Muby

 

 
