On-prem, I run PySpark on Cloudera's distribution and have never had to
worry about dependency issues.  I install my libraries on the driver node
only (using pip or conda), run my jobs in yarn-client mode, and everything
works; I had just assumed the relevant libraries were copied over to each
executor node temporarily during execution.

But on EMR, I installed a library called fuzzywuzzy on the driver using pip,
then tried running this basic script in a "pyspark --master yarn-client" shell:

>>>
# Tiny repro: pull three rows from S3 and re-parallelize them.
mydata = sc.textFile("s3n://my_bucket/rum_20160331/*")
sample = mydata.take(3)
new_rdd = sc.parallelize(sample)

import fuzzywuzzy

choices = ['hello', 'xylophone', 'zebra']
# The lambda executes on the executors, so the import has to resolve there.
mapped_rdd = new_rdd.map(lambda row: str(fuzzywuzzy.process.extract(row, choices, limit=2)))
mapped_rdd.collect()
>>>

and I'm getting the error:

ImportError: ('No module named fuzzywuzzy', <function subimport at
0x7fa66610a938>, ('fuzzywuzzy',)) 

which makes me think I have to use --py-files for the first time ever and
resolve the dependencies manually.
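
If so, I assume the fix would look roughly like this: zip the package that
pip installed on the driver and ship the zip with the job.  This is only a
sketch (I haven't verified it on EMR); the /tmp path is illustrative, and it
relies on fuzzywuzzy being pure Python so it can be imported straight from a
zip:

>>>
# Sketch: archive the package from the driver's site-packages and
# distribute it; sc.addPyFile is the in-shell equivalent of
# spark-submit's --py-files flag.
import os, shutil, fuzzywuzzy

pkg_dir = os.path.dirname(fuzzywuzzy.__file__)          # .../site-packages/fuzzywuzzy
shutil.make_archive("/tmp/fuzzywuzzy", "zip",
                    root_dir=os.path.dirname(pkg_dir),  # .../site-packages
                    base_dir="fuzzywuzzy")              # archive just the package
sc.addPyFile("/tmp/fuzzywuzzy.zip")                     # lands on every executor's sys.path

from fuzzywuzzy import process  # the process submodule needs an explicit import
mapped_rdd = new_rdd.map(lambda row: str(process.extract(row, choices, limit=2)))
mapped_rdd.collect()
>>>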

Why does this happen?  How is it that, on the on-prem Cloudera cluster, the
Spark executors can access libraries I've installed only on the driver,
while on EMR they can't?
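
In case it helps with diagnosing the difference, this is the kind of probe
I'd run on both clusters to see what each executor's Python actually
resolves (a sketch; check_env is just an illustrative name):

>>>
import sys, socket

def check_env(_):
    # Runs inside an executor: report the host, the interpreter,
    # and where (or whether) fuzzywuzzy resolves there.
    try:
        import fuzzywuzzy
        location = fuzzywuzzy.__file__
    except ImportError:
        location = None
    yield (socket.gethostname(), sys.executable, location)

print(sc.parallelize(range(4), 4).mapPartitions(check_env).collect())
>>>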


