On-prem, I'm running PySpark on Cloudera's distribution, and I've never had to worry about dependency issues. I install my libraries on the driver node only, using pip or conda, run my jobs in yarn-client mode, and everything works (I had just assumed the relevant libraries were copied temporarily to each executor node during execution).
But on EMR, I installed a library called fuzzywuzzy on the driver using pip, then tried running this basic script in "pyspark --master yarn-client" mode:

    mydata = sc.textFile("s3n://my_bucket/rum_20160331/*")
    sample = mydata.take(3)
    new_rdd = sc.parallelize(sample)

    import fuzzywuzzy.process

    choices = ['hello', 'xylophone', 'zebra']
    mapped_rdd = new_rdd.map(lambda row: str(fuzzywuzzy.process.extract(row, choices, limit=2)))
    mapped_rdd.collect()

and I'm getting this error:

    ImportError: ('No module named fuzzywuzzy', <function subimport at 0x7fa66610a938>, ('fuzzywuzzy',))

This makes me think I have to use --py-files for the first time ever and resolve the dependencies manually. Why does this happen? How is it that, on the on-prem Cloudera distribution, the Spark executor nodes can access libraries I've installed only on the driver, but on EMR they can't?
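In case it clarifies what I mean by resolving dependencies manually, this is roughly the workaround I expect I'd need. It's just a sketch; the zip path and the packaging step are my own assumptions, not something I've actually run:

    # Ship the driver's copy of fuzzywuzzy to the executors.
    # Assumes the package was zipped up from the driver's site-packages
    # beforehand, e.g. (hypothetical path):
    #   zip -r /tmp/fuzzywuzzy.zip fuzzywuzzy   # run from the site-packages dir
    sc.addPyFile("/tmp/fuzzywuzzy.zip")  # Spark distributes this to every executor

    import fuzzywuzzy.process

    choices = ['hello', 'xylophone', 'zebra']
    mapped_rdd = new_rdd.map(lambda row: str(fuzzywuzzy.process.extract(row, choices, limit=2)))
    mapped_rdd.collect()

(Or, equivalently I believe, passing --py-files /tmp/fuzzywuzzy.zip on the pyspark command line.) Is this really necessary on EMR, and if so, why isn't it on Cloudera?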