You probably just got lucky: the default Python distribution on your
CDH nodes happens to include this library, while the one on EMR
doesn't. (CDH can actually ship an Anaconda distribution; I'm not sure
whether you enabled that.) In general, you need to make any
dependencies your application doesn't supply itself available on every
executor node, not just the driver.

On Fri, Apr 8, 2016 at 8:28 AM, YaoPau <jonrgr...@gmail.com> wrote:
> On-prem I'm running PySpark on Cloudera's distribution, and I've never had to
> worry about dependency issues.  I install my libraries on my driver node only,
> using pip or conda, run my jobs in yarn-client mode, and everything works (I
> had just assumed the relevant libraries were copied temporarily to each
> executor node during execution).
>
> But on EMR, I installed a library called fuzzywuzzy on the driver using pip,
> then tried running this basic script in "pyspark --master yarn-client" mode:
>
>>>>
> mydata = sc.textFile("s3n://my_bucket/rum_20160331/*")
> sample = mydata.take(3)
> new_rdd = sc.parallelize(sample)
> import random
> import fuzzywuzzy.process
>
> choices = ['hello', 'xylophone', 'zebra']
> mapped_rdd = new_rdd.map(lambda row: str(fuzzywuzzy.process.extract(row,
> choices, limit=2)))
> mapped_rdd.collect()
>>>>
>
> and I'm getting the error:
>
> ImportError: ('No module named fuzzywuzzy', <function subimport at
> 0x7fa66610a938>, ('fuzzywuzzy',))
>
> which makes me think I have to use --py-files for the first time ever and
> resolve dependencies manually.
>
> Why does this happen?  How is it that, on the on-prem Cloudera version,
> Spark executor nodes are able to access all the libraries I've only
> installed on my driver, but on EMR they can't?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-do-I-need-to-handle-dependencies-on-EMR-but-not-on-prem-Hadoop-tp26712.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>