Re: Why do I need to handle dependencies on EMR but not on-prem Hadoop?

2016-04-08 Thread Sean Owen
You probably just got lucky: the default Python distribution on your
CDH nodes happens to have this library, and the EMR one doesn't. (CDH
actually ships an Anaconda distribution; I'm not sure whether you
enabled it.) In general you need to make available any dependencies
that your app does not supply itself.
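For readers hitting this later: a minimal sketch of the mechanism that `--py-files` (or `sc.addPyFile`) relies on, namely putting a zip of pure-Python code on `sys.path` so executors can import it. Here `mydep` is a hypothetical stand-in for a package like fuzzywuzzy; on a real cluster you would build the zip once and pass it with `pyspark --master yarn-client --py-files deps.zip`.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny pure-Python package on disk ("mydep" is a hypothetical
# stand-in for a real dependency such as fuzzywuzzy).
workdir = tempfile.mkdtemp()
pkg_dir = os.path.join(workdir, "mydep")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VERSION = '0.1'\n")

# Zip it with paths relative to workdir so "import mydep" resolves.
zip_path = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    for root, _dirs, files in os.walk(pkg_dir):
        for name in files:
            full = os.path.join(root, name)
            zf.write(full, os.path.relpath(full, workdir))

# Executors do the equivalent of this for every --py-files entry:
sys.path.insert(0, zip_path)
import mydep
print(mydep.VERSION)  # prints 0.1
```

Note this only works for pure-Python packages: anything with compiled C extensions must actually be installed on every node, e.g. via an EMR bootstrap action that runs pip. Also, regardless of shipping, `fuzzywuzzy.process` is a submodule and needs an explicit `from fuzzywuzzy import process`.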

On Fri, Apr 8, 2016 at 8:28 AM, YaoPau  wrote:
> On-prem I'm running PySpark on Cloudera's distribution, and I've never had to
> worry about dependency issues.  I install my libraries on my driver node only,
> using pip or conda, run my jobs in yarn-client mode, and everything works (I
> just assumed the relevant libraries are copied temporarily to each executor
> node during execution).
>
> But on EMR, I installed a library called fuzzywuzzy on the driver using pip,
> then tried running this basic script in "pyspark --master yarn-client" mode:
>
>>>>
> mydata = sc.textFile("s3n://my_bucket/rum_20160331/*")
> sample = mydata.take(3)
> new_rdd = sc.parallelize(sample)
> import random
> import fuzzywuzzy
>
> choices = ['hello', 'xylophone', 'zebra']
> mapped_rdd = new_rdd.map(lambda row: str(fuzzywuzzy.process.extract(row, choices, limit=2)))
> mapped_rdd.collect()
>>>>
>
> and I'm getting the error:
>
> ImportError: ('No module named fuzzywuzzy', <function subimport at 0x7fa66610a938>, ('fuzzywuzzy',))
>
> which makes me think I have to use py-files for the first time ever, and
> resolve dependencies manually.
>
> Why does this happen?  How is it that, on the on-prem Cloudera version,
> Spark executor nodes are able to access all the libraries I've only
> installed on my driver, but on EMR they can't?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-do-I-need-to-handle-dependencies-on-EMR-but-not-on-prem-Hadoop-tp26712.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>




Why do I need to handle dependencies on EMR but not on-prem Hadoop?

2016-04-08 Thread YaoPau
On-prem I'm running PySpark on Cloudera's distribution, and I've never had to
worry about dependency issues.  I install my libraries on my driver node only,
using pip or conda, run my jobs in yarn-client mode, and everything works (I
just assumed the relevant libraries are copied temporarily to each executor
node during execution).

But on EMR, I installed a library called fuzzywuzzy on the driver using pip,
then tried running this basic script in "pyspark --master yarn-client" mode:

>>>
mydata = sc.textFile("s3n://my_bucket/rum_20160331/*")
sample = mydata.take(3)
new_rdd = sc.parallelize(sample)
import random
import fuzzywuzzy

choices = ['hello', 'xylophone', 'zebra']
mapped_rdd = new_rdd.map(lambda row: str(fuzzywuzzy.process.extract(row, choices, limit=2)))
mapped_rdd.collect()
>>>

and I'm getting the error:

ImportError: ('No module named fuzzywuzzy', <function subimport at 0x7fa66610a938>, ('fuzzywuzzy',))

which makes me think I have to use py-files for the first time ever, and
resolve dependencies manually.

Why does this happen?  How is it that, on the on-prem Cloudera version,
Spark executor nodes are able to access all the libraries I've only
installed on my driver, but on EMR they can't?
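
A quick way to confirm this diagnosis is to probe the executors' Python environments directly. The sketch below (with a hypothetical helper `has_module`) uses only the standard library, so the helper itself can be tested locally; the commented line shows how you would ship the same check to the executors.

```python
import importlib.util

def has_module(name):
    # True if this interpreter can import the named top-level module.
    return importlib.util.find_spec(name) is not None

# On a driver-only install, has_module("fuzzywuzzy") is True on the driver,
# but the same check mapped over an RDD returns False from each executor:
#   sc.parallelize(range(4), 4).map(lambda _: has_module("fuzzywuzzy")).collect()
print(has_module("zipfile"))              # stdlib, prints True
print(has_module("no_such_module_xyz"))   # prints False
```

If the collected list contains any False, the package is missing from at least one executor's Python, which is exactly the EMR situation described above.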



