Re: Why do I need to handle dependencies on EMR but not on-prem Hadoop?

2016-04-08 Thread Sean Owen
You probably just got lucky, and the default Python distribution on your CDH nodes has this library but the EMR one doesn't. (CDH actually has an Anaconda distribution, not sure if you enabled that.) In general you need to make dependencies available that your app does not supply. On Fri, Apr 8,

Why do I need to handle dependencies on EMR but not on-prem Hadoop?

2016-04-08 Thread YaoPau
On-prem I'm running PySpark on Cloudera's distribution, and I've never had to worry about dependency issues. I import my libraries on my driver node only using pip or conda, run my jobs in yarn-client mode, and everything works (I just assumed the relevant libraries are copied temporarily to each