You probably just got lucky: the default Python distribution on your
CDH nodes has this library, but the EMR one doesn't. (CDH actually
ships an Anaconda distribution; not sure if you enabled that.) In
general, you need to make available on the cluster any dependencies
that your app does not supply itself.
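
As a rough sketch of one way to do that for a pure-Python library: zip the
package and hand the archive to Spark, either with SparkContext.addPyFile or
with spark-submit --py-files. The package name "mylib", the archive name
"deps.zip", the script name "job.py", and the zip_package helper are all
hypothetical, made up for illustration. This does not help with libraries
that have compiled extensions; those still have to be installed on every
node.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a pure-Python package so its top-level folder is importable from the archive."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to the parent directory so that
                    # "import mylib" resolves once the zip is on sys.path.
                    zf.write(full, os.path.relpath(full, parent))
    return zip_path


# Hypothetical usage on the driver (assumes an existing SparkContext `sc`):
#   zip_package("mylib", "deps.zip")
#   sc.addPyFile("deps.zip")
# or at submit time:
#   spark-submit --py-files deps.zip job.py
```

Either route puts the archive on the Python path of the executors, so the
import no longer depends on what happens to be installed on each node.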
On Fri, Apr 8,
On-prem I'm running PySpark on Cloudera's distribution, and I've never had to
worry about dependency issues. I install my libraries on my driver node only,
using pip or conda, run my jobs in yarn-client mode, and everything works (I
just assumed the relevant libraries are copied temporarily to each