So I'm running PySpark 1.3.1 on Amazon EMR on a fairly beefy cluster (20
nodes, each with 32 cores and 64 GB of memory), and my parallelism,
executor.instances, executor.cores, and executor memory settings are also
fairly reasonable (600, 20, 30, 48 GB).
However, my job invariably fails.
BTW, my spark.python.worker.reuse setting is set to "true".
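For reference, the settings described above would typically be passed at submission time along these lines — a sketch only; the script name and master URL are placeholders, and the flag values are taken from the figures quoted in the post:

```shell
# Placeholder app name; values (600, 20, 30, 48g) are the ones quoted above.
spark-submit \
  --master yarn \
  --num-executors 20 \
  --executor-cores 30 \
  --executor-memory 48g \
  --conf spark.default.parallelism=600 \
  --conf spark.python.worker.reuse=true \
  my_job.py
```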
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-failing-on-a-mid-sized-broadcast-tp25520p25521.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on
YARN? My YARN environment has the correct HADOOP_CONF_DIR setting, but the
configuration that I pull from sc.hadoopConfiguration() is incorrect.
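When submitting to YARN, Spark picks up HADOOP_CONF_DIR (and YARN_CONF_DIR) from the environment of the shell that runs spark-submit, so one thing to check is that it is exported there. A minimal sketch — the conf directory path is an assumption and varies by distribution:

```shell
# Illustrative path; on EMR/CDH/HDP the Hadoop conf dir may live elsewhere.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
spark-submit --master yarn my_job.py
```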
Apologies if this is something very obvious, but I've perused the Spark
Streaming guide and it still isn't evident to me. I have files with data of
the format timestamp,column1,column2,column3, etc., and I'd like to use
Spark Streaming's window operations on them.
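For what it's worth, a windowed job over such files might look roughly like the sketch below. The parsing helper is concrete; the streaming part is shown in comments because it needs a running StreamingContext, and the directory path, batch interval, and window/slide durations are all illustrative assumptions, not from the post:

```python
def parse_line(line):
    """Split a 'timestamp,col1,col2,...' record into (timestamp, values)."""
    fields = line.strip().split(",")
    return fields[0], fields[1:]

# With a StreamingContext `ssc` (say, a 10 s batch interval), the window
# operation from the streaming guide would apply like this:
#
#   lines = ssc.textFileStream("hdfs:///incoming/")   # hypothetical path
#   records = lines.map(parse_line)
#   # 60 s window sliding every 20 s; both must be multiples of the
#   # batch interval.
#   windowed = records.window(60, 20)
#   windowed.count().pprint()
```

Note that DStream windows are defined in processing time, not by the timestamp column inside the records, which may matter for this use case.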
However from what