PySpark failing on a mid-sized broadcast

2015-11-30 Thread ameyc
So I'm running PySpark 1.3.1 on Amazon EMR on a fairly beefy cluster (20 nodes, each with 32 cores and 64 GB of memory), and my parallelism, executor.instances, executor.cores, and executor memory settings are also fairly reasonable (600, 20, 30, 48 GB). However, my job invariably fails
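
For reference, a minimal sketch of how those settings map onto a PySpark SparkConf (values taken from the post; the application name is hypothetical, and on YARN these generally have to be set before the SparkContext is created):

    from pyspark import SparkConf, SparkContext

    # Settings as described above.
    conf = (SparkConf()
            .setAppName("broadcast-job")  # hypothetical name
            .set("spark.default.parallelism", "600")
            .set("spark.executor.instances", "20")
            .set("spark.executor.cores", "30")
            .set("spark.executor.memory", "48g"))
    sc = SparkContext(conf=conf)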

Re: PySpark failing on a mid-sized broadcast

2015-11-30 Thread ameyc
BTW, my spark.python.worker.reuse setting is set to "true".
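
In case it helps reproduce this, a minimal sketch of setting and verifying that flag, using nothing beyond the stock PySpark API:

    from pyspark import SparkConf, SparkContext

    # Worker reuse keeps Python worker processes alive between tasks.
    conf = SparkConf().set("spark.python.worker.reuse", "true")
    sc = SparkContext(conf=conf)
    print(sc.getConf().get("spark.python.worker.reuse"))  # expect "true"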

HADOOP_CONF_DIR when running Spark on YARN

2014-10-31 Thread ameyc
How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on YARN? My YARN environment has the correct HADOOP_CONF_DIR setting, but the configuration that I pull from sc.hadoopConfiguration() is incorrect.
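
For concreteness, a minimal sketch of the setup (the path is hypothetical, and note that sc._jsc is PySpark's internal handle on the JavaSparkContext rather than a public API):

    import os
    # Must be set in the environment that launches the driver
    # (e.g. exported before spark-submit); hypothetical path.
    os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"

    from pyspark import SparkContext

    sc = SparkContext(appName="conf-check")
    # Pull the Hadoop Configuration through the JVM gateway and spot-check it.
    hconf = sc._jsc.hadoopConfiguration()
    print(hconf.get("fs.defaultFS"))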

Spark streaming on data at rest.

2014-10-16 Thread ameyc
Apologies if this is something very obvious, but I've perused the Spark Streaming guide and this still isn't evident to me. I have files with data in the format: timestamp,column1,column2,column3, etc., and I'd like to use Spark Streaming's window operations on them. However, from what
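
For concreteness, a minimal sketch of the kind of windowing I mean, using the stock PySpark Streaming API (the directory and durations are made up; note that textFileStream only sees files that appear in the directory after the context starts, which is exactly the trouble with data already at rest):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="window-sketch")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

    # Rows look like "timestamp,column1,column2,...".
    rows = ssc.textFileStream("hdfs:///data/incoming") \
              .map(lambda line: line.split(","))

    # 60-second window sliding every 30 seconds.
    windowed = rows.window(windowDuration=60, slideDuration=30)
    windowed.count().pprint()

    ssc.start()
    ssc.awaitTermination()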