What performance overhead is caused by YARN, or what configuration changes take effect when the app is run through YARN?
The following example:

    sqlContext.sql("SELECT dayStamp(date), count(distinct deviceId) AS c FROM full GROUP BY dayStamp(date) ORDER BY c DESC LIMIT 10").collect()

runs fine in the shell when we use the standalone scheduler:

    ./spark-shell --master sparkmaster:7077 --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8

but fails due to a lost executor when we run it through YARN:

    ./spark-shell --master yarn-client --executor-memory 20g --executor-cores 10 --driver-memory 10g --num-executors 8

There are no informative logs, just messages that executors are being lost, plus "connection refused" errors (apparently a consequence of the executor failures).

The cluster is the same in both cases: 8 nodes with 64 GB RAM each. The data format is Parquet.

--
RGRDZ Harut
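One thing worth checking (my assumption, not something stated in the post): in yarn-client mode each executor runs inside a YARN container sized as --executor-memory plus spark.yarn.executor.memoryOverhead, which by default is roughly max(384 MB, 10% of executor memory). A quick sketch of that arithmetic for the settings above:

    # Sketch of the per-executor YARN container request for the settings above.
    # Assumes Spark's default overhead rule, max(384 MB, 10% of executor memory);
    # the actual limit on your cluster (yarn.nodemanager.resource.memory-mb /
    # the scheduler's maximum allocation) is an assumption to verify.
    executor_memory_mb = 20 * 1024                      # --executor-memory 20g
    overhead_mb = max(384, executor_memory_mb // 10)    # default overhead rule
    container_request_mb = executor_memory_mb + overhead_mb

    print(container_request_mb)  # 22528 MB requested per executor container

If YARN's per-container maximum allocation is below that request, or the NodeManager kills containers that exceed their limit, executors disappear with exactly this "lost executor / connection refused" pattern, even though the standalone scheduler (which enforces no such container limit) runs the same job fine.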