Thanks. That seems to work great, except EMR doesn't always copy the logs to S3. The behavior seems inconsistent and I am debugging it now.
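For anyone following along, here is roughly how I am checking whether the event logs actually landed; the bucket and prefix are placeholders for whatever you set in `spark.eventLog.dir`:

```
# Placeholder bucket/prefix -- substitute the value you set in
# spark.eventLog.dir. Event log files are named after the application ID.
aws s3 ls s3://my-bucket/spark-event-logs/ --recursive

# Compare against the application IDs YARN reports on the EMR master node.
yarn application -list -appStates FINISHED
```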
On Fri, Mar 31, 2017 at 7:46 AM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:

> You can provide your own log directory, where the Spark event logs will
> be saved and which you can replay afterwards.
>
> Set this in your job: `spark.eventLog.dir=s3://bucket/some/directory` and
> run it.
> Note! The path `s3://bucket/some/directory` must exist before you run
> your job; it will not be created automatically.
>
> The Spark HistoryServer on EMR won't show you anything because it looks
> for logs in `hdfs:///var/log/spark/apps` by default.
>
> After that you can either copy the log files from S3 to the HDFS path
> above, or copy them locally to `/tmp/spark-events` (the default directory
> for Spark logs) and run the history server like this:
> ```
> cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
> sbin/start-history-server.sh
> ```
> and then open http://localhost:18080
>
> On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay <paulhtremb...@gmail.com>
> wrote:
>
>> I am looking for tips on evaluating my Spark job after it has run.
>>
>> I know that right now I can look at the history of jobs through the web
>> UI. I also know how to look at the current resources being used by a
>> similar web UI.
>>
>> However, I would like to look at the logs after the job has finished to
>> evaluate such things as how many tasks were completed, how many
>> executors were used, etc. I currently save my logs to S3.
>>
>> Thanks!
>>
>> Henry
>>
>> --
>> Paul Henry Tremblay
>> Robert Half Technology

--
Paul Henry Tremblay
Robert Half Technology
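For anyone who finds this thread later, here is a minimal end-to-end sketch of the steps Vadim describes above, assuming the AWS CLI is available on the node; `my-bucket`, the `spark-event-logs/` prefix, and `my_job.py` are placeholders:

```
# 1. Create the S3 "directory" up front -- Spark will not create it.
#    (A zero-byte object with a trailing slash acts as the directory.)
aws s3api put-object --bucket my-bucket --key spark-event-logs/

# 2. Submit the job with event logging pointed at S3.
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3://my-bucket/spark-event-logs/ \
  my_job.py

# 3. Afterwards, copy the logs into the history server's default local
#    directory and start it.
mkdir -p /tmp/spark-events
aws s3 cp s3://my-bucket/spark-event-logs/ /tmp/spark-events/ --recursive
cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
sbin/start-history-server.sh
# ...then browse to http://localhost:18080
```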