Hello,

I'm investigating a solution for real-time monitoring of the Spark logs
produced by my EMR cluster, in order to collect statistics and trigger
alarms. Being on EMR, CloudWatch Logs + Lambda looked pretty
straightforward, and since I'm on AWS those services are well integrated
with each other... but I could only find examples of using them with
standalone EC2 instances.

In my use case (EMR 3.9, Spark 1.4.1, drivers running on YARN in cluster
mode), I would like to monitor the Spark logs in real time, not just after
processing ends and they are copied to S3. Is there any out-of-the-box
solution or best practice for accomplishing this on EMR that I'm not aware
of?

Spark logs are written as YARN container logs on the local file systems of
the Data Nodes (Core instances), so installing the awslogs agent on those
nodes and pointing it at the container log files would probably be enough
to push the logs to CloudWatch, but I was wondering how the community
monitors application logs in real time when running Spark on YARN on EMR.
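To make the agent idea concrete, here is the kind of awslogs agent stanza I
have in mind; it's only a sketch, and the container log directory, log group
name, and datetime format are assumptions on my side (the actual directory
depends on yarn.nodemanager.log-dirs for the EMR release in use):

```ini
# Sketch of an awslogs agent section watching YARN container logs.
# The file path below is an assumption; verify yarn.nodemanager.log-dirs
# on your EMR release before using it.
[yarn-container-logs]
file = /mnt/var/log/hadoop-yarn/containers/application_*/container_*/stderr
log_group_name = /emr/yarn-containers
log_stream_name = {instance_id}
datetime_format = %y/%m/%d %H:%M:%S
initial_position = start_of_file
buffer_duration = 5000
```

The wildcard should let the agent pick up new container directories as
applications start, though I haven't verified how it behaves when containers
are cleaned up by YARN log aggregation.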

Or maybe I'm looking at the wrong solution. Maybe the correct way would be
something like a CloudwatchSink, so that Spark (log4j) pushes logs directly
to the sink and the sink forwards them to CloudWatch (I do like the
out-of-the-box EMR logging experience, and I want to keep the usual
archiving of logs to S3 when the EMR cluster is terminated).
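For the log4j route, I imagine a log4j.properties along these lines; note
that the CloudWatchAppender class here is hypothetical (I'm not aware of one
shipping with Spark 1.4.1 or log4j 1.x, so it would have to be a custom or
third-party appender), and the group/stream names are just placeholders:

```properties
# Sketch only: com.example.CloudWatchAppender is a hypothetical custom
# appender that would batch events and call PutLogEvents.
log4j.rootCategory=INFO, console, cloudwatch
log4j.appender.cloudwatch=com.example.CloudWatchAppender
log4j.appender.cloudwatch.logGroup=/emr/spark-applications
log4j.appender.cloudwatch.logStream=driver
log4j.appender.cloudwatch.layout=org.apache.log4j.PatternLayout
log4j.appender.cloudwatch.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Keeping the existing console/file appenders alongside the custom one should
preserve the normal EMR S3 archiving behavior, if I understand log4j's
additivity correctly.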

Any ideas or experience about this problem?

Thank you.

Roberto
