Hello, I'm investigating a solution to monitor, in real time, the Spark logs produced by my EMR cluster, in order to collect statistics and trigger alarms. Since I'm on AWS, the CloudWatch Logs + Lambda combination looks pretty straightforward, as those services are well integrated with each other, but I could only find examples of it running on standalone EC2 instances.
In my use case (EMR 3.9, Spark 1.4.1, drivers running on YARN in cluster mode), I would like to monitor the Spark logs in real time, not just after processing ends and the logs are copied to S3. Is there an out-of-the-box solution or best practice for accomplishing this on EMR that I'm not aware of?

Spark logs are written to the local file systems of the data nodes (core instances) as YARN container logs, so installing the awslogs agent on those nodes and pointing it at the log files would probably push them to CloudWatch. But I'm wondering how the community monitors application logs in real time when running Spark on YARN on EMR, or whether I'm looking at the wrong kind of solution altogether.

Maybe the correct approach would be something like a CloudwatchSink, so that Spark (log4j) pushes logs directly to the sink and the sink pushes them to CloudWatch. (I do like the out-of-the-box EMR logging experience, and I want to keep the usual archiving of logs to S3 when the EMR cluster is terminated.)

Any ideas or experience with this problem? Thank you.

Roberto
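To make the awslogs idea concrete, this is the kind of bootstrap-action sketch I have in mind for the core nodes. Everything here is an assumption on my part and untested: the package/service name (`awslogs`), the YARN container log path on this AMI, and the log group/stream names are all placeholders to be verified on the actual EMR 3.x image.

```shell
#!/bin/bash
# Sketch of an EMR bootstrap action (assumptions, not verified on EMR 3.x):
# install the CloudWatch Logs agent and tail the YARN container logs.

# Package name is an assumption; older Amazon Linux AMIs may instead need
# the awslogs-agent-setup.py installer.
sudo yum install -y awslogs

# Append a section tailing the YARN container logs; the glob path is an
# assumption about where EMR writes container logs on the local disk.
sudo tee -a /etc/awslogs/awslogs.conf <<'EOF'
[yarn-container-logs]
file = /var/log/hadoop-yarn/containers/*/*/*
log_group_name = /emr/yarn-containers
log_stream_name = {hostname}
datetime_format = %y/%m/%d %H:%M:%S
EOF

sudo service awslogs restart
```

The nice part of this approach is that it leaves the normal EMR log copying to S3 untouched, since the agent only tails the files that EMR writes anyway.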
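And to make the CloudwatchSink idea concrete, here is a rough sketch of what I imagine a custom log4j 1.x appender could look like, using the CloudWatch Logs `PutLogEvents` API from the AWS SDK for Java. The class and the log group/stream names are hypothetical, and this sketch deliberately omits batching, retries, and proper sequence-token recovery, so it is a starting point rather than something production-ready:

```java
// Hypothetical appender sketch (class name and config keys are my own);
// assumes the AWS SDK for Java v1 on the classpath and instance-profile credentials.
import java.util.Collections;

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

import com.amazonaws.services.logs.AWSLogsClient;
import com.amazonaws.services.logs.model.CreateLogStreamRequest;
import com.amazonaws.services.logs.model.InputLogEvent;
import com.amazonaws.services.logs.model.PutLogEventsRequest;
import com.amazonaws.services.logs.model.ResourceAlreadyExistsException;

public class CloudWatchAppender extends AppenderSkeleton {
    private String logGroup;   // injected from log4j.properties
    private String logStream;  // injected from log4j.properties
    private AWSLogsClient client;
    private String sequenceToken; // CloudWatch requires the previous call's token

    @Override
    public void activateOptions() {
        client = new AWSLogsClient(); // picks up the EMR instance profile credentials
        try {
            client.createLogStream(new CreateLogStreamRequest(logGroup, logStream));
        } catch (ResourceAlreadyExistsException e) {
            // stream already exists; fine
        }
    }

    @Override
    protected void append(LoggingEvent event) {
        // One event per call: real code would batch and handle throttling/retries.
        InputLogEvent logEvent = new InputLogEvent()
                .withTimestamp(event.getTimeStamp())
                .withMessage(layout.format(event));
        PutLogEventsRequest request =
                new PutLogEventsRequest(logGroup, logStream,
                        Collections.singletonList(logEvent))
                .withSequenceToken(sequenceToken);
        sequenceToken = client.putLogEvents(request).getNextSequenceToken();
    }

    public void setLogGroup(String logGroup) { this.logGroup = logGroup; }
    public void setLogStream(String logStream) { this.logStream = logStream; }

    @Override public void close() { client.shutdown(); }
    @Override public boolean requiresLayout() { return true; }
}
```

It would then be wired up in the Spark `log4j.properties` with something like `log4j.appender.cloudwatch=CloudWatchAppender` plus `logGroup`/`logStream`/`layout` properties, alongside the existing file appender so the S3 archiving keeps working.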