Thanks for your advice, Steve. I'm mainly talking about application logs. To be clearer, think for instance of "/<log-dir>/hadoop/userlogs/application_blablabla/container_blablabla/stderr_or_stdout", i.e. the YARN application container logs, which (at least on EMR's Hadoop 2.4) are stored on the DataNodes and aggregated/pushed only once the application completes.

"yarn logs" issued from the cluster Master doesn't let you aggregate logs on demand for applications that are still in a running/active state. For now I've managed to install the awslogs agent (http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/CWL_GettingStarted.html) on the DataNodes so as to push container logs to CloudWatch Logs in real time, but that's kind of a workaround too (a sketch of the agent configuration is below).
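For reference, a minimal agent configuration along those lines might look like the sketch below. The log group and stream names are placeholders I made up, and since I don't think the agent's wildcards reliably cover the per-container directories, you may need one section per container directory (or something that generates them):

    [general]
    state_file = /var/lib/awslogs/agent-state

    # one section per file to tail; the paths reuse the placeholders above
    [yarn-container-stderr]
    file = /<log-dir>/hadoop/userlogs/application_blablabla/container_blablabla/stderr
    log_group_name = /emr/yarn/container-logs
    log_stream_name = {hostname}-container_blablabla-stderr
    initial_position = start_of_file

Getting new per-container directories picked up as containers come and go is exactly where this approach gets fiddly.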
"yarn logs" issued from the cluster Master doesn't allow you to on-demand aggregate logs for applications the are in running/active state. For now I managed to install the awslogs agent ( http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/CWL_GettingStarted.html) on DataNodes so to push containers logs in real-time to CloudWatch logs, but that's kinda of a workaround too, this is why I was wondering what the community (in general, not only on EMR) uses to real-time monitor application logs (in an automated fashion) for long-running processes like streaming driver and if are there out-of-the-box solutions. Thanks, Roberto On Thu, Dec 10, 2015 at 3:06 PM, Steve Loughran <ste...@hortonworks.com> wrote: > > > On 10 Dec 2015, at 14:52, Roberto Coluccio <roberto.coluc...@gmail.com> > wrote: > > > > Hello, > > > > I'm investigating on a solution to real-time monitor Spark logs produced > by my EMR cluster in order to collect statistics and trigger alarms. Being > on EMR, I found the CloudWatch Logs + Lambda pretty straightforward and, > since I'm on AWS, those service are pretty well integrated together..but I > could just find examples about it using on standalone EC2 instances. > > > > In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster > mode), I would like to be able to real-time monitor Spark logs, so not just > about when the processing ends and they are copied to S3. Is there any > out-of-the-box solution or best-practice for accomplish this goal when > running on EMR that I'm not aware of? > > > > Spark logs are written on the Data Nodes (Core Instances) local file > systems as YARN containers logs, so probably installing the awslogs agent > on them and pointing to those logfiles would help pushing such logs on > CloudWatch, but I was wondering how the community real-time monitors > application logs when running Spark on YARN on EMR. > > > > Or maybe I'm looking at a wrong solution. Maybe the correct way would be > using something like a CloudwatchSink so to make Spark (log4j) pushing logs > directly to the sink and the sink pushing them to CloudWatch (I do like the > out-of-the-box EMR logging experience and I want to keep the usual eventual > logs archiving on S3 when the EMR cluster is terminated). > > > > Any ideas or experience about this problem? > > > > Thank you. > > > > Roberto > > > are you talking about event logs as used by the history server, or > application logs? > > the current spark log server writes events to a file, but as the hadoop s3 > fs client doesn't write except in close(), they won't be pushed out while > thing are running. Someone (you?) could have a go at implementing a new > event listener; some stuff that will come out in Spark 2.0 will make it > easier to wire this up (SPARK-11314), which is coming as part of some work > on spark-YARN timelineserver itnegration. > > In Hadoop 2.7.1 The log4j logs can be regularly captured by the Yarn > Nodemanagers and automatically copied out, look at > yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds . For > that to work you need to set up your log wildcard patterns to for the NM to > locate (i.e. have rolling logs with the right extensions)...the details > escape me right now > > In earlier versions, you can use "yarn logs' to grab them and pull them > down. 
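And in case anyone wants to experiment with the CloudwatchSink idea from my first mail: below is a rough sketch of a log4j 1.x appender pushing events to CloudWatch Logs through the AWS Java SDK. The group/stream names are placeholders, and this is only the skeleton of the idea; a real sink would additionally create the log group/stream, batch events, handle InvalidSequenceTokenException and retries, and guard against recursion (the SDK itself logs through log4j):

    import com.amazonaws.services.logs.AWSLogsClient;
    import com.amazonaws.services.logs.model.InputLogEvent;
    import com.amazonaws.services.logs.model.PutLogEventsRequest;
    import org.apache.log4j.AppenderSkeleton;
    import org.apache.log4j.spi.LoggingEvent;
    import java.util.Collections;

    public class CloudWatchAppender extends AppenderSkeleton {

        // uses the default credentials chain (e.g. the EMR instance profile)
        private final AWSLogsClient client = new AWSLogsClient();

        private String logGroup = "placeholder-group";   // configurable via log4j.properties
        private String logStream = "placeholder-stream"; // both must already exist
        private String sequenceToken; // null is accepted on the first call to a fresh stream

        @Override
        protected void append(LoggingEvent event) {
            InputLogEvent cwEvent = new InputLogEvent()
                    .withTimestamp(event.getTimeStamp())
                    .withMessage(layout != null ? layout.format(event)
                                                : event.getRenderedMessage());
            // one PutLogEvents call per log line: correct but inefficient,
            // a real sink would buffer and flush in batches
            PutLogEventsRequest request =
                    new PutLogEventsRequest(logGroup, logStream,
                            Collections.singletonList(cwEvent))
                    .withSequenceToken(sequenceToken);
            sequenceToken = client.putLogEvents(request).getNextSequenceToken();
        }

        @Override
        public void close() { client.shutdown(); }

        @Override
        public boolean requiresLayout() { return true; }

        public void setLogGroup(String logGroup) { this.logGroup = logGroup; }
        public void setLogStream(String logStream) { this.logStream = logStream; }
    }

Wired into the driver with a custom log4j.properties shipped to the containers, something like this would leave the normal EMR file logging, and hence the S3 archiving on cluster termination, untouched.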