Thanks for your advice, Steve.

I'm mainly talking about application logs. To be clear, think for
instance of
"/<log-dir>/hadoop/userlogs/application_blablabla/container_blablabla/stderr_or_stdout".
That is, the logs of YARN application containers, which (at least on
EMR's Hadoop 2.4) are stored on the DataNodes and aggregated/pushed only
once the application completes.

"yarn logs" issued from the cluster Master doesn't allow you to on-demand
aggregate logs for applications the are in running/active state.
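
For reference, what I mean by on-demand retrieval is the usual

  yarn logs -applicationId <application-id>

which on this version only returns something once the application has
completed and its logs have been aggregated.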

For now I managed to install the awslogs agent (
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/CWL_GettingStarted.html)
on the DataNodes so as to push container logs to CloudWatch Logs in real
time, but that's kind of a workaround too. This is why I was wondering
what the community (in general, not only on EMR) uses to monitor
application logs in real time (in an automated fashion) for long-running
processes like streaming drivers, and whether there are any
out-of-the-box solutions.
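
In case it's useful to anyone, the agent setup is roughly the following
(a minimal sketch: the log group/stream names are just examples, and the
userlogs path will depend on your EMR/YARN configuration):

  [general]
  state_file = /var/lib/awslogs/agent-state

  # Names and path below are illustrative only; adjust them to your
  # cluster's actual log layout.
  [yarn-container-logs]
  file = /mnt/var/log/hadoop/userlogs/application_*/container_*/stderr
  log_group_name = emr-yarn-container-logs
  log_stream_name = {instance_id}
  initial_position = start_of_file

One thing worth double-checking is how the agent behaves when a wildcard
matches many files at once.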

Thanks,

Roberto

On Thu, Dec 10, 2015 at 3:06 PM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> > On 10 Dec 2015, at 14:52, Roberto Coluccio <roberto.coluc...@gmail.com>
> wrote:
> >
> > Hello,
> >
> > I'm investigating a solution to monitor in real time the Spark logs
> produced by my EMR cluster, in order to collect statistics and trigger
> alarms. Being on EMR, I found the CloudWatch Logs + Lambda combination
> pretty straightforward and, since I'm on AWS, those services are pretty
> well integrated together...but I could only find examples of using them
> on standalone EC2 instances.
> >
> > In my use case, EMR 3.9 and Spark 1.4.1 drivers running on YARN (cluster
> mode), I would like to be able to monitor Spark logs in real time, not
> just once the processing ends and they are copied to S3. Is there any
> out-of-the-box solution or best practice for accomplishing this goal when
> running on EMR that I'm not aware of?
> >
> > Spark logs are written to the Data Nodes' (Core Instances') local file
> systems as YARN container logs, so probably installing the awslogs agent
> on them and pointing it at those logfiles would help push such logs to
> CloudWatch, but I was wondering how the community monitors application
> logs in real time when running Spark on YARN on EMR.
> >
> > Or maybe I'm looking at the wrong solution. Maybe the correct way would
> be using something like a CloudwatchSink, so as to make Spark (log4j)
> push logs directly to the sink and have the sink push them to CloudWatch
> (I do like the out-of-the-box EMR logging experience, and I want to keep
> the usual final archiving of logs on S3 when the EMR cluster is
> terminated).
> >
> > Any ideas or experience about this problem?
> >
> > Thank you.
> >
> > Roberto
>
>
> are you talking about event logs as used by the history server, or
> application logs?
>
> the current spark log server writes events to a file, but as the hadoop s3
> fs client doesn't write except in close(), they won't be pushed out while
> things are running. Someone (you?) could have a go at implementing a new
> event listener; some stuff that will come out in Spark 2.0 will make it
> easier to wire this up (SPARK-11314), which is coming as part of some work
> on Spark-YARN timeline server integration.
>
> In Hadoop 2.7.1, the log4j logs can be regularly captured by the YARN
> NodeManagers and automatically copied out; look at
> yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds. For
> that to work you need to set up your log wildcard patterns for the NM to
> locate (i.e. have rolling logs with the right extensions)...the details
> escape me right now.
>
> In earlier versions, you can use "yarn logs" to grab them and pull them
> down.
>
> I don't know anything about CloudWatch integration, sorry
>
