[
https://issues.apache.org/jira/browse/SAMZA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148161#comment-14148161
]
Chris Riccomini commented on SAMZA-310:
---------------------------------------
Oof, yea, this is confusing.
bq. 2. If my understanding is correct, one "containerId" may contain a few
"taskId" and "taskName"?
Correct. The relation between taskId and taskName is 1:1. The taskId is just an
implementation detail of the YARN AM at this point. We should only need to
expose taskName. I'm hoping to clean some of this up as part of SAMZA-348 when
we refactor the AM into a generic job coordinator.
bq. If 2) is true, it seems we can only partition the logs based on
"containerId", because we set the MDC once per container and cannot set it
for each task.
Technically, we MIGHT be able to update the MDC with a taskName every time we
drop into the TaskInstance code, but then the question is what do we do with
the container-level logs? In this situation you'd end up with all tasks from
a container writing to different partitions, as well as the container itself.
This would make it impossible to reconstruct a single container's logs in
time order, which is the most useful thing, IMO.
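
To make the ordering argument concrete, here is a minimal sketch (in Python for brevity; it is not Samza code). It assumes a log stream with a hypothetical partition count of 4 and a simple CRC32-based partitioner standing in for whatever partitioner the real system would use. The container ID, task names, and event tuples are all made up for illustration. Keyed by container, every line shares one key and therefore lands in one partition in arrival order. Keyed by task (with container-level lines falling back to the container ID), the same lines carry three different keys and may be spread over several partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for the log stream


def partition_for(key: str) -> int:
    """Map a key to a partition using a stable hash.

    This is an assumed stand-in, not the partitioner of any real system.
    """
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def publish(events, key_fn):
    """Append each event to the partition chosen by key_fn, in arrival order."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_for(key_fn(event))].append(event)
    return dict(partitions)


# (timestamp, container_id, task_name or None for container-level lines, message)
events = [
    (1, "container-0", None, "container starting"),
    (2, "container-0", "Partition 0", "task init"),
    (3, "container-0", "Partition 1", "task init"),
    (4, "container-0", None, "run loop started"),
    (5, "container-0", "Partition 0", "processed message"),
]

# Option A: key every line by the container ID.
by_container = publish(events, key_fn=lambda e: e[1])

# Option B: key task lines by task name, container-level lines by container ID.
by_task = publish(events, key_fn=lambda e: e[2] or e[1])

for name, result in (("by container", by_container), ("by task", by_task)):
    print(name)
    for partition, lines in sorted(result.items()):
        print("  partition", partition, [line[0] for line in lines])
```

With option A, reading the single partition back gives the container's full log in time order. With option B, a consumer would have to merge several partitions by timestamp to get the same view.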
bq. However, the "containerId" changes if one container fails and a new
container is brought up.
I think this is OK. A single container's logs will be fully ordered within a
single partition. If the container restarts, the new logs might end up on
another partition. This is effectively how logs work now with YARN, as well.
> Publish container logs to a SystemStream
> ----------------------------------------
>
> Key: SAMZA-310
> URL: https://issues.apache.org/jira/browse/SAMZA-310
> Project: Samza
> Issue Type: New Feature
> Components: container
> Affects Versions: 0.7.0
> Reporter: Martin Kleppmann
> Assignee: Yan Fang
>
> At the moment, it's a bit awkward to get to a Samza job's logs: assuming
> you're running on YARN, you have to navigate around the YARN web interface,
> and you can only see one container's logs at a time.
> Given that Samza is all about streams, it would make sense for the logs
> generated by Samza jobs to also be sent to a stream. There, they could be
> indexed with [Kibana|http://www.elasticsearch.org/overview/kibana/], consumed
> by an exception-tracking system, etc.
> Notes:
> - The serde for encoding logs into a suitable wire format should be
> pluggable. There can be a default implementation that uses JSON, analogous to
> MetricsSnapshotSerdeFactory for metrics, but organisations that already have
> a standardised in-house encoding for logs should be able to use it.
> - Should this be at the level of Slf4j or Log4j? Currently the log
> configuration for YARN jobs uses Log4j, which has the advantage that any
> frameworks/libraries that use Log4j but not Slf4j appear in the logs.
> However, Samza itself currently only depends on Slf4j. If we tie this feature
> to Log4j, it would somewhat defeat the purpose of using Slf4j.
> - Do we need to consider partitioning? Perhaps we can use the container name
> as partitioning key, so that the ordering of logs from each container is
> preserved.
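
The pluggable-serde note in the description above could look roughly like the following sketch (Python for brevity; this is not Samza's Serde API). `LogSerde`, `JsonLogSerde`, `KeyValueLogSerde`, and `encode_log` are hypothetical names introduced only for illustration. The idea is a default JSON encoding, with any object exposing the same `to_bytes` method usable in its place.

```python
import json
from typing import Protocol


class LogSerde(Protocol):
    """Hypothetical interface: turn one log event into bytes for the stream."""

    def to_bytes(self, event: dict) -> bytes: ...


class JsonLogSerde:
    """Assumed default implementation that encodes events as JSON."""

    def to_bytes(self, event: dict) -> bytes:
        return json.dumps(event, sort_keys=True).encode("utf-8")


class KeyValueLogSerde:
    """Stand-in for an organisation's existing in-house log format."""

    def to_bytes(self, event: dict) -> bytes:
        return " ".join(f"{k}={event[k]}" for k in sorted(event)).encode("utf-8")


def encode_log(event: dict, serde: LogSerde = JsonLogSerde()) -> bytes:
    """Encode an event with the configured serde, falling back to JSON."""
    return serde.to_bytes(event)


event = {"level": "INFO", "message": "started", "container": "container-0"}
print(encode_log(event))                      # JSON default
print(encode_log(event, KeyValueLogSerde()))  # swapped-in custom format
```

Because the appender would depend only on the `to_bytes` contract, swapping formats becomes a configuration choice rather than a code change.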