[ 
https://issues.apache.org/jira/browse/SAMZA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172742#comment-14172742
 ] 

Yan Fang commented on SAMZA-310:
--------------------------------

{quote}
If we use MDC, the ConversionPattern can refer to variables such as job name, 
task ID, etc in the log lines. 
{quote}

Yes, agreed. This is the main benefit MDC provides.

{quote}
have a sane default topic name (e.g. __samza-<job name>-logs, or something)
{quote}

Yes, a default topic is a good idea. I would also like to let users set their 
own topic name. They may want to publish logs from different jobs to the same 
topic, though that is probably rare.
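
A minimal sketch of that fallback, assuming the override is read from the job 
config. The key name "job.log.topic" and the class name are hypothetical and 
not part of the patch; only the default pattern comes from the suggestion 
quoted above.

```java
import java.util.Map;

final class LogTopicNames {
    // Hypothetical config key; the thread has not settled on a name.
    static final String TOPIC_OVERRIDE_KEY = "job.log.topic";

    // Returns the user-supplied topic if one is set, otherwise the
    // default derived from the job name (__samza-<job name>-logs).
    static String resolve(Map<String, String> config, String jobName) {
        String override = config.get(TOPIC_OVERRIDE_KEY);
        if (override != null && !override.trim().isEmpty()) {
            return override.trim();
        }
        return "__samza-" + jobName + "-logs";
    }
}
```

Two jobs that set the same override value would then publish to one shared 
topic, which covers the rare case mentioned above.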

{quote}
The AM knows the total number of containers (config.getTaskCount) and both the 
AM and SamzaContainer know the job name (config.getName). If the AM were to 
pass the container count (sadly, named task count right now), via an 
environment variable
{quote}

Do you mean we pass the environment variables CONTAINER_COUNT and CONTAINER_ID 
(TASK_ID)? _CONTAINER_COUNT_ seems a little awkward because it has nothing to 
do with the "environment", though it is very straightforward. Alternatively, we 
could encode CONTAINER_ID as, say, "0/7", "1/7", "2/7", where the numerator is 
the ID and the denominator is the container count. However, that makes 
CONTAINER_ID less intuitive and leaves the appender to parse it to get the 
total count.
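
To show what the parsing burden on the appender would look like, here is a 
rough sketch of the combined "id/count" format. The class name is hypothetical, 
and the check that IDs run from 0 to count - 1 is my assumption, not something 
we have agreed on.

```java
final class ContainerIdFormat {
    final int id;     // numerator: this container's ID
    final int count;  // denominator: total number of containers

    private ContainerIdFormat(int id, int count) {
        this.id = id;
        this.count = count;
    }

    // Parses a value such as "2/7". Rejects anything that is not exactly
    // two integers separated by one slash, or an ID outside [0, count).
    static ContainerIdFormat parse(String raw) {
        int slash = raw.indexOf('/');
        if (slash <= 0 || slash != raw.lastIndexOf('/') || slash == raw.length() - 1) {
            throw new IllegalArgumentException("expected <id>/<count>, got: " + raw);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        int id = Integer.parseInt(raw.substring(0, slash).trim());
        int count = Integer.parseInt(raw.substring(slash + 1).trim());
        if (count <= 0 || id < 0 || id >= count) {
            throw new IllegalArgumentException("id out of range: " + raw);
        }
        return new ContainerIdFormat(id, count);
    }
}
```

With two separate variables the appender would only need two Integer.parseInt 
calls, which is the trade-off between the two options.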

If we use MDC, we want to make it optional for performance reasons, right? (I 
have not run a performance test because I did not use MDC; a simple one is 
needed now.) So my questions are:

1) What is an appropriate property name? Something like job.enable.mdc=true? 
2) How do we pass this property? a) The AM reads the config and sets an 
environment variable (again, sadly, an environment variable...) to tell the 
containers to set MDC. b) The AM and the containers all read the config. But 
then the job.* naming style is not appropriate. 
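
For option a), the flow could look roughly like the sketch below. The property 
name job.enable.mdc is only the candidate from question 1, and the environment 
variable name SAMZA_ENABLE_MDC and the class name are hypothetical. MDC is 
treated as disabled when nothing is set.

```java
import java.util.HashMap;
import java.util.Map;

final class MdcFlag {
    // Candidate property name from question 1; not final.
    static final String CONFIG_KEY = "job.enable.mdc";
    // Hypothetical environment variable name.
    static final String ENV_KEY = "SAMZA_ENABLE_MDC";

    // AM side: turn the config value into an environment entry that is
    // added to each container's launch environment.
    static Map<String, String> containerEnv(Map<String, String> config) {
        Map<String, String> env = new HashMap<>();
        boolean enabled = Boolean.parseBoolean(config.get(CONFIG_KEY));
        env.put(ENV_KEY, String.valueOf(enabled));
        return env;
    }

    // Container side: decide whether to populate MDC. A missing or
    // unparseable value counts as false.
    static boolean mdcEnabled(Map<String, String> env) {
        return Boolean.parseBoolean(env.get(ENV_KEY));
    }
}
```

A container would call MdcFlag.mdcEnabled(System.getenv()) once at startup and 
skip the MDC setup when it returns false.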



> Publish container logs to a SystemStream
> ----------------------------------------
>
>                 Key: SAMZA-310
>                 URL: https://issues.apache.org/jira/browse/SAMZA-310
>             Project: Samza
>          Issue Type: New Feature
>          Components: container
>    Affects Versions: 0.7.0
>            Reporter: Martin Kleppmann
>            Assignee: Yan Fang
>         Attachments: SAMZA-310.patch
>
>
> At the moment, it's a bit awkward to get to a Samza job's logs: assuming 
> you're running on YARN, you have to navigate around the YARN web interface, 
> and you can only see one container's logs at a time.
> Given that Samza is all about streams, it would make sense for the logs 
> generated by Samza jobs to also be sent to a stream. There, they could be 
> indexed with [Kibana|http://www.elasticsearch.org/overview/kibana/], consumed 
> by an exception-tracking system, etc.
> Notes:
> - The serde for encoding logs into a suitable wire format should be 
> pluggable. There can be a default implementation that uses JSON, analogous to 
> MetricsSnapshotSerdeFactory for metrics, but organisations that already have 
> a standardised in-house encoding for logs should be able to use it.
> - Should this be at the level of Slf4j or Log4j? Currently the log 
> configuration for YARN jobs uses Log4j, which has the advantage that any 
> frameworks/libraries that use Log4j but not Slf4j appear in the logs. 
> However, Samza itself currently only depends on Slf4j. If we tie this feature 
> to Log4j, it would somewhat defeat the purpose of using Slf4j.
> - Do we need to consider partitioning? Perhaps we can use the container name 
> as partitioning key, so that the ordering of logs from each container is 
> preserved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
