[
https://issues.apache.org/jira/browse/SAMZA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178267#comment-14178267
]
Martin Kleppmann commented on SAMZA-310:
----------------------------------------
Nice work -- I'm glad to see this is happening!
However, I'm not really keen on how the Kafka configuration has to be
duplicated in the log4j config as well as the job config. As the log4j config
will be baked into the zip file that is deployed to a cluster, any change to
the log4j config would require the job package to be rebuilt. It would also
make life difficult for anyone who wants to integrate with an external
configuration management system (e.g. LinkedIn has a centralised config system
for things like Kafka/ZK hostnames, and it deliberately separates deployable
artifacts from their configuration).
I understand the problem that the log appender can't access the job config, but
to me that doesn't feel like a good enough reason to burden users with
duplicating config. Could we arrange it so that the AM passes any necessary
parameters to containers via environment variables or Java system properties?
Then users only need to include them once, in the job config.
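Concretely, the AM already knows the full job config when it builds the
container start command, so it could forward the handful of settings the
appender needs as -D flags. A rough sketch (not actual Samza code; the config
keys below are only examples, not a proposed list):
{code}
import org.apache.samza.config.Config

// Rough sketch, not actual Samza code: when the AM assembles the container's
// JVM options, it could forward whatever the log appender needs as -D system
// properties, taken straight from the job config.
def log4jSystemProps(config: Config): String = {
  // Example keys only -- whatever the appender actually needs would go here.
  val forwardedKeys = Seq(
    "systems.kafka.producer.metadata.broker.list",
    "systems.kafka.consumer.zookeeper.connect")

  forwardedKeys
    .flatMap(key => Option(config.get(key)).map(value => "-D%s=%s".format(key, value)))
    .mkString(" ")
}
{code}
Log4j 1.2 already substitutes ${...} references in log4j.xml from system
properties, so the packaged log4j file would contain no Kafka-specific values
at all.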
Or perhaps the Log4j appender could just use the job config as passed in the
environment variable? (like when the [container starts
up|https://github.com/apache/incubator-samza/blob/master/samza-yarn/src/main/scala/org/apache/samza/job/yarn/SamzaAppMaster.scala#L72]:
{{val config = new
MapConfig(JsonConfigSerializer.fromJson(System.getenv(ShellCommandConfig.ENV_CONFIG)))}})
— It's not beautiful to parse the config twice, but I think it's better than
asking the user to configure the same thing in two different places.
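For illustration, something along these lines might work (a sketch only: the
class name {{StreamAppender}} and its internals are made up here, but the
bootstrap line is the same one the AM already uses):
{code}
import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.spi.LoggingEvent
import org.apache.samza.config.{Config, MapConfig, ShellCommandConfig}
import org.apache.samza.config.serializers.JsonConfigSerializer

// Sketch only: the appender bootstraps itself from the serialized job config
// in the container's environment, the same way the AM does, so nothing needs
// to be repeated in the log4j config file.
class StreamAppender extends AppenderSkeleton {
  // Parsed once, on first use, from the env var the container was started with.
  lazy val config: Config =
    new MapConfig(JsonConfigSerializer.fromJson(System.getenv(ShellCommandConfig.ENV_CONFIG)))

  override def append(event: LoggingEvent) {
    // Look up the target SystemStream and serde in `config`, then send the
    // rendered event via the system producer. (Illustrative; not the patch's API.)
  }

  override def requiresLayout() = false

  override def close() {
    // Flush and shut down the underlying producer here.
  }
}
{code}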
> Publish container logs to a SystemStream
> ----------------------------------------
>
> Key: SAMZA-310
> URL: https://issues.apache.org/jira/browse/SAMZA-310
> Project: Samza
> Issue Type: New Feature
> Components: container
> Affects Versions: 0.7.0
> Reporter: Martin Kleppmann
> Assignee: Yan Fang
> Fix For: 0.8.0
>
> Attachments: SAMZA-310.patch
>
>
> At the moment, it's a bit awkward to get to a Samza job's logs: assuming
> you're running on YARN, you have to navigate around the YARN web interface,
> and you can only see one container's logs at a time.
> Given that Samza is all about streams, it would make sense for the logs
> generated by Samza jobs to also be sent to a stream. There, they could be
> indexed with [Kibana|http://www.elasticsearch.org/overview/kibana/], consumed
> by an exception-tracking system, etc.
> Notes:
> - The serde for encoding logs into a suitable wire format should be
> pluggable. There can be a default implementation that uses JSON, analogous to
> MetricsSnapshotSerdeFactory for metrics, but organisations that already have
> a standardised in-house encoding for logs should be able to use it.
> - Should this be at the level of Slf4j or Log4j? Currently the log
> configuration for YARN jobs uses Log4j, which has the advantage that any
> frameworks/libraries that use Log4j but not Slf4j appear in the logs.
> However, Samza itself currently only depends on Slf4j. If we tie this feature
> to Log4j, it would somewhat defeat the purpose of using Slf4j.
> - Do we need to consider partitioning? Perhaps we can use the container name
> as partitioning key, so that the ordering of logs from each container is
> preserved.