[ https://issues.apache.org/jira/browse/SAMZA-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040980#comment-14040980 ]

Martin Kleppmann commented on SAMZA-300:
----------------------------------------

bq. I am not sure in which situation two jobs will publish to the same 
changelog/checkpoint stream? From my understanding, all jobs will be assigned 
different kafka topic names.

The checkpoint stream name is generated from the {{job.name}} and {{job.id}}, 
but afaik nothing is currently enforcing uniqueness of those properties. You 
can easily run two jobs with the same name and ID, and they will write to the 
same checkpoint topic.
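To illustrate the collision: the checkpoint topic name is a pure function of {{job.name}} and {{job.id}}, so two independently submitted jobs with identical values end up with the same topic. A minimal sketch (the naming scheme shown follows the `__samza_checkpoint_ver_1_for_<name>_<id>` convention, but treat the exact format as illustrative):

```python
def checkpoint_topic(job_name: str, job_id: str) -> str:
    """Derive the checkpoint topic name from job.name and job.id.

    The format below is illustrative of Samza's naming convention;
    the point is only that the name depends on nothing else.
    """
    return "__samza_checkpoint_ver_1_for_%s_%s" % (job_name, job_id)

# Two jobs submitted with copy-pasted config collide silently:
topic_a = checkpoint_topic("wikipedia-parser", "1")
topic_b = checkpoint_topic("wikipedia-parser", "1")
assert topic_a == topic_b  # both jobs write to the same checkpoint topic
```

Nothing in the submission path rejects the second job, which is exactly the gap a write lock would close.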

The changelog stream name is set in the configuration. If an inexperienced user 
is copying and pasting config from another job, they might not know that they 
need to change the changelog stream name, and thus end up starting another job 
that writes to the same stream.
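Concretely, a copied config might look like this (property names follow Samza's {{stores.*}} convention; the store and topic names are hypothetical):

```properties
# Copied verbatim from another job's config. The user changed job.name
# but not the changelog topic:
job.name=my-new-job
stores.my-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.my-store.changelog=kafka.my-store-changelog
# If the original job is still running, both jobs now write to
# my-store-changelog, corrupting each other's restore state.
```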

Both of these are quite likely scenarios, I think, so they would be worth 
preventing.

bq. users accidentally publish wrong messages to the checkpoint/changelog topic 
with whatever methods, such as command line

CheckpointTool is another good example, actually. Using it to modify a 
checkpoint only works if the job isn't currently running (otherwise the running 
job will never read the modified checkpoint). A write lock would prevent such a 
futile use of CheckpointTool.

bq. Maybe throw the relevant information to another stream which is for the 
dashboard?

That's a possibility, and would probably work well for metrics data (e.g. it 
would be nice to annotate the job graph visualization with messages/sec on the 
streams).

For a lock, Zookeeper would probably work better, as it could use an ephemeral 
node (if the job's AM and/or Kafka producer dies, the lock would automatically 
be released). However, I'm not sure whether this would better be implemented in 
Kafka or in Samza.
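The ephemeral-node approach would work roughly as follows: the job's AM creates an ephemeral znode named after the stream; creation fails if the node already exists, and the node disappears automatically when the creating session dies. A minimal in-memory sketch of that protocol (no real ZooKeeper client here; the classes and the path layout are purely illustrative):

```python
class NodeExistsError(Exception):
    """Raised when the lock znode already exists (lock is held)."""


class FakeZooKeeper:
    """In-memory stand-in for ZooKeeper's ephemeral-node semantics."""

    def __init__(self):
        self._nodes = {}  # path -> owning session id

    def create_ephemeral(self, path, session):
        if path in self._nodes:
            raise NodeExistsError(path)
        self._nodes[path] = session

    def session_expired(self, session):
        # ZooKeeper deletes a dead session's ephemeral nodes automatically.
        self._nodes = {p: s for p, s in self._nodes.items() if s != session}


def acquire_write_lock(zk, stream, session):
    """Try to take the write lock on a stream; returns True on success."""
    try:
        zk.create_ephemeral("/samza/stream-locks/%s" % stream, session)
        return True
    except NodeExistsError:
        return False


zk = FakeZooKeeper()
assert acquire_write_lock(zk, "my-checkpoint-topic", session="job-A")
assert not acquire_write_lock(zk, "my-checkpoint-topic", session="job-B")
zk.session_expired("job-A")  # Job A's AM dies -> lock released automatically
assert acquire_write_lock(zk, "my-checkpoint-topic", session="job-B")
```

The automatic release on session death is the key advantage over storing the lock in a Kafka topic, where a crashed producer would leave the lock held forever without some extra expiry mechanism.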

> Track producers and consumers of streams
> ----------------------------------------
>
>                 Key: SAMZA-300
>                 URL: https://issues.apache.org/jira/browse/SAMZA-300
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Martin Kleppmann
>
> Each Samza job runs independently, which has a lot of advantages. However, 
> there are situations in which it would be valuable to have a global overview 
> of the data flows between jobs. For example:
> - It's important for correctness that only one job ever publishes to a given 
> checkpoint or changelog stream — if several jobs publish to the same stream, 
> the result is nonsensical. However, we currently have no way of enforcing 
> that. It would be good if a job could take a "write lock" on a stream, and 
> thus prevent others from writing to it.
> - It would be awesome to have a dashboard/visualization that graphically 
> shows the job graph, and visually highlights the health of a job (e.g. 
> whether a job has fallen behind).
> - The job graph would also be generally useful for tracking data provenance 
> (finding consumers who would be affected by a schema change, finding the team 
> that is responsible for producing a particular stream, etc.)
> - Potentially could include additional metadata about streams, e.g. owner, 
> serialization format, schema, documentation of semantics of the data, etc. 
> (HCatalog for streams?)
> One possibility would be for Kafka to add some of this functionality, 
> although it may also make sense to implement it in Samza (that way it would 
> be available for non-Kafka systems as well, and could use knowledge about the 
> job that Samza has, but Kafka hasn't).
> This is just a vague description to start a discussion. Please comment with 
> your ideas on how to best implement this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
