[jira] [Commented] (KAFKA-260) Add audit trail to kafka

2014-10-27 Thread Stas Levin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185220#comment-14185220
 ] 

Stas Levin commented on KAFKA-260:
--

Hi guys,

We've adopted the data model above in Aletheia 
(https://github.com/outbrain/Aletheia), an open-source data delivery framework 
we've been working on here at Outbrain. 
In Aletheia we call these audit trails "Breadcrumbs", and they are generated on 
both the producer and consumer sides. We're working towards integrating the 
above-mentioned patch in order to provide a client-side dashboard.

Aletheia is by no means meant to replace Kafka; rather, it is an abstraction 
layer on top of Kafka and other messaging systems, as we point out in the wiki.
Having audit capabilities built into Kafka would be really great. In the 
meantime, you're most welcome to check out Aletheia; perhaps you'll find it 
useful, as it provides Breadcrumb generation out of the box.

-Stas

> Add audit trail to kafka
> 
>
> Key: KAFKA-260
> URL: https://issues.apache.org/jira/browse/KAFKA-260
> Project: Kafka
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Jay Kreps
>Assignee: Jay Kreps
> Attachments: Picture 18.png, kafka-audit-trail-draft.patch
>
>
> LinkedIn has a system that does monitoring on top of our data flow to ensure 
> all data is delivered to all consumers of data. This works by having each 
> logical "tier" through which data passes produce messages to a central 
> "audit-trail" topic; these messages give a time period and the number of 
> messages that passed through that tier in that time period. Examples of tiers 
> for data might be "producer", "broker", "hadoop-etl", etc. This makes it 
> possible to compare the total events for a given time period to ensure that 
> all events that are produced are consumed by all consumers.
> This turns out to be extremely useful. We also have an application that 
> "balances the books" and checks that all data is consumed in a timely 
> fashion. This gives graphs for each topic and shows any data loss and the lag 
> at which the data is consumed (if any).
> This would be an optional feature that would allow you to do this kind of 
> reconciliation automatically for all the topics kafka hosts against all the 
> tiers of applications that interact with the data.
> Some details: the proposed format of the data is JSON, using the following 
> format for messages:
> {
>   "time": 1301727060032, // the timestamp at which this audit message is sent
>   "topic": "my_topic_name", // the topic this audit data is for
>   "tier": "producer", // a user-defined "tier" name
>   "bucket_start": 130172640, // the beginning of the time bucket this data applies to
>   "bucket_end": 130172700, // the end of the time bucket this data applies to
>   "host": "my_host_name.datacenter.linkedin.com", // the server that this was sent from
>   "datacenter": "hlx32", // the datacenter this occurred in
>   "application": "newsfeed_service", // a user-defined application name
>   "guid": "51656274-a86a-4dff-b824-8e8e20a6348f", // a unique identifier for this message
>   "count": 43634
> }
> DISCUSSION
> Time is complex:
> 1. The audit data must be based on a timestamp in the events, not the time on 
> the machine processing the event. Using this timestamp means that all downstream 
> consumers will report audit data for the right time bucket. This means that 
> there must be a timestamp in the event, which we don't currently require. 
> Arguably we should just add a timestamp to the events, but I think it is 
> sufficient for now just to allow the user to provide a function to extract 
> the time from their events.
> 2. For counts to reconcile exactly we can only do analysis at a granularity 
> based on the least common multiple of the bucket size used by all tiers. The 
> simplest is just to configure them all to use the same bucket size. We 
> currently use a bucket size of 10 mins, but anything from 1-60 mins is 
> probably reasonable.
> For analysis purposes one tier is designated as the source tier and we do 
> reconciliation against this count (e.g. if another tier has less, that is 
> treated as loss; if another tier has more, that is duplication).
> Note that this system makes false positives possible since you can lose an 
> audit message. It also makes false negatives possible since if you lose both 
> normal messages and the associated audit messages it will appear that 
> everything adds up. The latter problem is astronomically unlikely to happen 
> exactly, though.
> This would integrate into the client (producer and consumer both) in the 
> following way:
> 1. The user provides a way to get timestamps from messages (required)
> 2. The user configures the tier name, host name, datacenter name, and 
> application name as part of the consumer and producer config. We can provide 
> reasonable defaults if not supplied (e.g. if it is a Producer then set tier 
> to "producer" and get the hostname from the OS).
> The application that processes this data is currently a Java Jetty app and 
> talks to mysql. It feeds off the audit topic in kafka and runs both automatic 
> monitoring checks and graphical displays of data against this. The data layer 
> is not terribly scalable but because the audit data is sent only periodically 
> this is enough to allow us to audit thousands of servers on very modest 
> hardware, and having sql access makes divin
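
To make the integration points above concrete, here is a minimal, illustrative
Java sketch (not the attached patch) of a per-tier audit counter: it buckets
messages by a user-supplied timestamp extractor and a fixed bucket size, and
renders the JSON audit record described in the issue. All class, field, and
method names are assumptions made for illustration only.

// Illustrative sketch only -- not the attached patch. Counts events into fixed
// time buckets keyed by the event timestamp and builds one audit record per bucket.
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

public class AuditCounter<M> {
    private final ToLongFunction<M> timestampExtractor; // user-supplied, required
    private final long bucketMs;                         // e.g. 10 minutes
    private final String topic, tier, host, datacenter, application;
    private final Map<Long, Long> counts = new ConcurrentHashMap<>();

    public AuditCounter(ToLongFunction<M> timestampExtractor, long bucketMs,
                        String topic, String tier, String host,
                        String datacenter, String application) {
        this.timestampExtractor = timestampExtractor;
        this.bucketMs = bucketMs;
        this.topic = topic;
        this.tier = tier;
        this.host = host;
        this.datacenter = datacenter;
        this.application = application;
    }

    /** Call once per message that passes through this tier. */
    public void record(M message) {
        long bucketStart = (timestampExtractor.applyAsLong(message) / bucketMs) * bucketMs;
        counts.merge(bucketStart, 1L, Long::sum);
    }

    /** Build the JSON audit message for one closed bucket. */
    public String auditMessage(long bucketStart) {
        long count = counts.getOrDefault(bucketStart, 0L);
        return String.format(
            "{\"time\":%d,\"topic\":\"%s\",\"tier\":\"%s\",\"bucket_start\":%d," +
            "\"bucket_end\":%d,\"host\":\"%s\",\"datacenter\":\"%s\"," +
            "\"application\":\"%s\",\"guid\":\"%s\",\"count\":%d}",
            System.currentTimeMillis(), topic, tier, bucketStart,
            bucketStart + bucketMs, host, datacenter, application,
            UUID.randomUUID(), count);
    }
}

The resulting string would then be published to the central audit-trail topic
with an ordinary producer, one record per tier per bucket.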
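
The reconciliation step described in the discussion (one tier designated as the
source, every other tier compared against it) might look roughly like the
following sketch; the method signature and the per-tier map shape are
assumptions, not part of the proposal.

// Illustrative reconciliation sketch: compare each tier's count for a bucket
// against the designated source tier. A shortfall is treated as loss, an
// excess as duplication.
import java.util.Map;

public class BucketReconciler {
    public static void reconcile(String topic, long bucketStart,
                                 String sourceTier, Map<String, Long> countsByTier) {
        long expected = countsByTier.getOrDefault(sourceTier, 0L);
        for (Map.Entry<String, Long> e : countsByTier.entrySet()) {
            if (e.getKey().equals(sourceTier)) continue;
            long diff = e.getValue() - expected;
            if (diff < 0) {
                System.out.printf("%s bucket %d: tier %s lost %d messages%n",
                                  topic, bucketStart, e.getKey(), -diff);
            } else if (diff > 0) {
                System.out.printf("%s bucket %d: tier %s has %d duplicates%n",
                                  topic, bucketStart, e.getKey(), diff);
            }
        }
    }
}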

[jira] [Commented] (KAFKA-260) Add audit trail to kafka

2014-02-25 Thread Roger Hoover (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911909#comment-13911909
 ] 

Roger Hoover commented on KAFKA-260:


"It also makes false negatives possible since if you lose both normal messages 
and the associated audit messages it will appear that everything adds up. The 
later problem is astronomically unlikely to happen exactly, though."

This may be true once messages have reached a broker. However, if a producer 
process were to be killed (say by SIGKILL), both its committed messages and the 
audit data would be lost. Would it make sense to persist the audit counts to 
the file system for producers so that they could potentially be recovered?
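
As a rough illustration of that suggestion (not a proposal for the actual
patch), a producer could periodically snapshot its in-memory bucket counts to a
local file and reload them on restart; the file path handling and the
"bucketStart<TAB>count" line format below are made up for the example.

// Sketch of persisting audit counts locally so a killed producer can re-send
// them after restart. Format and paths are illustrative assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;

public class AuditCheckpoint {
    private final Path file;

    public AuditCheckpoint(String path) {
        this.file = Paths.get(path);
    }

    /** Overwrite the checkpoint with the latest snapshot of bucket counts. */
    public void flush(Map<Long, Long> countsByBucket) throws IOException {
        String body = countsByBucket.entrySet().stream()
            .map(e -> e.getKey() + "\t" + e.getValue())
            .collect(Collectors.joining("\n"));
        Files.writeString(file, body);
    }

    /** Reload any counts written before the process was killed. */
    public Map<Long, Long> recover() throws IOException {
        if (!Files.exists(file)) return Map.of();
        return Files.readAllLines(file).stream()
            .filter(line -> !line.isBlank())
            .map(line -> line.split("\t"))
            .collect(Collectors.toMap(p -> Long.parseLong(p[0]),
                                      p -> Long.parseLong(p[1])));
    }
}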

[jira] [Commented] (KAFKA-260) Add audit trail to kafka

2014-01-30 Thread Daniel Tardón (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886455#comment-13886455
 ] 

Daniel Tardón commented on KAFKA-260:
-

Hi all,

Ashok, do you have any news about this kafka-audit topic? Did you resolve your 
doubts?

I am looking for this kind of feature.

Regards

[jira] [Commented] (KAFKA-260) Add audit trail to kafka

2014-01-06 Thread Ashok Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863417#comment-13863417
 ] 

Ashok Gupta commented on KAFKA-260:
---

Hi Jay, 

Is the final kafka-audit code available in kafka 0.8 or somewhere else? I don't 
see a top-level directory named "audit" in the kafka source.

I see the audit feature mentioned in the documentation at 
http://kafka.apache.org/documentation.html; however, I am not able to find more 
details about how to configure and use it.

[jira] [Commented] (KAFKA-260) Add audit trail to kafka

2013-01-17 Thread Felix GV (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556958#comment-13556958
 ] 

Felix GV commented on KAFKA-260:


It would be possible to have optional timestamps by using the magic byte at the 
beginning of the Kafka messages, no? If the message contains the old (current) 
magic byte, then there's no timestamp; if it's the new magic byte, then there 
is a timestamp (without needing a key) somewhere in the header...
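
As an illustration of that idea (the offsets and magic values below are
assumptions, not the real Kafka wire format), a parser could branch on the
magic byte to decide whether a header timestamp is present:

// Sketch of Felix's suggestion: a new magic byte signals that an 8-byte
// timestamp follows in the message header; the old magic byte means none.
import java.nio.ByteBuffer;
import java.util.OptionalLong;

public class TimestampedMessageParser {
    private static final byte OLD_MAGIC = 0; // no timestamp in the header
    private static final byte NEW_MAGIC = 1; // 8-byte timestamp follows the magic byte

    public static OptionalLong readTimestamp(ByteBuffer message) {
        byte magic = message.get(0);
        if (magic == NEW_MAGIC) {
            return OptionalLong.of(message.getLong(1));
        }
        // Old format: the caller must fall back to extracting time from the payload.
        return OptionalLong.empty();
    }
}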
