Re: Resilient Producer
The existing tail source is not the best choice in your scenario, as you have pointed out. SpoolDir could be a solution if your log-file rotation interval is very short (five minutes, for example), but then you have to deal with a huge number of files in the folder (slower listings).

There is a proposal for a new approach, something that combines the best of tail and spoolDir. Take a look here: https://issues.apache.org/jira/browse/FLUME-2498

--
David Morales de Frías :: +34 607 010 411 :: @dmoralesdf https://twitter.com/dmoralesdf
http://www.stratio.com/
Vía de las dos Castillas, 33, Ática 4, 3ª Planta, 28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // @stratiobd https://twitter.com/StratioBD
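(For readers following the link: FLUME-2498 is the Taildir source proposal, which later shipped as Flume's TAILDIR source. A minimal sketch of how it is configured, assuming Flume 1.7-era property names; the agent name, paths, and file pattern are illustrative:

# Taildir source: tails files matched by a pattern and records its read
# position in a JSON file, so it can resume where it left off after a restart.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.filegroups = fg1
a1.sources.r1.filegroups.fg1 = /var/log/app/.*\.log
a1.sources.r1.positionFile = /var/lib/flume/taildir_position.json

This is what gives the "best of tail and spoolDir": tail-like streaming of live files, without spoolDir's requirement that files be complete and immutable before they are picked up.)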
Re: Resilient Producer
Thanks David. This looks interesting; we will definitely test it out to see whether it solves our problem.
Re: Resilient Producer
We have been using Flume to solve a very similar use case. Our servers write log files to the local file system, and a Flume agent then ships the data to Kafka. Flume can be used with an exec source running tail.

Though the exec source runs well with tail, there are issues if the agent goes down or the file channel starts backing up. If the agent goes down, you can ask the exec tail source to go back n lines or to read from the beginning of the file. The challenge is that we roll our log files daily: if the agent goes down in the evening, we have to reprocess the entire day's worth of data, which slows down the data flow. We can also go back an arbitrary number of lines, but then we don't know what the right number is. This is a real challenge for us.

We have also tried the spooling-directory source, which works, but it requires a different log-rotation policy. We even considered rotating files every minute, but that would still delay the real-time flow in our Kafka -> Storm -> Elasticsearch pipeline by a minute. We are going to do a PoC on Logstash to see whether it solves these problems with Flume.

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:

Hi all, I'm evaluating using Kafka. I liked the approach of Facebook's Scribe, where you log to your own machine and a separate process forwards the messages to the central logger. With Kafka it seems that I have to embed the publisher in my app and deal with any communication problems on the producer side.

I googled quite a bit trying to find a project that provides a daemon that tails a log file and sends the lines to the Kafka cluster (something like tail file.log, but sending the output to Kafka instead of the console). Does anyone know of something like that?

Thanks!
Fernando.
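(For concreteness, a minimal sketch of the kind of agent Lakshmanan describes: an exec source running tail -F, buffering through a durable file channel, publishing to Kafka. All names, paths, and the broker address are illustrative, and the Kafka sink's property names vary across Flume versions; the ones below are Flume 1.7-era:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Exec source: run tail -F and turn each output line into an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
# File channel persists events to disk so a sink outage doesn't drop them
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data
# Kafka sink publishes each event as a message on the given topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092

The weak point is exactly the one described above: tail -F keeps no durable offset, so after a restart the source either re-reads from a guessed position or misses data.)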
Re: Resilient Producer
It sounds like you are describing Flume, with a SpoolingDirectory source (or an exec source running tail) and a Kafka channel.
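(For concreteness, a minimal sketch of that combination. The Kafka channel writes events straight into a Kafka topic, so no separate sink is needed when Kafka itself is the destination. Paths, topic, and broker address are illustrative, and channel property names vary across Flume versions; these are 1.7-era:

a1.sources = r1
a1.channels = c1
# Spooling-directory source: ingests completed, immutable files
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/app/spool
a1.sources.r1.channels = c1
# Kafka channel doubles as the destination -- no sink required
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = localhost:9092
a1.channels.c1.kafka.topic = app-logs
# Ship raw payloads rather than Flume's Avro event wrapping, so
# non-Flume consumers can read the topic directly
a1.channels.c1.parseAsFlumeEvent = false

Note the spooldir source requires files to be closed and immutable once they land in the directory, which is why the rotation-policy discussion above matters.)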
Re: Resilient Producer
Logstash.

--
Colin Clark
+1 612 859 6129
Skype colin.p.clark
Re: Resilient Producer
The big syslog daemons have supported Kafka for a while now.

rsyslog: http://www.rsyslog.com/doc/master/configuration/modules/omkafka.html
syslog-ng: https://czanik.blogs.balabit.com/2015/01/syslog-ng-kafka-destination-support/#more-1013

And Bruce might be of interest as well: https://github.com/tagged/bruce

On the less daemon-y and more tool-y side of things there are:

https://github.com/fsaintjacques/tail-kafka
https://github.com/mguindin/tail-kafka
https://github.com/edenhill/kafkacat
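(Of the tools listed, kafkacat is the closest to the tail-file-to-Kafka one-liner Fernando asked for. A sketch, with the broker address and topic name as illustrative placeholders:

# -P = producer mode; kafkacat reads one message per line from stdin
tail -F /var/log/app/app.log | kafkacat -P -b localhost:9092 -t app-logs

Note this pipeline keeps no offset state: if it dies and restarts, tail -F picks up at the current end of the file, so anything written in between is not re-sent.)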
Re: resilient producer
+1. How about posting yours to GitHub? It sounds like a good contrib project.

On Jan 15, 2013, at 12:29 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote:

Hi,

In our current data-ingestion system, producers are resilient in the sense that if data cannot be reliably published (e.g., the network is down), it is spilled onto local disk. A separate process runs asynchronously and attempts to publish the spilled data. I am curious to hear what other people do in this case. Is there a plan to have something similar integrated into Kafka? (AFAIK, the current implementation gives up after a configurable number of retries.)

Thanks,
stan
Re: resilient producer
I can't speak for all users, but at LinkedIn we don't do this. We just run Kafka as a high-availability system (i.e., something not allowed to be down). These kinds of systems require more care, but we already have a number of such data systems.

We chose this approach because local queuing leads to disk/data-management problems on all producers (and we have thousands), and also to late data. Late data makes aggregation very hard, since there will always be more data coming, so the aggregate ends up not matching the base data. This has led us down a path of working on the reliability of the service itself rather than on a store-and-forward model. Likewise, the model itself doesn't necessarily work: as you get to thousands of producers, some of them will likely go hard down if the cluster has non-trivial periods of unavailability, and the data you queued locally is gone, since you have no fault tolerance for it.

So that was our rationale, but you could easily go the other way. There is nothing in Kafka that prevents producer-side queueing. I could imagine three possible implementations:

1. Many people who want this are basically doing log aggregation. If that is the case, the collector process on the machine would just pause its collecting while the cluster is unavailable.
2. Alternatively, it would be possible to embed the Kafka log (which is a standalone system) in the producer and use it for journalling in the case of errors. A background thread would then try to push the stored messages out.
3. One could just catch any exceptions the producer throws and implement (2) external to the Kafka client.

-Jay
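(To make option 3 concrete, a rough sketch of a producer wrapper that spills failed sends to local disk and replays them in a background pass. It uses the newer Java producer client, which post-dates this 2013 thread; the class, the spill-file format, and all names are illustrative, not an existing Kafka API:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.Properties;

// Sketch of "option 3": catch send failures, spill to local disk,
// replay later. Illustrative only -- not an existing Kafka API.
public class SpillingProducer implements AutoCloseable {
    private final Producer<String, String> producer;
    private final Path spillFile;

    public SpillingProducer(String bootstrapServers, Path spillFile) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
        this.spillFile = spillFile;
    }

    // Send asynchronously; on failure, append the message to the spill file.
    public void send(String topic, String value) {
        producer.send(new ProducerRecord<>(topic, value), (metadata, exception) -> {
            if (exception != null) {
                spill(topic, value);
            }
        });
    }

    // Naive record format: topic TAB value, one per line.
    // Assumes values contain no tabs or newlines.
    private synchronized void spill(String topic, String value) {
        try {
            Files.write(spillFile, Collections.singletonList(topic + "\t" + value),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // If the spill itself fails, the message is lost -- the
            // fault-tolerance gap in local queuing that Jay points out.
            e.printStackTrace();
        }
    }

    // Background pass: re-publish spilled messages, then remove the file.
    // Naive: assumes the resends succeed; a real version would re-spill failures.
    public synchronized void drainSpill() throws IOException {
        if (!Files.exists(spillFile)) {
            return;
        }
        for (String line : Files.readAllLines(spillFile)) {
            String[] parts = line.split("\t", 2);
            producer.send(new ProducerRecord<>(parts[0], parts[1]));
        }
        producer.flush();
        Files.delete(spillFile);
    }

    @Override
    public void close() {
        producer.close();
    }
}

As the thread discusses, this trades late data for possible data loss: if the disk holding the spill file dies, those messages are gone.)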
Re: resilient producer
Jay, thanks for your insight! More comments are below.

On Tue, Jan 15, 2013 at 3:18 PM, Jay Kreps jay.kr...@gmail.com wrote:

> We chose this approach because local queuing leads to disk/data-management problems on all producers (and we have thousands), and also to late data. Late data makes aggregation very hard ...

Yep, we're facing the same problem with respect to late data. I'd like to see alternative solutions to this problem, but I am afraid it's an undecidable problem in general.

> This has led us down a path of working on the reliability of the service itself rather than on a store-and-forward model. Likewise, the model itself doesn't necessarily work ... the data you queued locally is gone, since you have no fault tolerance for it.

Right. So you're essentially trading late data for potentially lost data?

> So that was our rationale, but you could easily go the other way. There is nothing in Kafka that prevents producer-side queueing. I could imagine three possible implementations ...

Option 2 sounds promising.