Re: Resilient Producer

2015-01-29 Thread David Morales
The existing tail source is not the best choice in your scenario, as you have
pointed out.

SpoolDir could be a solution if your log file rotation interval is very short
(5 minutes, for example), but then you have to deal with a huge number of
files in the folder (and therefore slower directory listings).

There is a proposal for a new approach, something that combines the best of
tail and spoolDir. Take a look here:

https://issues.apache.org/jira/browse/FLUME-2498
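A sketch of the kind of configuration proposed there (property names and
paths are hypothetical, assuming it lands as a taildir-style source):

a1.sources = r1
a1.sources.r1.type = TAILDIR
# A position file persists read offsets, so the source can survive agent
# restarts and file rotations without re-reading or losing data.
a1.sources.r1.positionFile = /var/lib/flume/taildir_position.json
# Tail every file matching the pattern in place; no spool-dir renaming.
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
a1.sources.r1.channels = c1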




2015-01-29 0:24 GMT+01:00 Lakshmanan Muthuraman lakshma...@tokbox.com:

 We have been using Flume to solve a very similar use case. Our servers write
 the log files to a local file system, and then we have a Flume agent which
 ships the data to Kafka.

 Flume can be used with an exec source running tail. Though the exec source
 runs well with tail, there are issues if the agent goes down or the file
 channel starts building up. If the agent goes down, you can ask the Flume
 exec tail source to go back n lines or to read from the beginning of the
 file. The challenge is that we roll our log files on a daily basis. What if
 the agent goes down in the evening? We would need to go back over the entire
 day's worth of data for reprocessing, which slows down the data flow. We can
 also go back an arbitrary number of lines, but then we don't know what the
 right number to go back is. This is the kind of challenge we face. We have
 tried the spooling directory source, which works, but we would need a
 different log file rotation policy. We even considered rotating files every
 minute, but that would still delay the real-time data flow in our Kafka ->
 Storm -> Elasticsearch pipeline by a minute.

 We are going to do a PoC on Logstash to see whether it solves these problems
 with Flume.

 On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:

  Hi all,
  I'm evaluating using Kafka.

  I liked the Facebook Scribe model, where you log to your own machine and
  then a separate process forwards the messages to the central logger.

  With Kafka it seems that I have to embed the publisher in my app and deal
  with any communication problems on the producer side.

  I googled quite a bit trying to find a project that basically runs a daemon
  that tails a log file and sends the lines to the Kafka cluster (something
  like tail -f file.log, but instead of redirecting the output to the console
  it sends it to Kafka).

  Does anyone know of something like that?


  Thanks!
  Fernando.
 




-- 

David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
https://twitter.com/dmoralesdf


http://www.stratio.com/
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // @stratiobd
https://twitter.com/StratioBD


Re: Resilient Producer

2015-01-29 Thread Lakshmanan Muthuraman
Thanks, David. This looks interesting. We will definitely test it out to see
whether it solves our problem.

On Thu, Jan 29, 2015 at 8:29 AM, David Morales dmora...@stratio.com wrote:

 The existing tail source is not the best choice in your scenario, as you have
 pointed out.

 SpoolDir could be a solution if your log file rotation interval is very short
 (5 minutes, for example), but then you have to deal with a huge number of
 files in the folder (and therefore slower directory listings).

 There is a proposal for a new approach, something that combines the best of
 tail and spoolDir. Take a look here:

 https://issues.apache.org/jira/browse/FLUME-2498




 2015-01-29 0:24 GMT+01:00 Lakshmanan Muthuraman lakshma...@tokbox.com:

  We have been using Flume to solve a very similar use case. Our servers write
  the log files to a local file system, and then we have a Flume agent which
  ships the data to Kafka.

  Flume can be used with an exec source running tail. Though the exec source
  runs well with tail, there are issues if the agent goes down or the file
  channel starts building up. If the agent goes down, you can ask the Flume
  exec tail source to go back n lines or to read from the beginning of the
  file. The challenge is that we roll our log files on a daily basis. What if
  the agent goes down in the evening? We would need to go back over the entire
  day's worth of data for reprocessing, which slows down the data flow. We can
  also go back an arbitrary number of lines, but then we don't know what the
  right number to go back is. This is the kind of challenge we face. We have
  tried the spooling directory source, which works, but we would need a
  different log file rotation policy. We even considered rotating files every
  minute, but that would still delay the real-time data flow in our Kafka ->
  Storm -> Elasticsearch pipeline by a minute.

  We are going to do a PoC on Logstash to see whether it solves these problems
  with Flume.
 
  On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:
 
   Hi all,
   I'm evaluating using Kafka.

   I liked the Facebook Scribe model, where you log to your own machine and
   then a separate process forwards the messages to the central logger.

   With Kafka it seems that I have to embed the publisher in my app and deal
   with any communication problems on the producer side.

   I googled quite a bit trying to find a project that basically runs a daemon
   that tails a log file and sends the lines to the Kafka cluster (something
   like tail -f file.log, but instead of redirecting the output to the console
   it sends it to Kafka).

   Does anyone know of something like that?


   Thanks!
   Fernando.
  
 



 --

 David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
 https://twitter.com/dmoralesdf


 http://www.stratio.com/
 Vía de las dos Castillas, 33, Ática 4, 3ª Planta
 28224 Pozuelo de Alarcón, Madrid
 Tel: +34 91 828 6473 // www.stratio.com // @stratiobd
 https://twitter.com/StratioBD



Re: Resilient Producer

2015-01-28 Thread Lakshmanan Muthuraman
We have been using Flume to solve a very similar use case. Our servers write
the log files to a local file system, and then we have a Flume agent which
ships the data to Kafka.

Flume can be used with an exec source running tail. Though the exec source
runs well with tail, there are issues if the agent goes down or the file
channel starts building up. If the agent goes down, you can ask the Flume
exec tail source to go back n lines or to read from the beginning of the
file. The challenge is that we roll our log files on a daily basis. What if
the agent goes down in the evening? We would need to go back over the entire
day's worth of data for reprocessing, which slows down the data flow. We can
also go back an arbitrary number of lines, but then we don't know what the
right number to go back is. This is the kind of challenge we face. We have
tried the spooling directory source, which works, but we would need a
different log file rotation policy. We even considered rotating files every
minute, but that would still delay the real-time data flow in our Kafka ->
Storm -> Elasticsearch pipeline by a minute.

We are going to do a PoC on Logstash to see whether it solves these problems
with Flume.
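For concreteness, the exec-source setup described above looks roughly like
this (a sketch; the agent name and paths are hypothetical):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
# tail -F follows the file across rotations, but the exec source offers no
# delivery guarantees: data read but not yet delivered is lost if the
# agent dies.
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1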

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:

 Hi all,
 I'm evaluating using Kafka.

 I liked the Facebook Scribe model, where you log to your own machine and
 then a separate process forwards the messages to the central logger.

 With Kafka it seems that I have to embed the publisher in my app and deal
 with any communication problems on the producer side.

 I googled quite a bit trying to find a project that basically runs a daemon
 that tails a log file and sends the lines to the Kafka cluster (something
 like tail -f file.log, but instead of redirecting the output to the console
 it sends it to Kafka).

 Does anyone know of something like that?


 Thanks!
 Fernando.



Re: Resilient Producer

2015-01-28 Thread Gwen Shapira
It sounds like you are describing Flume, with a SpoolingDirectory source
(or an exec source running tail) and a Kafka channel.
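A minimal sketch of that setup (a sketch only; Kafka channel property names
vary across early versions, and the agent name, paths, broker, and topic
here are hypothetical):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir
# Files must be closed and moved into the spool directory before Flume
# picks them up, which is why the rotation policy matters so much here.
a1.sources.r1.spoolDir = /var/log/app/spool
a1.sources.r1.channels = c1
# The Kafka channel writes events straight into a Kafka topic, so no
# separate sink is needed for this use case.
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.brokerList = broker1:9092
a1.channels.c1.zookeeperConnect = zk1:2181
a1.channels.c1.topic = app-logs
a1.channels.c1.parseAsFlumeEvent = false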

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:
 Hi all,
 I'm evaluating using Kafka.

 I liked the Facebook Scribe model, where you log to your own machine and
 then a separate process forwards the messages to the central logger.

 With Kafka it seems that I have to embed the publisher in my app and deal
 with any communication problems on the producer side.

 I googled quite a bit trying to find a project that basically runs a daemon
 that tails a log file and sends the lines to the Kafka cluster (something
 like tail -f file.log, but instead of redirecting the output to the console
 it sends it to Kafka).

 Does anyone know of something like that?


 Thanks!
 Fernando.


Re: Resilient Producer

2015-01-28 Thread Colin
Logstash

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

 On Jan 28, 2015, at 10:47 AM, Gwen Shapira gshap...@cloudera.com wrote:
 
 It sounds like you are describing Flume, with a SpoolingDirectory source
 (or an exec source running tail) and a Kafka channel.
 
 On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:
 Hi all,
 I'm evaluating using Kafka.

 I liked the Facebook Scribe model, where you log to your own machine and
 then a separate process forwards the messages to the central logger.

 With Kafka it seems that I have to embed the publisher in my app and deal
 with any communication problems on the producer side.

 I googled quite a bit trying to find a project that basically runs a daemon
 that tails a log file and sends the lines to the Kafka cluster (something
 like tail -f file.log, but instead of redirecting the output to the console
 it sends it to Kafka).

 Does anyone know of something like that?


 Thanks!
 Fernando.


Re: Resilient Producer

2015-01-28 Thread Magnus Edenhill
The big syslog daemons have supported Kafka for a while now.

rsyslog:
http://www.rsyslog.com/doc/master/configuration/modules/omkafka.html

syslog-ng:
https://czanik.blogs.balabit.com/2015/01/syslog-ng-kafka-destination-support/#more-1013

And Bruce might be of interest as well:
https://github.com/tagged/bruce


On the less daemon-y and more tool-y side of things, there are:

https://github.com/fsaintjacques/tail-kafka
https://github.com/mguindin/tail-kafka
https://github.com/edenhill/kafkacat
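At the truly minimal end, the core of what these tail-to-Kafka tools do fits
in a few lines of Python. A sketch assuming the kafka-python client (broker,
topic, and path are hypothetical); note it deliberately ignores file
rotation, which is the hard part discussed elsewhere in this thread:

import time

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("/var/log/app.log", "rb") as f:
    f.seek(0, 2)            # start at the end of the file, like tail -f
    while True:
        line = f.readline()
        if not line:        # nothing new yet; poll again shortly
            time.sleep(0.5)
            continue
        producer.send("app-logs", line.rstrip(b"\n"))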


2015-01-28 19:47 GMT+01:00 Gwen Shapira gshap...@cloudera.com:

 It sounds like you are describing Flume, with a SpoolingDirectory source
 (or an exec source running tail) and a Kafka channel.

 On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. fot...@gmail.com wrote:
  Hi all,
  I'm evaluating using Kafka.

  I liked the Facebook Scribe model, where you log to your own machine and
  then a separate process forwards the messages to the central logger.

  With Kafka it seems that I have to embed the publisher in my app and deal
  with any communication problems on the producer side.

  I googled quite a bit trying to find a project that basically runs a daemon
  that tails a log file and sends the lines to the Kafka cluster (something
  like tail -f file.log, but instead of redirecting the output to the console
  it sends it to Kafka).

  Does anyone know of something like that?


  Thanks!
  Fernando.



Re: resilient producer

2013-01-15 Thread Corbin Hoenes
+1. How about posting yours to GitHub?
Sounds like a good contrib project.

Sent from my iPhone

On Jan 15, 2013, at 12:29 PM, Stan Rosenberg stan.rosenb...@gmail.com wrote:

 Hi,
 
 In our current data ingestion system, producers are resilient in the sense
 that if data cannot be reliably published (e.g., the network is down), it is
 spilled onto local disk. A separate process runs asynchronously and attempts
 to publish the spilled data. I am curious to hear what other people do in
 this case. Is there a plan to have something similar integrated into Kafka?
 (AFAIK, the current implementation gives up after a configurable number of
 retries.)
 
 Thanks,
 
 stan


Re: resilient producer

2013-01-15 Thread Jay Kreps
I can't speak for all users, but at LinkedIn we don't do this. We just run
Kafka as a high-availability system (i.e. something not allowed to be
down). These kinds of systems require more care, but we already have a
number of such data systems. We chose this approach because local queuing
leads to disk/data management problems on all producers (and we have
thousands) and also to late data. Late data makes aggregation very hard,
since there will always be more data coming, so the aggregate ends up not
matching the base data. This has led us down a path of working on the
reliability of the service itself rather than a store-and-forward model.
Likewise, the model itself doesn't necessarily work: as you get to
thousands of producers, some of them will likely go hard down if the
cluster has non-trivial periods of unavailability, and the data they
queued locally is gone, since there is no fault tolerance for it.

So that was our rationale, but you could easily go the other way. There is
nothing in Kafka that prevents producer-side queueing. I could imagine a few
possible implementations:
1. Many people who want this are basically doing log aggregation. If this
is the case, the collector process on the machine would just pause its
collecting if the cluster is unavailable.
2. Alternately, it would be possible to embed the Kafka log (which is a
standalone system) in the producer and use it for journaling in the case
of errors. Then a background thread could try to push these stored
messages out.
3. One could just catch any exceptions the producer throws and implement
(2) externally to the Kafka client, as sketched below.
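A rough sketch of (3), assuming the kafka-python client (the spill path,
topic, and retry cadence are hypothetical):

import os
import threading
import time

from kafka import KafkaProducer
from kafka.errors import KafkaError

SPILL_PATH = "/var/spool/app/kafka-spill.log"
producer = KafkaProducer(bootstrap_servers="localhost:9092", retries=3)

def publish(message: bytes) -> None:
    # Send synchronously so failures surface here; on failure, spill the
    # record to local disk instead of dropping it.
    try:
        producer.send("events", message).get(timeout=10)
    except KafkaError:
        with open(SPILL_PATH, "ab") as spill:
            spill.write(message + b"\n")

def replay_spilled() -> None:
    # Background thread: periodically claim the spill file and retry its
    # records; anything that fails again is re-spilled by publish().
    while True:
        time.sleep(30)
        if not os.path.exists(SPILL_PATH):
            continue
        batch = SPILL_PATH + ".replay"
        os.rename(SPILL_PATH, batch)
        with open(batch, "rb") as spill:
            for line in spill:
                publish(line.rstrip(b"\n"))
        os.remove(batch)

threading.Thread(target=replay_spilled, daemon=True).start()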

-Jay


On Tue, Jan 15, 2013 at 11:29 AM, Stan Rosenberg
stan.rosenb...@gmail.comwrote:

 Hi,

 In our current data ingestion system, producers are resilient in the sense
 that if data cannot be reliably published (e.g., the network is down), it is
 spilled onto local disk. A separate process runs asynchronously and attempts
 to publish the spilled data. I am curious to hear what other people do in
 this case. Is there a plan to have something similar integrated into Kafka?
 (AFAIK, the current implementation gives up after a configurable number of
 retries.)

 Thanks,

 stan



Re: resilient producer

2013-01-15 Thread Stan Rosenberg
Jay,

Thanks for your insight! More comments are below.

On Tue, Jan 15, 2013 at 3:18 PM, Jay Kreps jay.kr...@gmail.com wrote:

 I can't speak for all users, but at LinkedIn we don't do this. We just run
 Kafka as a high-availability system (i.e. something not allowed to be
 down). These kinds of systems require more care, but we already have a
 number of such data systems. We chose this approach because local queuing
 leads to disk/data management problems on all producers (and we have
 thousands) and also to late data. Late data makes aggregation very hard,
 since there will always be more data coming, so the aggregate ends up not
 matching the base data.


Yep, we're facing the same problem with respect to late data.  I'd like to
see alternative solutions to this problem, but I am afraid it's an
undecidable problem in general.


 This has led us down a path of working on the reliability of the service
 itself rather than a store-and-forward model.

 Likewise, the model itself doesn't necessarily work: as you get to
 thousands of producers, some of them will likely go hard down if the
 cluster has non-trivial periods of unavailability, and the data they
 queued locally is gone, since there is no fault tolerance for it.


Right.  So, you're essentially trading late data for potentially lost data?



 So that was our rationale, but you could easily go the other way. There is
 nothing in Kafka that prevents producer-side queueing. I could imagine a few
 possible implementations:
 1. Many people who want this are basically doing log aggregation. If this
 is the case, the collector process on the machine would just pause its
 collecting if the cluster is unavailable.
 2. Alternately, it would be possible to embed the Kafka log (which is a
 standalone system) in the producer and use it for journaling in the case
 of errors. Then a background thread could try to push these stored
 messages out.
 3. One could just catch any exceptions the producer throws and implement
 (2) externally to the Kafka client.


Option 2 sounds promising.
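For the record, a toy sketch of what (2) could look like: journal every
message locally first, and push from a background thread. This assumes the
kafka-python client as a stand-in for an embedded Kafka log, and the paths
and topic are hypothetical:

import os
import threading
import time

from kafka import KafkaProducer
from kafka.errors import KafkaError

JOURNAL = "/var/spool/app/journal.log"
OFFSET = JOURNAL + ".offset"

def append(message: bytes) -> None:
    # Producer-side write path: journal locally and return immediately.
    with open(JOURNAL, "ab") as j:
        j.write(message + b"\n")

def pusher() -> None:
    # Background thread: push journaled messages to Kafka, persisting the
    # byte offset of the last acknowledged record so replay resumes there.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    while True:
        time.sleep(5)
        if not os.path.exists(JOURNAL):
            continue
        pos = int(open(OFFSET).read()) if os.path.exists(OFFSET) else 0
        with open(JOURNAL, "rb") as j:
            j.seek(pos)
            for line in j:
                try:
                    producer.send("events", line.rstrip(b"\n")).get(timeout=10)
                except KafkaError:
                    break  # broker unavailable; retry on the next cycle
                with open(OFFSET, "w") as o:
                    o.write(str(j.tell()))

threading.Thread(target=pusher, daemon=True).start()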