Re: Resilient Producer

2015-01-30 Thread Otis Gospodnetic
Fernando, have a look -
http://blog.sematext.com/2014/10/06/top-5-most-popular-log-shippers/

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Jan 28, 2015 at 1:39 PM, Fernando O.  wrote:

> Hi all,
> I'm evaluating using Kafka.
>
> I liked the way Facebook's Scribe lets you log to your own machine while a
> separate process forwards the messages to the central logger.
>
> With Kafka it seems that I have to embed the publisher in my app and
> handle any communication problems on the producer side.
>
> I googled quite a bit trying to find a project that basically runs a
> daemon that tails a log file and sends the lines to the Kafka cluster
> (something like tail file.log, but sending the output to Kafka instead of
> the console).
>
> Does anyone know of something like that?
>
>
> Thanks!
> Fernando.
>
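
A minimal sketch of the daemon Fernando asks for, assuming the kafka-python
client; the file path, topic, and broker address are placeholders, and log
rotation is deliberately not handled:

    # Follow a log file and forward each new line to Kafka, i.e. `tail -f`
    # feeding a producer. kafka-python is an assumption for illustration.
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    with open("/var/log/app/file.log", "rb") as f:
        f.seek(0, 2)  # start at the end of the file, as `tail -f` does
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)  # nothing new yet; poll again shortly
                continue
            producer.send("app-logs", line.rstrip(b"\n"))

Rotation and broker outages are exactly the hard parts the replies in this
thread discuss; this sketch covers only the happy path.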


Re: Resilient Producer

2015-01-29 Thread Lakshmanan Muthuraman
Thanks, David. This looks interesting. We will definitely test it out to
see whether it solves our problem.

On Thu, Jan 29, 2015 at 8:29 AM, David Morales  wrote:

> The existing "tail" source is not the best choice in your scenario, as
> you have pointed out.
>
> SpoolDir could be a solution if your log file rotation interval is very
> short (5 minutes, for example), but then you have to deal with a huge
> number of files in the folder (slower listings).
>
> There is a proposal for a new approach that combines the best of "tail"
> and "spoolDir". Take a look here:
>
> https://issues.apache.org/jira/browse/FLUME-2498
>
>
>
>
> 2015-01-29 0:24 GMT+01:00 Lakshmanan Muthuraman :
>
> > We have been using Flume to solve a very similar use case. Our servers
> > write the log files to a local file system, and then we have a Flume
> > agent that ships the data to Kafka.
> >
> > With Flume you can use an exec source running tail. Though the exec
> > source runs well with tail, there are issues if the agent goes down or
> > the file channel starts building up. If the agent goes down, you can
> > ask the Flume exec tail source to go back n lines or to read from the
> > beginning of the file. The challenge is that we roll our log files on a
> > daily basis: if the agent goes down in the evening, we have to
> > reprocess the entire day's worth of data, which slows down the data
> > flow. We can also go back an arbitrary number of lines, but then we
> > don't know what the right number is. That is the kind of challenge we
> > face. We have tried the spooling directory source, which works, but it
> > requires a different log file rotation policy. We even considered
> > rotating files every minute, but that would still delay the real-time
> > data flow in our kafka--->storm-->Elasticsearch pipeline by a minute.
> >
> > We are going to do a PoC on Logstash to see whether it solves the
> > problems we have with Flume.
> >
> > On Wed, Jan 28, 2015 at 10:39 AM, Fernando O.  wrote:
> >
> > > Hi all,
> > > I'm evaluating using Kafka.
> > >
> > > I liked the way Facebook's Scribe lets you log to your own machine
> > > while a separate process forwards the messages to the central logger.
> > >
> > > With Kafka it seems that I have to embed the publisher in my app and
> > > handle any communication problems on the producer side.
> > >
> > > I googled quite a bit trying to find a project that basically runs a
> > > daemon that tails a log file and sends the lines to the Kafka cluster
> > > (something like tail file.log, but sending the output to Kafka instead
> > > of the console).
> > >
> > > Does anyone know of something like that?
> > >
> > >
> > > Thanks!
> > > Fernando.
> > >
> >
>
>
>


Re: Resilient Producer

2015-01-29 Thread David Morales
The existing "tail" source is not the best choice in your scenario, as you
have pointed out.

SpoolDir could be a solution if your log file rotation interval is very
short (5 minutes, for example), but then you have to deal with a huge
number of files in the folder (slower listings).

There is a proposal for a new approach that combines the best of "tail"
and "spoolDir". Take a look here:

https://issues.apache.org/jira/browse/FLUME-2498
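
For context, FLUME-2498 later shipped as the TAILDIR source in Flume 1.7; a
hedged sketch of such an agent, with illustrative names and paths (property
names as in the 1.7 documentation):

    # TAILDIR tails files matching a pattern and, unlike an exec source
    # running tail, persists its read offsets across agent restarts.
    a1.sources = r1
    a1.channels = c1
    a1.sources.r1.type = TAILDIR
    a1.sources.r1.channels = c1
    a1.sources.r1.filegroups = fg1
    a1.sources.r1.filegroups.fg1 = /var/log/app/.*log
    a1.sources.r1.positionFile = /var/lib/flume/taildir_position.json
    a1.channels.c1.type = memory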




2015-01-29 0:24 GMT+01:00 Lakshmanan Muthuraman :

> We have been using Flume to solve a very similar use case. Our servers
> write the log files to a local file system, and then we have a Flume
> agent that ships the data to Kafka.
>
> With Flume you can use an exec source running tail. Though the exec source
> runs well with tail, there are issues if the agent goes down or the file
> channel starts building up. If the agent goes down, you can ask the Flume
> exec tail source to go back n lines or to read from the beginning of the
> file. The challenge is that we roll our log files on a daily basis: if the
> agent goes down in the evening, we have to reprocess the entire day's
> worth of data, which slows down the data flow. We can also go back an
> arbitrary number of lines, but then we don't know what the right number
> is. That is the kind of challenge we face. We have tried the spooling
> directory source, which works, but it requires a different log file
> rotation policy. We even considered rotating files every minute, but that
> would still delay the real-time data flow in our
> kafka--->storm-->Elasticsearch pipeline by a minute.
>
> We are going to do a PoC on Logstash to see whether it solves the
> problems we have with Flume.
>
> On Wed, Jan 28, 2015 at 10:39 AM, Fernando O.  wrote:
>
> > Hi all,
> > I'm evaluating using Kafka.
> >
> > I liked the way Facebook's Scribe lets you log to your own machine while
> > a separate process forwards the messages to the central logger.
> >
> > With Kafka it seems that I have to embed the publisher in my app and
> > handle any communication problems on the producer side.
> >
> > I googled quite a bit trying to find a project that basically runs a
> > daemon that tails a log file and sends the lines to the Kafka cluster
> > (something like tail file.log, but sending the output to Kafka instead
> > of the console).
> >
> > Does anyone know of something like that?
> >
> >
> > Thanks!
> > Fernando.
> >
>



-- 

David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // @stratiobd


Re: Resilient Producer

2015-01-28 Thread Lakshmanan Muthuraman
We have been using Flume to solve a very similar use case. Our servers write
the log files to a local file system, and then we have a Flume agent that
ships the data to Kafka.

With Flume you can use an exec source running tail. Though the exec source
runs well with tail, there are issues if the agent goes down or the file
channel starts building up. If the agent goes down, you can ask the Flume
exec tail source to go back n lines or to read from the beginning of the
file. The challenge is that we roll our log files on a daily basis: if the
agent goes down in the evening, we have to reprocess the entire day's worth
of data, which slows down the data flow. We can also go back an arbitrary
number of lines, but then we don't know what the right number is. That is
the kind of challenge we face. We have tried the spooling directory source,
which works, but it requires a different log file rotation policy. We even
considered rotating files every minute, but that would still delay the
real-time data flow in our kafka--->storm-->Elasticsearch pipeline by a
minute.

We are going to do a PoC on Logstash to see whether it solves the problems
we have with Flume.
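
For reference, a minimal sketch of the exec-source setup described above
(agent, channel, and path names are illustrative). Because the source just
runs tail and tracks no offsets, a crashed agent has no record of where it
stopped, which is exactly the recovery problem described:

    # Exec source running tail; simple, but positions are not persisted.
    a1.sources = r1
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1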

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O.  wrote:

> Hi all,
> I'm evaluating using Kafka.
>
> I liked the way Facebook's Scribe lets you log to your own machine while a
> separate process forwards the messages to the central logger.
>
> With Kafka it seems that I have to embed the publisher in my app and
> handle any communication problems on the producer side.
>
> I googled quite a bit trying to find a project that basically runs a
> daemon that tails a log file and sends the lines to the Kafka cluster
> (something like tail file.log, but sending the output to Kafka instead of
> the console).
>
> Does anyone know of something like that?
>
>
> Thanks!
> Fernando.
>


Re: Resilient Producer

2015-01-28 Thread Magnus Edenhill
The big syslog daemons have supported Kafka for a while now.

rsyslog:
http://www.rsyslog.com/doc/master/configuration/modules/omkafka.html

syslog-ng:
https://czanik.blogs.balabit.com/2015/01/syslog-ng-kafka-destination-support/#more-1013

And Bruce might be of interest as well:
https://github.com/tagged/bruce


On the less daemon-y and more tool-y side of things, there are:

https://github.com/fsaintjacques/tail-kafka
https://github.com/mguindin/tail-kafka
https://github.com/edenhill/kafkacat


2015-01-28 19:47 GMT+01:00 Gwen Shapira :

> It sounds like you are describing Flume, with a SpoolingDirectory source
> (or an exec source running tail) and a Kafka channel.
>
> On Wed, Jan 28, 2015 at 10:39 AM, Fernando O.  wrote:
> > Hi all,
> > I'm evaluating using Kafka.
> >
> > I liked the way Facebook's Scribe lets you log to your own machine while
> > a separate process forwards the messages to the central logger.
> >
> > With Kafka it seems that I have to embed the publisher in my app and
> > handle any communication problems on the producer side.
> >
> > I googled quite a bit trying to find a project that basically runs a
> > daemon that tails a log file and sends the lines to the Kafka cluster
> > (something like tail file.log, but sending the output to Kafka instead
> > of the console).
> >
> > Does anyone know of something like that?
> >
> >
> > Thanks!
> > Fernando.
>


Re: Resilient Producer

2015-01-28 Thread Colin
Logstash

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Jan 28, 2015, at 10:47 AM, Gwen Shapira  wrote:
> 
> It sounds like you are describing Flume, with a SpoolingDirectory source
> (or an exec source running tail) and a Kafka channel.
> 
>> On Wed, Jan 28, 2015 at 10:39 AM, Fernando O.  wrote:
>> Hi all,
>> I'm evaluating using Kafka.
>> 
>> I liked the way Facebook's Scribe lets you log to your own machine while
>> a separate process forwards the messages to the central logger.
>>
>> With Kafka it seems that I have to embed the publisher in my app and
>> handle any communication problems on the producer side.
>>
>> I googled quite a bit trying to find a project that basically runs a
>> daemon that tails a log file and sends the lines to the Kafka cluster
>> (something like tail file.log, but sending the output to Kafka instead
>> of the console).
>>
>> Does anyone know of something like that?
>> 
>> 
>> Thanks!
>> Fernando.


Re: Resilient Producer

2015-01-28 Thread Gwen Shapira
It sounds like you are describing Flume, with a SpoolingDirectory source
(or an exec source running tail) and a Kafka channel.
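
A hedged sketch of the agent Gwen describes: a spooling-directory source
feeding a Kafka channel, so no separate sink is needed when Kafka itself is
the destination. Property names follow the Flume 1.6-era Kafka channel and
may differ in other releases; paths, topic, and addresses are placeholders:

    a1.sources = r1
    a1.channels = c1
    # Spooling-directory source: ingests completed files dropped here.
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/app/spool
    a1.sources.r1.channels = c1
    # Kafka channel: events land directly in a Kafka topic.
    a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
    a1.channels.c1.brokerList = broker1:9092,broker2:9092
    a1.channels.c1.topic = app-logs
    a1.channels.c1.zookeeperConnect = zk1:2181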

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O.  wrote:
> Hi all,
> I'm evaluating using Kafka.
>
> I liked the way Facebook's Scribe lets you log to your own machine while a
> separate process forwards the messages to the central logger.
>
> With Kafka it seems that I have to embed the publisher in my app and
> handle any communication problems on the producer side.
>
> I googled quite a bit trying to find a project that basically runs a
> daemon that tails a log file and sends the lines to the Kafka cluster
> (something like tail file.log, but sending the output to Kafka instead of
> the console).
>
> Does anyone know of something like that?
>
>
> Thanks!
> Fernando.


Re: Resilient Producer

2015-01-28 Thread Fernando O.
Something like Heka but lightweight :D

On Wed, Jan 28, 2015 at 3:39 PM, Fernando O.  wrote:

> Hi all,
> I'm evaluating using Kafka.
>
> I liked the way Facebook's Scribe lets you log to your own machine while a
> separate process forwards the messages to the central logger.
>
> With Kafka it seems that I have to embed the publisher in my app and
> handle any communication problems on the producer side.
>
> I googled quite a bit trying to find a project that basically runs a
> daemon that tails a log file and sends the lines to the Kafka cluster
> (something like tail file.log, but sending the output to Kafka instead of
> the console).
>
> Does anyone know of something like that?
>
>
> Thanks!
> Fernando.
>


Re: resilient producer

2013-01-15 Thread Stan Rosenberg
On Tue, Jan 15, 2013 at 3:12 PM, Corbin Hoenes  wrote:

> +1. How about posting yours to GitHub?
> It sounds like a good contrib project.
>

There is nothing to post at the moment, as we're currently in the
requirements-gathering phase :) Potentially, we might have a contrib
project along the lines of options (2) and (3) from Jay's email.


Re: resilient producer

2013-01-15 Thread Stan Rosenberg
Jay,

Thanks for your insight! More comments are below.

On Tue, Jan 15, 2013 at 3:18 PM, Jay Kreps  wrote:

> I can't speak for all users, but at LinkedIn we don't do this. We just
> run Kafka as a high-availability system (i.e. something not allowed to be
> down). These kinds of systems require more care, but we already have a
> number of such data systems. We chose this approach because local queuing
> leads to disk/data management problems on all producers (and we have
> thousands) and also to late data. Late data makes aggregation very hard,
> since there will always be more data coming, so the aggregate ends up not
> matching the base data.
>

Yep, we're facing the same problem with respect to late data.  I'd like to
see alternative solutions to this problem, but I am afraid it's an
undecidable problem in general.


> This has led us to a path of working on reliability of the service itself
> rather than a store-and-forward model.
>
> Likewise, the model itself doesn't necessarily work: as you get to
> thousands of producers, some of them will likely go hard down if the
> cluster has non-trivial periods of non-availability, and the data you
> queued locally is gone since you have no fault-tolerance for that.
>

Right. So you're essentially trading late data for potentially lost data?



> So that was our rationale, but you could easily go the other way. There is
> nothing in Kafka that prevents producer-side queueing. I could imagine a
> few possible implementations:
> 1. Many people who want this are basically doing log aggregation. If this
> is the case, the collector process on the machine would just pause its
> collecting while the cluster is unavailable.
> 2. Alternatively, it would be possible to embed the Kafka log (which is a
> standalone system) in the producer and use it for journalling in the case
> of errors. Then a background thread could try to push these stored
> messages out.
> 3. One could just catch any exceptions the producer throws and implement
> (2) external to the Kafka client.
>

Option 2 sounds promising.


Re: resilient producer

2013-01-15 Thread Jay Kreps
I can't speak for all users, but at LinkedIn we don't do this. We just run
Kafka as a high-availability system (i.e. something not allowed to be
down). These kinds of systems require more care, but we already have a
number of such data systems. We chose this approach because local queuing
leads to disk/data management problems on all producers (and we have
thousands) and also to late data. Late data makes aggregation very hard,
since there will always be more data coming, so the aggregate ends up not
matching the base data. This has led us to a path of working on reliability
of the service itself rather than a store-and-forward model. Likewise, the
model itself doesn't necessarily work: as you get to thousands of
producers, some of them will likely go hard down if the cluster has
non-trivial periods of non-availability, and the data you queued locally is
gone since you have no fault-tolerance for that.

So that was our rationale, but you could easily go the other way. There is
nothing in Kafka that prevents producer-side queueing. I could imagine a
few possible implementations:
1. Many people who want this are basically doing log aggregation. If this
is the case, the collector process on the machine would just pause its
collecting while the cluster is unavailable.
2. Alternatively, it would be possible to embed the Kafka log (which is a
standalone system) in the producer and use it for journalling in the case
of errors. Then a background thread could try to push these stored messages
out.
3. One could just catch any exceptions the producer throws and implement
(2) external to the Kafka client.

-Jay


On Tue, Jan 15, 2013 at 11:29 AM, Stan Rosenberg wrote:

> Hi,
>
> In our current data ingestion system, producers are resilient in the
> sense that if data cannot be reliably published (e.g., the network is
> down), it is spilled onto local disk. A separate process runs
> asynchronously and attempts to publish the spilled data. I am curious to
> hear what other people do in this case. Is there a plan to have something
> similar integrated into Kafka? (AFAIK, the current implementation gives
> up after a configurable number of retries.)
>
> Thanks,
>
> stan
>
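
A rough sketch of Jay's options (2)/(3): catch send failures, journal the
payloads to a local file, and let a background thread retry them later.
kafka-python, the spill path, and the timings are assumptions for
illustration, not a hardened implementation:

    import os
    import threading
    import time

    from kafka import KafkaProducer
    from kafka.errors import KafkaError

    SPILL_PATH = "/var/spool/app/kafka-spill.log"  # hypothetical location

    class SpillingProducer:
        def __init__(self, brokers, topic):
            self.producer = KafkaProducer(bootstrap_servers=brokers)
            self.topic = topic
            self.lock = threading.Lock()
            threading.Thread(target=self._drain, daemon=True).start()

        def send(self, payload: bytes):
            # Payloads are assumed newline-free: one message per line.
            try:
                # Block briefly so failures surface here, not asynchronously.
                self.producer.send(self.topic, payload).get(timeout=5)
            except KafkaError:
                with self.lock, open(SPILL_PATH, "ab") as f:
                    f.write(payload + b"\n")

        def _drain(self):
            # Background thread: periodically re-send spilled messages.
            while True:
                time.sleep(30)
                with self.lock:
                    if not os.path.exists(SPILL_PATH):
                        continue
                    with open(SPILL_PATH, "rb") as f:
                        pending = f.read().splitlines()
                    os.remove(SPILL_PATH)
                for payload in pending:
                    self.send(payload)  # failures are simply re-spilled

As Jay notes, this trades late data for the risk of losing whatever sits in
the local journal if the producer's disk dies.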


Re: resilient producer

2013-01-15 Thread Corbin Hoenes
+1. How about posting yours to GitHub?
It sounds like a good contrib project.

Sent from my iPhone

On Jan 15, 2013, at 12:29 PM, Stan Rosenberg  wrote:

> Hi,
> 
> In our current data ingestion system, producers are resilient in the
> sense that if data cannot be reliably published (e.g., the network is
> down), it is spilled onto local disk. A separate process runs
> asynchronously and attempts to publish the spilled data. I am curious to
> hear what other people do in this case. Is there a plan to have something
> similar integrated into Kafka? (AFAIK, the current implementation gives
> up after a configurable number of retries.)
> 
> Thanks,
> 
> stan