Hello everyone,

I'm interested in using Heka for our metrics pipeline. Basically, there would be three stages:

1. Ingest: publicly exposed endpoints that authenticate sessions, take in metrics data, and pump it into RabbitMQ.
2. Pull data out of RabbitMQ as quickly as possible (I hear it doesn't do well with large buffers of data and slow acks) and safely (backed by a disk buffer) move the data into KairosDB. This is where Heka would fit in.
3. KairosDB (Cassandra-based) is our time-series storage system.
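For the curious, stage 2 might look roughly like the following Heka TOML config. This is only a hedged sketch: `AMQPInput`, `HttpOutput`, and the `[.buffering]` settings are real Heka plugins/options as I understand them, but `KairosDBEncoder`, the hostnames, and the specific values are placeholders I made up, not working config.

```toml
# Sketch only -- hostnames, encoder name, and values are assumptions.
[AMQPInput]
url = "amqp://guest:guest@rabbitmq.example.com:5672/"
exchange = "metrics"
exchange_type = "fanout"
queue = "metrics"
prefetch_count = 50           # bounds how many messages are in flight (unacked) from Rabbit

[KairosDBOutput]
type = "HttpOutput"           # or a custom output wrapping KairosDB's REST API
message_matcher = "Type == 'metric'"
address = "http://kairosdb.example.com:8080/api/v1/datapoints"
encoder = "KairosDBEncoder"   # hypothetical encoder, the plugin being written

[KairosDBOutput.buffering]    # disk-backed buffer in front of the output
max_file_size = 134217728     # 128 MiB per queue file
max_buffer_size = 8589934592  # 8 GiB total on disk
full_action = "shutdown"
cursor_update_count = 50      # checkpoint the read cursor every 50 processed messages
```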
Long term, I think it would be nice to remove as many moving parts as possible (our own ingest endpoints, RabbitMQ, KairosDB) and put Heka straight between the user and Cassandra, but that would be a future project. I have started work on a Heka output and encoder plugin for KairosDB's REST endpoint. I have a few questions, though:

1. In case of any failures, we would rather have duplicate deliveries than lost messages. How do we minimize lost messages? Is it true that every piece of the pipeline (RabbitMQ input, router, KairosDB output) has a channel with a buffer of 50? Or can we ack to RabbitMQ only once the message has been synced to disk? If so, how long should that take, and how many messages could be in flight (out of RabbitMQ but not yet acked; I hear RabbitMQ doesn't deal with this too well)? I couldn't find any config option for the disk buffer that controls how often, or after how much data, sync() is called.

2. After a long KairosDB outage, with a full disk buffer, is it possible to prioritize delivery of real-time messages and backfill historical data at a rate that is either automatically managed based on KairosDB backpressure or operator-controlled? Ideally we'd like to ensure real-time messages always get delivered promptly while historical data gets backfilled using "spare capacity". From what I can see, everything goes through the disk buffer FIFO-style. This is not a showstopper, though: if this scenario manifests, we can spin up new Hekas and send all new real-time data to them, then relaunch the old Hekas with full disk buffers so that they receive no new data and drain at a limited rate, configured as part of the KairosDB output plugin or derived from timing how long the POSTs take.

3. If we have a Heka in this "send-from-disk-buffer-only" mode, we should be able to safely kill and restart it, right? Assuming the KairosDB output is properly written to update the cursor once it gets the ack from KairosDB.

4. With the disk buffer's full_action = shutdown, the docs say "Heka will stop all processing and attempt a clean shutdown". Does this generally work fine, or are you aware of cases where it can cause trouble?

5. Is it possible for the disk buffer to become corrupted (bad hard drive/filesystem/cosmic rays aside)?

6. Any other potential gotchas to consider?

Thanks!
Dieter

_______________________________________________
Heka mailing list
Heka@mozilla.org
https://mail.mozilla.org/listinfo/heka