Hello everyone,

I'm interested in using Heka for our metrics pipeline. Basically, there would be three stages:

1. Ingest: publicly exposed endpoints that authenticate sessions, take in metrics data, and pump it into RabbitMQ.
2. Pull data out of RabbitMQ as quickly as possible (I hear it doesn't do well with large buffers of data and slow acks) and safely (backed by a disk buffer) move the data into KairosDB. This is where Heka would fit in.
3. KairosDB (Cassandra-based) is our time-series storage system.
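For the curious, stage 2 might look roughly like the following Heka TOML config. This is only a hedged sketch: `AMQPInput`, `HttpOutput`, and the `[.buffering]` settings are real Heka plugins/options as I understand them, but `KairosDBEncoder`, the hostnames, and the specific values are placeholders I made up, not working config.

```toml
# Sketch only -- hostnames, encoder name, and values are assumptions.
[AMQPInput]
url = "amqp://guest:guest@rabbitmq.example.com:5672/"
exchange = "metrics"
exchange_type = "fanout"
queue = "metrics"
prefetch_count = 50           # bounds how many messages are in flight (unacked) from Rabbit

[KairosDBOutput]
type = "HttpOutput"           # or a custom output wrapping KairosDB's REST API
message_matcher = "Type == 'metric'"
address = "http://kairosdb.example.com:8080/api/v1/datapoints"
encoder = "KairosDBEncoder"   # hypothetical encoder, the plugin being written

[KairosDBOutput.buffering]    # disk-backed buffer in front of the output
max_file_size = 134217728     # 128 MiB per queue file
max_buffer_size = 8589934592  # 8 GiB total on disk
full_action = "shutdown"
cursor_update_count = 50      # checkpoint the read cursor every 50 processed messages
```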
Long term, I think it would be nice to remove as many moving parts as possible (our own ingest endpoints, RabbitMQ, KairosDB) and put Heka straight between the user and Cassandra, but that would be a future project. I have started work on a Heka output and encoder plugin for KairosDB's REST endpoint. I have a few questions, though:

1. In case of any failures, we would rather have duplicate deliveries than lost messages. How do we minimize lost messages? Is it true that every piece of the pipeline (RabbitMQ input, router, KairosDB output) has a channel with a buffer of 50? Or can we ack to RabbitMQ only once the message has been synced to disk? If so, how long should that take, and how many messages could be in flight (out of RabbitMQ but not yet acked; I hear RabbitMQ doesn't deal with this too well)? I couldn't find any config option for the disk buffer that controls how often, or after how much data, sync() is called.

2. After a long KairosDB outage, with a full disk buffer, is it possible to prioritize delivery of real-time messages and backfill historical data at a rate that is either automatically managed based on KairosDB backpressure or operator-controlled? Ideally we'd like to ensure real-time messages always get delivered promptly while historical data gets backfilled using "spare capacity". From what I can see, everything goes through the disk buffer FIFO-style. This is not a showstopper, though: if this scenario manifests, we can spin up new Hekas and send all new real-time data to them, then relaunch the old Hekas with full disk buffers so that they receive no new data and drain at a limited rate, configured as part of the KairosDB output plugin or derived from timing how long the POSTs take.

3. If we have a Heka in this "send-from-disk-buffer-only" mode, we should be able to safely kill and restart it, right? Assuming the KairosDB output is properly written to update the cursor once it gets the ack from KairosDB.

4. With the disk buffer's full_action = shutdown, the docs say "Heka will stop all processing and attempt a clean shutdown". Does this generally work fine, or are you aware of cases where it can cause trouble?

5. Is it possible for the disk buffer to become corrupted (bad hard drive/filesystem/cosmic rays aside)?

6. Any other potential gotchas to consider?

Thanks!
Dieter

_______________________________________________
Heka mailing list
Heka@mozilla.org
https://mail.mozilla.org/listinfo/heka