I'd definitely recommend against writing a custom splitter.

TokenSplitter already supports a `count` setting, so if your chunk size doesn't need to be precise and your line lengths don't vary much, you could just set `count` to a value that gets you close to 1MB per record and be done.
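For example, if your lines average roughly 100 bytes, something like the following would get you records in the neighborhood of 1MB (the section names, paths, and count value here are just illustrative):

    [my_logs]
    type = "LogstreamerInput"
    log_directory = "/var/log/myapp"
    file_match = '.*\.log'
    splitter = "chunk_splitter"

    [chunk_splitter]
    type = "TokenSplitter"
    # delimiter defaults to "\n", so this emits one record per 10k lines
    count = 10000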

Otherwise a filter would be fine, and it'd be quite simple. I'd recommend sizing each line as you receive it and keeping a running total of the current size. I'd also recommend storing your current size and the buffered data in global variables and using `preserve_data = true` so your buffer will survive restarts.
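If you go that route, a rough sketch of the filter as a Lua sandbox might look like this (untested, and all names are just illustrative):

    -- Buffers incoming log lines and emits a payload once ~1MB has
    -- accumulated. buffer and buffer_size are globals so that
    -- preserve_data = true carries them across restarts.
    buffer = {}
    buffer_size = 0

    local max_size = 1024 * 1024 -- flush threshold: 1MB

    local function flush()
        if buffer_size == 0 then return end
        inject_payload("txt", "chunk", table.concat(buffer, "\n"))
        buffer = {}
        buffer_size = 0
    end

    function process_message()
        local line = read_message("Payload") or ""
        buffer[#buffer + 1] = line
        buffer_size = buffer_size + #line + 1 -- +1 for the newline
        if buffer_size >= max_size then
            flush()
        end
        return 0
    end

    function timer_event(ns)
        -- Periodic flush so a slow stream doesn't sit in the buffer
        -- forever.
        flush()
    end

You'd wire it up with something like this, where `ticker_interval` controls how often `timer_event` fires:

    [chunking_filter]
    type = "SandboxFilter"
    filename = "lua_filters/chunker.lua"
    message_matcher = "Logger == 'my_logs'"
    ticker_interval = 60
    preserve_data = true

To avoid crossing your streams, you'd key the buffer by whatever field distinguishes them (e.g. Logger) instead of using one flat list, or run one filter instance per stream with a narrower message_matcher.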

In either case, you're going to want to increase your max_message_size from the default of 64KiB. Also, you'll want to make sure you don't try to send your data out over UDP, since most UDP stacks only support packets up to 64KiB in size.
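For reference, that's the global option in the `[hekad]` section of your config; the exact value is up to you, just leave some headroom over your 1MB payloads:

    [hekad]
    max_message_size = 4194304  # 4MiB, room to spare over 1MB chunks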

-r


On 02/16/2016 04:42 PM, Eli Flesher wrote:
Hi Everyone,

I’m using Heka to ingest logs from a large cluster of boxes and then
move them off to an API. I have a working prototype, but right now I’m
using the LogStreamer and default TokenSplitter to create a message per
log line.

This is not optimal, given that it creates a large number of messages,
and thus API calls, on the supporting system.

I’d like to buffer up to 1MB of log data and create messages in 1MB
chunks. Reading over the Extending heka documentation, it seems that
this could be done with either a Splitter plugin or a Filter plugin.

The idea of the splitter would be similar to the TokenSplitter. It
would check whether the byte slice of data passed in is over the buffer
size, select a buffer’s worth for a message from the slice, and ‘read’
that much. It would otherwise indicate 0 bytes read, as the
TokenSplitter does. I’m wondering if I would need to tweak any of the
global configuration options to make this work (e.g. max_message_loop,
plugin_chansize or max_message_size).

Alternatively, I’m thinking of implementing a filter that collects these
messages in a buffer and flushes the buffer when the desired size is
reached. The problem with this is that I’ll have multiple log streams and I
wouldn’t want to cross the streams. Also, as much as possible, I’d like
to preserve the order of lines within a single chunk (messages
themselves are encoded with a timestamp for later reassembly).

Thoughts on either approach?


Thanks,


Eli
--
*Elijah Flesher*  | *Lyft* <http://lyft.me/>  | /Software Engineer/
206.661.4697  |  @eliflesher


_______________________________________________
Heka mailing list
Heka@mozilla.org
https://mail.mozilla.org/listinfo/heka
