Responses inline below...
On 04/15/2015 12:31 PM, Giordano, J C. wrote:
Hi,
I am running an instance of Heka, v0.9.1 on one server as a log shipper
of Apache access logs to a remote Heka instance on another server. The
log shipper instance is sending the logs via the TcpOutput plugin. The
problem I’m experiencing is with the queue. Regardless of the settings
the queue fills up and we either run out of disk space or the instance
of Heka stops working. Below I’ve described the results of the
different settings used in the config file followed by the configuration
I’m using on the log shipper instance. I’ve set the ticker_interval
down to 30 seconds due to the volume of Apache logs I’m processing.
Even so, the queue files fill up moments after starting Heka.
1) With no queue settings (i.e. using the defaults), the queue files
fill up, rotate and accumulate until the disk fills up.
Right, expected behavior.
2) With queue_full_action=shutdown, Heka dutifully stops when the first
queue file reaches capacity.
Still okay.
3) With queue_full_action=block, Heka pauses when the queue is full and
reports a message to STDOUT:
2015/04/15 14:04:40 Plugin 'TcpOutput' error: Queue is full
But, if I empty the queue file (i.e. cat /dev/null >| 1.log), log
processing never resumes. Instead, it reports the following message to
STDOUT. At this point, I have to kill -9 the instance of Heka.
Hrm. This isn't ideal, but I'm not surprised things get wedged. You're deleting
the queue out from under the output. Heka is blocking until the TCP connection
is available again, so it's not adding any new records to the queue, but it
*does* expect the queue to still be there. Ideally we'd recover more gracefully.
2015/04/15 14:06:41 Diagnostics: 32 packs have been idle more than 120
seconds.
2015/04/15 14:06:41 Diagnostics: (input) Plugin names and quantities
found on idle packs:
2015/04/15 14:06:41 Diagnostics: TcpOutput: 32
4) With queue_full_action=drop, Heka pauses when the queue is full and
reports the following message to STDOUT. At some point, it creates a
new queue file, zero bytes in size, that never gets used. And the
checkpoint.txt file even refers to the new queue file that got created.
Yet, if I attempt to empty or delete the filled queue file, Heka never
resumes processing logs.
2015/04/15 14:10:05 Plugin 'TcpOutput' error: Queue is full
Not quite sure what's going on here. I can't quite tell from your description
if anything is wrong. Ideally messages would be dropped for a while, until the
queue is drained enough for there to be some room for new records. Then new
messages would be appended to the queue, until such time as the queue hit the
ceiling again. You shouldn't have to touch it at all; this should Just Work™.
Is this not happening?
I must add that all of the messages are properly being sent to the
remote Heka instance.
Okay, so this implies that things are working, maybe?
And, I would expect the TcpOutput plugin to only
queue entries that don’t get sent.
That expectation is wrong. More below.
As would be the case if the remote instance was unreachable.
What is the purpose of the TcpOutput queue files, and how do I manage
them so I can use Heka as a log shipper running continuously on a
production server?
I'll explain how things are supposed to work, hopefully that'll help provide
some context.
When buffering is in play, whether in a TcpOutput or an ElasticSearchOutput,
*every* message will go through the buffer. Messages are received via the
plugin's input channel and then (when things are flowing smoothly) immediately
written to disk at the end of the buffer. There is another goroutine running
that is constantly pulling records from earlier in the buffer, where the cursor
points, and trying to send them. If a send is successful, the queue cursor is
updated to the next record, and the process is repeated. If the send fails, the
queue cursor doesn't update, and the same record will be retried until the send
succeeds.
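The intake/sender split described above can be sketched in a few lines of Go. This is a minimal, in-memory illustration with hypothetical names (`bufferedOutput`, `intake`, `drain`), not Heka's actual internals; the slice stands in for the on-disk queue files.

```go
package main

import "fmt"

// bufferedOutput is a simplified stand-in for a buffered output plugin:
// every message goes through the queue, and a cursor tracks the next
// record that still needs to be sent.
type bufferedOutput struct {
	queue  []string // stands in for the on-disk queue files
	cursor int      // index of the next record to send
}

// intake mirrors the plugin's input channel: when things are flowing
// smoothly, messages are immediately appended to the end of the buffer.
func (b *bufferedOutput) intake(msg string) {
	b.queue = append(b.queue, msg)
}

// drain models the sending goroutine: it pulls records from where the
// cursor points and tries to send them, advancing only on success.
func (b *bufferedOutput) drain(send func(string) bool) {
	for b.cursor < len(b.queue) {
		if !send(b.queue[b.cursor]) {
			return // send failed: keep the cursor, same record retried later
		}
		b.cursor++ // send succeeded: advance to the next record
	}
}

func main() {
	b := &bufferedOutput{}
	b.intake("rec1")
	b.intake("rec2")

	// Remote is down: nothing is acknowledged, cursor doesn't move.
	b.drain(func(string) bool { return false })
	fmt.Println("cursor after failed sends:", b.cursor)

	// Remote recovers: both records go out and the cursor catches up.
	b.drain(func(string) bool { return true })
	fmt.Println("cursor after recovery:", b.cursor)
}
```

The key point the sketch shows is that a failed send never loses the record or advances the cursor, so the same record is retried until it succeeds.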
When the sending goroutine clears out one queue file and moves on to the next
one, it is supposed to advance the cursor to the next file and *delete* the
file that was just finished. You didn't mention explicitly whether the queue
files are being automatically deleted as they're drained. Are they?
While this is happening, Heka keeps track of the size of the disk queue.
When a message is added to the queue, the size increases; when a file is
drained and deleted, the size goes down. This is all fine unless and until
the size of the queue hits the specified maximum, at which point the
behavior is determined by `queue_full_action`. The "shutdown" option is
self-explanatory: Heka shuts down. The "drop" option means that the intake
goroutine simply drops the message on the floor; messages keep flowing, but
they don't get added to the queue and never will be. The "block" option
means that the plugin stops pulling from the input channel altogether; the
channel backs up, eventually blocking the router, and traffic stops flowing
through Heka until there's room for the queue to grow again.
In both the "drop" and "block" cases, correct recovery depends on Heka being
able to continue processing the buffer. As records get processed and queue
files are drained, they're deleted. This pushes the queue size back below the
maximum, which in turn means the intake goroutine can once again start
appending to the end of the queue.
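The size bookkeeping and the three `queue_full_action` behaviors can be sketched like so. Again, the names (`diskQueue`, `append`, `fileDrained`) are illustrative, not Heka's real API; the point is that recovery happens automatically when a drained file's size is subtracted.

```go
package main

import "fmt"

// diskQueue models the bookkeeping: size grows on append, shrinks when a
// drained queue file is deleted, and the configured action decides what
// happens when the maximum would be exceeded.
type diskQueue struct {
	size, max int64
	action    string // "shutdown", "drop", or "block"
}

// append models adding one record of n bytes and reports the outcome.
func (q *diskQueue) append(n int64) string {
	if q.size+n > q.max {
		switch q.action {
		case "shutdown":
			return "shutdown" // Heka stops entirely
		case "drop":
			return "dropped" // message discarded, flow continues
		default:
			return "blocked" // intake stops pulling from the input channel
		}
	}
	q.size += n
	return "queued"
}

// fileDrained models the sender finishing and deleting a queue file of
// n bytes, which is what makes room for intake to resume.
func (q *diskQueue) fileDrained(n int64) {
	q.size -= n
}

func main() {
	q := &diskQueue{max: 100, action: "drop"}
	fmt.Println(q.append(80)) // fits: queued
	fmt.Println(q.append(40)) // would exceed max: dropped
	q.fileDrained(80)         // sender drains and deletes a file
	fmt.Println(q.append(40)) // room again: queued, no manual cleanup needed
}
```

This is also why deleting queue files by hand breaks the accounting: the size was never subtracted, so the queue looks fuller than it is until a restart rescans it.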
If you delete queue files out from under the output, things will get weird. The
output goroutine will probably get confused, because the file handle it's
holding no longer points to a valid file. Also, Heka won't know to subtract
that file size from the queue size, so until you do a restart (which causes
Heka to scan through the queue and recalculate its total size) the queue will
always seem bigger than it actually is.
One thing that strikes me is how quickly your queue is filling up. Why is
that happening? The buffer is meant to allow Heka to survive small amounts
of downtime or disconnection without losing any data, or to absorb short
burst spikes. It's not magic; if data is continually coming in faster than
it can go out, you're going to have a problem no matter what, and disk
queuing will only delay the inevitable.
Does this help at all?
-r
Thanks
Chris
[hekad]
maxprocs = 2
[test_com]
type = "LogstreamerInput"
log_directory = "/export/test/apache2/test_com"
file_match = '/(?P<Year>\d+)/(?P<Month>\d+)_(?P<Day>\d+)_access\.log'
priority = ["Year", "Month", "Day"]
[TcpOutput]
address = "10.10.10.1:5565"
message_matcher = "TRUE"
# 30 seconds
ticker_interval=30
# 1 GB
queue_max_buffer_size=1073741824
queue_full_action="drop"
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka