Rob,
>> TcpInput -> TokenSplitter -> PayloadRegexDecoder -> ESJsonEncoder ->
>> FileOutput
>
> I don't understand why you'd want to use ESJsonEncoder with a FileOutput.
> What's going on there?
You are right... "that makes no sense!" (Cochran, 7th October 1998)
Just to clarify: in the desired pipeline, FileOutput will be replaced
with KafkaOutput or AMQPOutput. Not sure yet what will sit on the
other side of the message bus feeding ES. It could be Heka or it
could be Logstash; it doesn't truly matter at this stage.
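For reference, once the message bus is in place the [fileout] section
further down would presumably be swapped for something along these
lines (a sketch only; the broker address and topic are placeholders,
and the option names follow my reading of the Heka KafkaOutput docs):
---- START ----
[kafkaout]
type = "KafkaOutput"
message_matcher = "TRUE"
# Placeholder broker address and topic -- adjust for the real cluster.
addrs = ["localhost:9092"]
topic = "syslog"
encoder = "ESJsonEncoder"
---- END ----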
In any case, since I observed below-expected performance, I tried to
reduce the number of other variables in the test (e.g. I didn't want
to point at Heka for problems caused by an underspec'd Kafka broker)
and as a consequence set the output to disk.
Hope the pipeline makes more sense now.
> Also, PayloadRegexDecoder is known to be slow and unwieldy. If you're doing
> any parsing at all, you can get much better performance with a SandboxDecoder
> that uses an LPEG grammar than you can with a PayloadRegexDecoder.
Agreed. More on that below.
>
>> 2) I should give it a go and remove TokenSplitter from the pipeline
>
> Why would you remove TokenSplitter? If you want to split the records on
> newlines, then TokenSplitter is what you want.
I knew this would break the pipeline, but it also allowed me to
pinpoint the performance hit to the PayloadRegexDecoder. (As mentioned
before, I try to isolate components while troubleshooting performance.)
>> So on I guess the bottleneck is either in the Splitter or the Regex
>> decoder (though the regex is a simple '^(?P<Payload>.*)' )
>
> PayloadRegexDecoder is generally pretty slow. And what's the point of using a
> PayloadRegexDecoder
> when you're not even doing any decoding? If you want to pass the payload
> through without any parsing,
> then don't use a decoder at all.
I tried this (no decoder) but couldn't get it to work... (using the
following config):
---- START ----
[TcpInput]
address = "127.0.0.1:5565"
splitter = "newline_splitter"
[newline_splitter]
type = "TokenSplitter"
delimiter = '\n'
[PayloadEncoder]
append_newlines = true
[fileout]
type = "FileOutput"
message_matcher = "TRUE"
path = "/tmp/message-output.log"
perm = "666"
flush_count = 100
flush_operator = "OR"
encoder = "PayloadEncoder"
---- END ----
If I understand correctly, TcpInput defaults to the ProtobufDecoder,
so the syslog messages are split (on \n) but won't get decoded
(they are not protobuf messages, after all).
So my next step was to add a PayloadRegexDecoder to get the messages
decoded to the bare minimum, however, the side effect was a slower
pipeline.
In any case... after reading your previous message about LPEG, it
seems the best call would have been to use a SandboxDecoder instead of
the regex...
And indeed I get significantly better performance with a
SandboxDecoder like this:
---- START ----
local l = require 'lpeg'
l.locale(l)

local line = l.P(1)^0
grammar = l.C(line)

local msg = {
    Payload = nil,
}

function process_message()
    local payload = read_message("Payload")
    local m = grammar:match(payload)
    if m then
        msg.Payload = m
        inject_message(msg)
        return 0
    end
    return -1
end
---- END ----
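(As an aside: since the grammar above just captures the whole payload
unchanged, an even simpler pass-through sketch, using only the
standard sandbox calls read_message/inject_message and no lpeg at all,
would be something like the following. Whether it is measurably faster
I haven't checked.)
---- START ----
-- Pass-through sketch: copy the raw payload into a new message
-- without any parsing; returning 0 signals success to hekad.
local msg = {
    Payload = nil,
}

function process_message()
    msg.Payload = read_message("Payload")
    inject_message(msg)
    return 0
end
---- END ----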
As you suggested, the result is much better than the one obtained previously:
$ /tmp/loggen -r 50000 -i -S -s 256 localhost 5565
average rate = 24015.29 msg/sec, count=240203, time=10.020, (last) msg size=256, bandwidth=6003.82 kB/sec
I also reintroduced the ES JSON encoding and, as you expected, the
performance dropped:
$ /tmp/loggen -r 50000 -i -S -s 256 localhost 5565
average rate = 14478.81 msg/sec, count=144816, time=10.019, (last) msg size=256, bandwidth=3619.70 kB/sec
However, the ES-encoded performance was still higher than what I had
achieved using the PayloadRegexDecoder:
$ /tmp/loggen -r 50000 -i -S -s 256 localhost 5565
average rate = 12663.89 msg/sec, count=126763, time=10.098, (last) msg size=256, bandwidth=3165.97 kB/sec
I also played with larger poolsize values but couldn't get any extra
juice out of that.
Meanwhile, increasing plugin_chansize to the same value as poolsize
seems to bring the rate closer to 30K EPS, but I'm still not 100% sure
about the consequences of this... :D
Current config looks like:
---- START ----
[hekad]
maxprocs = 4
poolsize = 100
# get some extra EPS by increasing chansize.
#plugin_chansize = 100
base_dir = "/tmp/hekad/"
[TcpInput]
address = "127.0.0.1:5565"
splitter = "newline_splitter"
decoder = "SandboxDecoder"
[newline_splitter]
type = "TokenSplitter"
delimiter = '\n'
[SandboxDecoder]
filename = "/tmp/hekad/line.lua"
[PayloadEncoder]
append_newlines = false
[fileout]
type = "FileOutput"
message_matcher = "TRUE"
path = "/tmp/message-output.log"
perm = "666"
flush_count = 100
flush_operator = "OR"
encoder = "PayloadEncoder"
---- END ----
Cheers
_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka