On 09/23/2015 08:55 PM, Andre wrote:
Rob,

TcpInput -> TokenSplitter -> PayloadRegexDecoder -> ESJsonEncoder ->
FileOutput

I don't understand why you'd want to use ESJsonEncoder with a FileOutput. 
What's going on there?

You are right... "that makes no sense!" (Cochran, 7th October 1998)

Just to clarify: in the desired pipeline, FileOutput will be replaced
by KafkaOutput or AMQPOutput. Not sure yet what will be on the other
side of the message bus feeding ES. It could be Heka or it could
be Logstash. It doesn't truly matter at this stage.

I'd probably send the protobuf encoded data over the transport, and then use Heka to feed ES, holding off on encoding to the JSON that ElasticSearch wants until the last moment.
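A rough sketch of that layout, assuming Heka's KafkaOutput/KafkaInput, ESJsonEncoder, and ElasticSearchOutput plugins; the broker address, topic, and index pattern here are illustrative only:

```toml
# Sender side: ship protobuf-framed messages over Kafka.
[kafkaout]
type = "KafkaOutput"
message_matcher = "TRUE"
addrs = ["kafka-broker:9092"]   # hypothetical broker
topic = "syslog"
encoder = "ProtobufEncoder"

# Consumer side: pull from Kafka, and only now encode for ElasticSearch.
[kafkain]
type = "KafkaInput"
addrs = ["kafka-broker:9092"]
topic = "syslog"

[ESJsonEncoder]
index = "heka-%{%Y.%m.%d}"

[esout]
type = "ElasticSearchOutput"
message_matcher = "TRUE"
server = "http://127.0.0.1:9200"
encoder = "ESJsonEncoder"
```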

In any case, I observed the below-expected performance. I tried to
reduce the number of other variables in the test (e.g. I didn't want
to point at Heka for problems caused by an underspec'd Kafka broker)
and as a consequence set the output to disk.

Hope the pipeline makes more sense now.


Also, PayloadRegexDecoder is known to be slow and unwieldy. If you're doing any 
parsing at all, you can get much better performance with a SandboxDecoder that 
uses an LPEG grammar than you can with a PayloadRegexDecoder.

Agreed. More on that below.



2) I should give it a go and remove TokenSplitter from the pipeline

Why would you remove TokenSplitter? If you want to split the records on 
newlines, then TokenSplitter is what you want.

I knew this would break the pipeline, but it also allowed me to pinpoint
the performance hit to the PayloadRegexDecoder. (As mentioned before, I
try to isolate components while troubleshooting performance.)


So I guess the bottleneck is either in the Splitter or the regex
decoder (though the regex is a simple '^(?P<Payload>.*)').

PayloadRegexDecoder is generally pretty slow. And what's the point of using a
PayloadRegexDecoder when you're not even doing any decoding? If you want to
pass the payload through without any parsing, then don't use a decoder at all.

I tried this (no decoder) but couldn't get it to work... (using the
following config):

Ah, right, I forgot that since the TcpInput defaults to using a ProtobufDecoder, leaving out the decoder setting will give you the wrong results. You can work around this by explicitly setting the decoder value to an empty string:

[TcpInput]
address = "127.0.0.1:5565"
splitter = "TokenSplitter"
decoder = ""




---- START ----

[TcpInput]
address = "127.0.0.1:5565"
splitter = "newline_splitter"


[newline_splitter]
type = "TokenSplitter"
delimiter = '\n'

This is correct, but FYI a TokenSplitter with this configuration is automatically defined for you and is available using the name "TokenSplitter".
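That is, the custom splitter section can be dropped entirely and the built-in instance referenced by name:

```toml
[TcpInput]
address = "127.0.0.1:5565"
splitter = "TokenSplitter"  # built-in instance; splits on newlines by default
```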

[PayloadEncoder]
append_newlines = true

[fileout]
type = "FileOutput"
message_matcher = "TRUE"
path = "/tmp/message-output.log"
perm = "666"
flush_count = 100
flush_operator = "OR"
encoder = "PayloadEncoder"

---- END ----


If I understand correctly, TcpInput defaults to ProtobufDecoder and as
such the syslog messages are split (using \n) but will not get decoded
(they are not ProtoBuf messages after all).

Right, you have to use the workaround I described above.

So my next step was to add a PayloadRegexDecoder to get the messages
decoded to the bare minimum, however, the side effect was a slower
pipeline.
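(For reference, the regex decoder I tested was configured roughly like this, using the pass-through pattern quoted earlier; the exact config is reconstructed from memory:)

```toml
[PayloadRegexDecoder]
match_regex = '^(?P<Payload>.*)'

[TcpInput]
address = "127.0.0.1:5565"
splitter = "newline_splitter"
decoder = "PayloadRegexDecoder"
```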


In any case... after reading your previous message about LPEG, it
seems the best call would have been to use a SandboxDecoder instead of
regex...

And indeed I get significantly better performance with a
SandboxDecoder like this:

---- START ----

require 'lpeg'
local l = lpeg
l.locale(l)

-- Capture the entire line verbatim (no real parsing).
local line = l.P(1)^0
grammar = l.C(line)

local msg = {
    Payload = nil,
}

function process_message()
    local payload = read_message("Payload")
    local m = grammar:match(payload)
    if m then
        msg.Payload = m
        inject_message(msg)
        return 0
    end
    return -1
end

---- END ----

While this is faster than the "null" PayloadRegexDecoder, it's still doing a great deal of work for no value. Pretty sure you can get rid of all of this code and simply use the following:

----START----
function process_message()
    return 0
end
----END----

But, again, even this is moot; you should explicitly specify no decoder.

As you suggested, the result is much better than the one previously obtained:

$ /tmp/loggen -r 50000 -i -S -s 256 localhost 5565
average rate = 24015.29 msg/sec, count=240203, time=10.020, (last) msg
size=256, bandwidth=6003.82 kB/sec

Try with no decoder, see what you end up with.


I also reintroduced the ES JSON encoding and, as you expected,
performance dropped:


$ /tmp/loggen -r 50000 -i -S -s 256 localhost 5565
average rate = 14478.81 msg/sec, count=144816, time=10.019, (last) msg
size=256, bandwidth=3619.70 kB/sec


However, performance with the ES encoder was still higher than what I
achieved using the PayloadRegexDecoder:


$ /tmp/loggen -r 50000 -i -S -s 256 localhost 5565
average rate = 12663.89 msg/sec, count=126763, time=10.098, (last) msg
size=256, bandwidth=3165.97 kB/sec

I also played with larger poolsize values but couldn't get any extra
juice from changing it.

Meanwhile, increasing the value of plugin_chansize to the same value
of poolsize seems to bring the EPS count closer to 30K EPS but I'm
still not 100% sure about the consequences of this... :D

Pool size and channel size aren't likely to change much here. Tweaking them is useful in certain specific situations, but this is usually when you have a bunch of filter and output plugins and you're bumping up against all of the available packs being used at the same time.
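For reference, both knobs live in the [hekad] section; the values below are just examples, not recommendations:

```toml
[hekad]
maxprocs = 4
poolsize = 100          # number of message packs available globally
plugin_chansize = 100   # buffer size of each plugin's input channel
```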

Hope this helps,

-r




Current config looks like:

---- START ----
[hekad]
maxprocs = 4
poolsize = 100
# get some extra EPS by increasing chansize.
#plugin_chansize = 100
base_dir = "/tmp/hekad/"

[TcpInput]
address = "127.0.0.1:5565"
splitter = "newline_splitter"
decoder = "SandboxDecoder"

[newline_splitter]
type = "TokenSplitter"
delimiter = '\n'

[SandboxDecoder]
filename = "/tmp/hekad/line.lua"

[PayloadEncoder]
append_newlines = false

[fileout]
type = "FileOutput"
message_matcher = "TRUE"
path = "/tmp/message-output.log"
perm = "666"
flush_count = 100
flush_operator = "OR"
encoder = "PayloadEncoder"
---- END ----



Cheers

_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka
