[heka] Native InfluxDB Output Plugin

Giordano, J C. Tue, 25 Aug 2015 14:17:10 -0700

Heka community:

I would like to share my experiences with using Heka to parse Apache log files 
for insertion into InfluxDB.  My initial testing started with an 
out-of-the-box configuration consisting of:


1) Heka (v0.11) & InfluxDB (v0.9.3) both running on a single server: Ubuntu 
14.04.2 LTS, trusty
2) A single Apache access log of ~850K entries, read from the local file 
system.
3) A Heka configuration of: LogstreamerInput -> SandboxDecoder: 
apache_access.lua ->  SandboxEncoder: schema_influx_line.lua -> HttpOutput to 
InfluxDB

The performance of this configuration was unsuitable for production: it took 
over 12 hours to process a single log file.  By comparison, the same run using 
a LogOutput completed in approximately 3 minutes, so it was clear I needed to 
batch write records to InfluxDB.
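
For a sense of scale, the figures above can be turned into rough throughput 
numbers.  This is only a back-of-the-envelope sketch: the entry count is the 
approximate ~850K quoted above, and since the HttpOutput run took "over" 12 
hours, its rate is an upper bound.

```python
# Rough throughput implied by the figures quoted in this message.
ENTRIES = 850_000  # approximate number of entries in the access log


def rate_per_second(records, seconds):
    """Average number of records processed per second."""
    return records / seconds


# Per-record HttpOutput run: "over 12 hours", so this is an upper bound.
http_output_rate = rate_per_second(ENTRIES, 12 * 3600)
# LogOutput run: "approximately 3 minutes".
log_output_rate = rate_per_second(ENTRIES, 3 * 60)

print(f"HttpOutput: at most ~{http_output_rate:.0f} records/s")
print(f"LogOutput:  ~{log_output_rate:.0f} records/s")
print(f"Slowdown:   at least ~{log_output_rate / http_output_rate:.0f}x")
```

That is roughly 20 records per second with one HTTP request per record versus 
about 4,700 per second with no HTTP at all, a factor of about 240.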

My initial attempt to batch records via Lua was a weak effort and ultimately 
unsuccessful.  Queuing records in a Lua table (likely the incorrect approach) 
led to out-of-memory errors in the Lua sandbox for batch sizes exceeding ~200 
messages.  Moreover, my HttpOutput then started generating timeout errors when 
communicating with InfluxDB.  Not being well versed in Lua, and having to 
develop in a sandbox environment without any meaningful logging capabilities, 
I found this approach too unproductive to continue developing or debugging.

My second attempt uses a native InfluxDB output plugin I created, based on the 
existing ElasticSearchOutput plugin, which can batch write records via HTTP.  
Replacing HttpOutput in the initial configuration above with this new plugin 
has improved performance dramatically.  I’m now able to process a single 
Apache access log in ~4 minutes, and I’ve loaded 31 days of historical Apache 
logs through Heka -> InfluxDB in under 2 hours.  I’ve imported more than 36 
million records for each of three distinct time series, for a total exceeding 
108 million records.  The performance has far exceeded our expectations and we 
are now running Heka on a production server.  There’s no appreciable CPU load, 
and we’re able to write directly to InfluxDB, eliminating the need for log 
shippers to a central server as was required with Logstash.
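
The batching idea itself is small enough to sketch.  The following is not the 
plugin's code (the plugin is a Go plugin derived from ElasticSearchOutput); it 
is a minimal Python illustration of buffering line-protocol records and 
sending one HTTP request per batch.  The class name, the batch size, and the 
sample measurement are all hypothetical, and the sender is passed in so the 
buffering logic can be exercised without a running InfluxDB.

```python
import urllib.request


class BatchWriter:
    """Buffer records and hand them to `send` one batch at a time."""

    def __init__(self, send, batch_size=1000):
        self.send = send              # callable that takes the batch as bytes
        self.batch_size = batch_size  # hypothetical default, tune as needed
        self.buffer = []

    def add(self, line):
        """Queue one record; flush automatically when the batch is full."""
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single newline-separated body."""
        if not self.buffer:
            return
        body = "\n".join(self.buffer).encode("utf-8")
        self.send(body)    # if this raises, the buffer is kept for a retry
        self.buffer = []


def http_sender(url):
    """Return a send() that POSTs each batch to `url` (assumed write URL)."""
    def send(body):
        request = urllib.request.Request(url, data=body, method="POST")
        # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
        with urllib.request.urlopen(request):
            pass
    return send


# Demo without a network: collect batches in a list instead of POSTing them.
sent = []
writer = BatchWriter(sent.append, batch_size=3)
for i in range(7):
    writer.add(f"apache_access,host=web01 bytes={i}.0")  # made-up record
writer.flush()     # push the final, partially filled batch
print(len(sent))   # 3 batches: 3 + 3 + 1 records
```

Against a real server the writer would be built as 
BatchWriter(http_sender(url)), where url is the InfluxDB write URL.  A real 
output would also want a flush timer so that a quiet stream does not leave 
records sitting in the buffer.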

I have three requests:

1)  I would greatly appreciate having a native InfluxDB output plugin included 
with future releases of Heka and would like to contribute my work for your 
review and consideration.  Whether to keep separate plugins for ElasticSearch 
and InfluxDB or to introduce a generalized BatchHttpOutput plugin is worth 
considering.  The differences between the ElasticSearchOutput plugin and my 
modified InfluxDB plugin are minimal.  First, the ElasticSearch plugin assumes 
a fixed endpoint (/_bulk) whereas InfluxDB relies on a query string.  Second, 
ElasticSearch returns a JSON response whereas InfluxDB returns an HTTP status 
of 204 (No Content).  Both ElasticSearch & InfluxDB support TLS & UDP, though 
I’ve not tested either of these features with InfluxDB.  Differences beyond 
these are minor.
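
The two differences named in request 1 can be illustrated in a few lines.  
This is a hedged Python sketch, not code from either plugin: the function 
names, the database name "heka", and the localhost base URLs are assumptions 
for illustration.  It relies on the InfluxDB 0.9 HTTP write endpoint taking 
the database in the query string (/write?db=...) and on the ElasticSearch bulk 
response carrying a top-level "errors" flag.

```python
import json
from urllib.parse import urlencode


def es_bulk_url(base):
    """ElasticSearch: fixed bulk endpoint, no query string needed."""
    return f"{base.rstrip('/')}/_bulk"


def influx_write_url(base, db, precision=None):
    """InfluxDB: target database (and options) go in the query string."""
    params = {"db": db}
    if precision:
        params["precision"] = precision
    return f"{base.rstrip('/')}/write?{urlencode(params)}"


def es_batch_ok(status, body):
    """ElasticSearch: HTTP 200 plus a JSON body whose "errors" is false."""
    if status != 200:
        return False
    return not json.loads(body).get("errors", False)


def influx_batch_ok(status, body=b""):
    """InfluxDB: success is HTTP 204 with no body to parse."""
    return status == 204


print(es_bulk_url("http://localhost:9200"))
print(influx_write_url("http://localhost:8086", "heka", precision="s"))
```

A generalized BatchHttpOutput would presumably need only these two pieces to 
be configurable: how the request URL is built and how a response is judged to 
be a success.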

2) I’ve found one problem with my output plugin that appears unrelated to my 
changes or to InfluxDB and most likely exists for ElasticSearch as well.

When using the LogstreamerInput to read a single file with my InfluxOutput 
(cf. ElasticSearchOutput) and 'use_buffering = true', everything works fine.

When using the LogstreamerInput to read multiple files via a file match 
pattern/priority, I have to turn off buffering or I receive the following 
errors:

2015/08/24 14:56:43 Diagnostics: 1 packs have been idle more than 120 seconds.
2015/08/24 14:56:43 Diagnostics: (input) Plugin names and quantities found on idle packs:

From a previous discussion, it would appear there’s a deadlock occurring.  
Please advise on how to debug this further.

3)  While attempting to develop customizations via the Lua Sandbox, the only 
practical logging facility I could use was add_to_payload().  But it was not 
in scope from within lua_modules/.  I would like to know how best to relax the 
sandbox restrictions to gain access to the Lua IO library so that I can 
capture output to stdio/log files.  Or, in general, what advice do you offer 
on how best to develop/debug code via the Lua Sandbox?

Thanks,

Chris

_______________________________________________
Heka mailing list
[email protected]
https://mail.mozilla.org/listinfo/heka
