We had configured large batches (8192) for the main queue, action queue, and udp input module. However, after a brief debugging session, it appears that at least two of these settings are not being honored...
The database action queue seems to be deleting maximum batch sizes of 8 at a time: 7968.410369112:SecInfra2ArchiveDB queue[DA]:Reg/w0: regular consumer finished, iret=0, szlog 235270 sz phys 235278 7968.410545345:SecInfra2ArchiveDB queue[DA]:Reg/w0: DeleteProcessedBatch: we deleted 8 objects and enqueued 0 objects 7968.410554591:SecInfra2ArchiveDB queue[DA]:Reg/w0: doDeleteBatch: delete batch from store, new sizes: log 235270, phys 235270 And the main queue appears to be dequeueing in maximum chunks of 256: 7963.597767589:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 25 of 25 messages] 7963.616701318:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 38 of 38 messages] 7963.669420594:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 224 of 224 messages] 7963.731337893:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 256 of 256 messages] 7963.800962451:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 256 of 256 messages] 7963.857743064:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 256 of 256 messages] 7963.904639838:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 256 of 256 messages] 7963.985528145:main Q:Reg/w0 : END batch execution phase, entering to commit phase [processed 256 of 256 messages] ... I'm starting to think some of the options in our config file are incorrect or no longer supported...? Also, is there any specific keywords I should be looking for in the debug log? Or any specific debug options I should be turning on (i.e. LogFuncFlow, PrintMutexAction, OutputTidToStderr, etc...)? On Mon, Nov 6, 2017 at 4:54 PM, David Lang <[email protected]> wrote: > On Mon, 6 Nov 2017, Steven Libby wrote: > > I've tried removing the ruleset altogether and just using an action to >> minimize locking overhead (which I assume would be the same thing you're >> referring to here). But just to make sure, I also tried changing the >> ruleset queue type to "Direct" and got the same result. >> > > yes > > Also, most of the time during my testing I used resumeRetryCount="-1" but >> mistakenly left it at "0" when I posted to this mailing list. Reverting >> back to "-1" still results in a backlog in the database action queue. >> When >> you say "...some insert attempts...are not being accepted by mysql...", >> are >> you implying that some insert attempts are actually being rejected by >> mysql? >> > > that was one possibility when I thought you were not retrying logs. > > I have enabled the general query log with the highest level of >> verbosity, and also installed the audit plugin, but I am unable to see any >> indication that any insert attempts are being rejected. I have yet to >> enable the DBUG package yet for further investigation, but will try that >> next to see if I can glean any new information. >> > > are logs getting inserted one at a time or in large batches? try to > increase batching, get a debug log for a bit on the rsyslog side and see if > it's having to make multiple attempts to insert messages (for example, you > try to deliver up to 8K messages at once, rsyslog could be rejecting this > because the total request size is too large and rsyslog then tries to > deliver 4K, 2K, 1K... messages until it gets small enough to be accepted) > > David Lang > > > On a side note, I created a separate test table using the MyISAM storage >> engine and the insert rate was improved, but still filled up the database >> action queue within several minutes. I've reduced the durability of >> TokuDB >> by disabling tokudb_commit_sync and tokudb_fsync_log_period global >> variables, but there was zero noticeable performance difference here. >> (Background: we use TokuDB for it's compression and secondary clustering >> indexes, mainly). I would expect to see excessive CPU utilization or Disk >> I/O if the database was struggling to keep up with the rsyslog queue, but >> perhaps that expectation is incorrect? >> >> Regards, >> Steven >> >> On Sat, Nov 4, 2017 at 5:03 PM, David Lang <[email protected]> wrote: >> >> don't put a queue on the ruleset and the only item in the ruleset >>> >>> when you have to queues like this, you end up with lots of locking >>> overhead and poor batch sizing. >>> >>> with retrycount=0 you never attempt to re-deliver a message that fails, >>> if >>> you set it to -1 it will retry forever >>> >>> try setting it to -1 and see if logs are getting stuck (delivery stops), >>> if so, you are getting some insert attempts that are not being accepted >>> by >>> mysql and since you have retries disabled, the entire batch is being >>> thrown >>> away. >>> >>> >>> On Fri, 3 Nov 2017, Steven Libby via rsyslog wrote: >>> >>> Date: Fri, 3 Nov 2017 14:53:17 -0400 >>> >>>> From: Steven Libby via rsyslog <[email protected]> >>>> To: [email protected] >>>> Cc: Steven Libby <[email protected]> >>>> Subject: [rsyslog] ommysql database queue bottleneck >>>> >>>> >>>> We are running rsyslog 8.21.0 on a RHEL 6.7 box with 256GB of RAM and 96 >>>> cores, writing to a MySQL instance (Percona Server 5.6.28-76) using >>>> TokuDB >>>> as the storage engine. This is our centralized log server, which >>>> receives >>>> messages from numerous network devices at an average rate of 7k messages >>>> per second. This message rate is fairly consistent throughout the >>>> course >>>> of the day. >>>> >>>> We noticed records missing from the database when comparing with the raw >>>> records being written directly to a file on another server. Based on >>>> the >>>> investigation so far, it appears the lost messages are being discarded >>>> due >>>> to queues filling up. >>>> >>>> After enabling the impstats module, we discovered that the bottleneck >>>> appears to be in the database queue; but we're having trouble >>>> determining >>>> the exact cause of the bottleneck. Database resource utilization is >>>> relatively low (CPU utilization < 4%, Disk I/O < 1%, 5+GB free RAM) and >>>> we've tuned the database configuration to remove any unnecessary >>>> overhead. >>>> We've made numerous configuration changes, both on the DB side and in >>>> rsyslogd, and the queue seems to build at roughly the same rate >>>> regardless >>>> of of everything we've tried. >>>> >>>> We are successfully writing 3k-4k records per-second into the DB, but >>>> the >>>> action queue is also building at about that same rate. >>>> >>>> Here is the latest rsyslog configuration we've tried (below) cleaned up >>>> for >>>> the purpose of this message. Most of the incoming messages arrive via >>>> imudp. We've tried adjusting the number of threads in that module, as >>>> well >>>> as the batchSize, and the number of messages coming in stays about the >>>> same, regardless. We've also tried different combinations of MainMsg >>>> queue >>>> worker threads and batch sizes, as well as similar adjustments in the >>>> ruleset and action configurations. We've gone through the changelogs up >>>> to >>>> rsyslog 8.31.0 and didn't see anything to indicate that upgrading would >>>> help with the performance. We've also gone through most of the online >>>> documentation at rsyslog.com/doc/v8-stable/ and read some articles by >>>> David >>>> Long regarding performance improvements and the work he did with Intuit. >>>> >>>> At this point, any additional help or guidance from the professionals >>>> would >>>> be greatly appreciated. Thanks for taking the time to read this... >>>> >>>> >>>> #rsyslog v8.x config file >>>> $ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat >>>> $RepeatedMsgContainsOriginalMsg on >>>> $OptimizeForUniprocessor off >>>> $PreserveFQDN on >>>> $WorkDirectory /HSD1/syslogQ >>>> >>>> $MainMsgQueueFileName mainq-SecInfra >>>> $MainMsgQueueMaxDiskSpace 200g >>>> $MainMsgQueueMaxFileSize 1g >>>> $MainMsgQueueSaveOnShutdown on >>>> $MainMsgQueueType LinkedList >>>> $MainMsgQueueSize 10485760 >>>> $MainMsgQueueHighWatermark 10000000 >>>> $MainMsgQueueWorkerThreads 8 >>>> $MainMsgDequeueBatchSize 8192 >>>> >>>> module(load="imudp" threads="8" batchSize="8192" timeRequery="10") >>>> module(load="imptcp" threads="8" processOnPoller="off" ) >>>> module(load="imuxsock") >>>> module(load="imklog") >>>> module(load="immark") >>>> module(load="ommysql") >>>> module(load="impstats" interval="5" resetCounters="off" format="legacy" >>>> log.syslog="off" log.file="/HSD1/locallog/pstat >>>> s-SecInfra_2017-11-03.log") >>>> module(load="mmutf8fix") >>>> >>>> $template SecInfraSyslogTmpl,"insert into SecInfraSyslog.syslogRaw >>>> (recievedAt,deviceReportedTime,deviceTimezone,fromHost, >>>> fromHostIpv4,hostName,facility,priority,localPriority,infoUnitId, >>>> inputName,syslogTag,programName,appName,procId,msgId, >>>> message,rawMessage) >>>> values ('%timegenerated:::date-mysql%.%timegenerated:::date- >>>> subseconds%','%timereported:::date-mysql%.%timereported::: >>>> date-subseconds%','%timereported:::date-tzoffsdirection%%timereported: >>>> ::date-tzoffshour%','%FROMHOST%','%fromhost-ip%','% >>>> HOSTNAME%',%syslogfacility%,%syslogpriority%,'%PRI%',%iut%, >>>> '%inputname%','%syslogtag%','%programname%','%APP-NAME%','% >>>> PROCID%','%MSGID%','%msg%','%rawmsg%')",SQL >>>> >>>> ruleset(name="SecInfra" >>>> queue.saveonshutdown="on" >>>> queue.type="LinkedList" >>>> queue.filename="SecInfra-ruleQ" >>>> queue.highwatermark="1000000" >>>> queue.size="1310720" >>>> queue.dequeuebatchsize="8192" >>>> queue.maxdiskspace="200g" >>>> queue.maxfilesize="10g" >>>> queue.workerthreads="2" >>>> ) { >>>> action(name="SecInfra2ArchiveDB" >>>> type="ommysql" >>>> server="127.0.0.1" >>>> db="SecInfraSyslog" >>>> uid="rsyslog" >>>> pwd="************" >>>> serverport="3306" >>>> template="SecInfraSyslogTmpl" >>>> queue.saveonshutdown="on" >>>> queue.type="LinkedList" >>>> queue.filename="SecInfraArchive-dbQ" >>>> queue.highwatermark="1000000" >>>> queue.size="1310720" >>>> queue.dequeuebatchsize="8192" >>>> queue.maxdiskspace="200g" >>>> queue.maxfilesize="10g" >>>> queue.workerthreads="2" >>>> action.resumeRetryCount="0" >>>> action.reportSuspension="on" >>>> action.reportSuspensionContinuation="on" >>>> ) >>>> } >>>> >>>> input(type="imudp" name="SecInfraUDP" address="************" port="514" >>>> ruleset="SecInfra") >>>> input(type="imtcp" name="SecInfraTCP" address="************" port="514" >>>> ruleset="SecInfra") >>>> >>>> Regards, >>>> Steven Libby >>>> _______________________________________________ >>>> rsyslog mailing list >>>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>>> http://www.rsyslog.com/professional-services/ >>>> What's up with rsyslog? Follow https://twitter.com/rgerhards >>>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad >>>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you >>>> DON'T LIKE THAT. >>>> >>>> >>>> >> _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

