Hi David and Rainer,

I have read the thread below and caught up on all of the list traffic. Regarding the performance implications of the escaping, it is certainly not a concern at the low end of messages per second. I mentioned it because if you use PQexecParams or PQexecPrepared you can avoid the tedious and error-prone quoting and escaping process on both the rsyslog side and the DB backend.
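To make that concrete, here is a rough libpq sketch of what I mean. The table and column names (log_events, host, tag, message) and the connection string are made up for illustration and are not rsyslog's actual schema or template. It shows both points: binding the values as parameters, so no quoting or escaping is needed on either side, and (as I get to below) wrapping a batch of inserts in a single transaction.

/*
 * Rough libpq sketch -- table/column names ("log_events", host, tag,
 * message) and the conninfo string are made up for illustration, not
 * rsyslog's actual schema. Build with something like: cc demo.c -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=syslog");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Parse/plan once; values travel as bound parameters, never as
     * quoted SQL text, so no escaping is needed anywhere. */
    PGresult *res = PQprepare(conn, "ins_log",
        "INSERT INTO log_events (host, tag, message) VALUES ($1, $2, $3)",
        3, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
    PQclear(res);

    const char *batch[][3] = {
        { "web01", "postfix", "a message with 'quotes' needs no escaping" },
        { "web02", "sshd",    "another message" },
    };

    /* One transaction for the whole batch instead of one per message.
     * Each PQexecPrepared call is still its own round-trip; a multi-row
     * INSERT or COPY would reduce those further. */
    PQclear(PQexec(conn, "BEGIN"));
    for (size_t i = 0; i < sizeof(batch) / sizeof(batch[0]); i++) {
        res = PQexecPrepared(conn, "ins_log", 3, batch[i], NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
    PQclear(PQexec(conn, "COMMIT"));

    PQfinish(conn);
    return 0;
}

PQexecParams works the same way for one-shot statements; PQprepare/PQexecPrepared simply avoids re-parsing and re-planning for every message.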
Obviously, as this thread shows, using multiple inserts per transaction is much, much faster than using a single insert per transaction: http://archives.postgresql.org/pgsql-performance/2006-06/msg00381.php

It is also useful to remember that the number of round-trips from the application to the DB matters even when using multiple inserts per transaction, i.e. sending

  begin; insert xxx values yyy; insert xxx values zzz; ...; commit;  (one send)

will again be faster than:

  begin;                  (send)
  insert xxx values yyy;  (send)
  insert xxx values zzz;  (send)
  ...
  commit;                 (send)

I also agree with your assessment that you do not need a lot of granularity in the grouping. If you have light traffic, the current one insert per transaction is fine, and if you have heavy logging, grouping to a single larger size is sufficient and would also reduce the complexity of the implementation.

Regards,
Ken

On Tue, Apr 21, 2009 at 11:52:02PM -0700, [email protected] wrote:
> On Wed, 22 Apr 2009, Rainer Gerhards wrote:
>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
> >> Sent: Wednesday, April 22, 2009 8:26 AM
> >> To: rsyslog-users
> >> Subject: Re: [rsyslog] [PERFORM] performance for high-volume log insertion
> >>
> >> On Wed, 22 Apr 2009, Rainer Gerhards wrote:
> >>
> >>> Hi Ken,
> >>>
> >>> glad to have you here. I am a bit silent at the moment, because I am not a real database guy and so I am primarily listening to any information that is incoming. If you have a couple of minutes, it would be useful to review this thread here:
> >>>
> >>> http://lists.adiscon.net/pipermail/rsyslog/2009-April/002003.html
> >>>
> >>> ...one comment inline below...
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:[email protected]] On Behalf Of Kenneth Marshall
> >>>>
> >>>> David,
> >>>>
> >>>> Okay, I am now subscribed to the mailing list. We are currently using rsyslog-3.20.x. As far as implementing a prototype of the prepared statement, I was sidetracked by other duties and have not had a chance to do anything but an initial evaluation. As far as the rsyslog internal escaping, it looked simplest to create another template type like the current SQL and STDSQL that indicated that escaping was not needed and/or that prepared statements should be used.
> >>>
> >>> To disable escaping, simply do not use SQL or STDSQL. However, the db outputs currently require this option (easy to disable), because I cannot see how it will work (with the existing code) without escaping.
> >>>
> >>> Any idea is most welcome.
> >>
> >> when using prepared statements escaping is not needed.
> >
> > ... but that can only be on the non-text API level (e.g. by using libpq). On the SQL text level, I have no idea how to tell the sql engine to insert ' - if I don't say '''' but ''' how does the engine know what I say? With the C-level API, I bind a parameter and specify a buffer and then put my character into that buffer - no escaping needed for sure. But, again, I do not see how this would work on the text level...
>
> correct.
>
> >> according to Ken's message below he found that the overhead of doing the escaping was significant.
> >> I don't see why this should be the case, but if it requires making an extra copy of the string I guess it's possible
> >>
> >
> > I am surprised, too. Even if a copy is made (I think it is), this is a quick in-memory operation. Given the rest of the picture, I would expect that to have very low impact on the overall cost (just think about the need to copy buffers between different contexts, down to different layers, etc - so I'd expect to see ample copy operations before the data finally hits the disk). Anyhow, I may be totally wrong...
> >
> >> I plan to get a setup together in the next couple of days that will let me do some testing of the options on the database side.
> >>
> >
> > That would be great. I, for now, intend to look at the queue first. I have begun to think about steps on how to tackle the beast. So I will probably not do much more on the database level than throw in some thoughts (but not do any testing or coding).
>
> sounds good.
>
> David Lang
>
> > Rainer
> >
> >> David Lang
> >>
> >>> Rainer
> >>>>
> >>>> Regards,
> >>>> Ken
> >>>>
> >>>> On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote:
> >>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >>>>>
> >>>>>> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote:
> >>>>>>> Kenneth,
> >>>>>>> could you join the discussion on the rsyslog mailing list? rsyslog-users <[email protected]>
> >>>>>>>
> >>>>>>> I'm surprised to hear you say that rsyslog can already do batch inserts and am interested in how you did that.
> >>>>>>>
> >>>>>>> what sort of insert rate did you manage to get?
> >>>>>>>
> >>>>>>> David Lang
> >>>>>>>
> >>>>>> David,
> >>>>>>
> >>>>>> I would be happy to join the discussion. I did not mean to say that rsyslog currently supported batch inserts, just that the pieces that provide "stand-by queuing" could be used to manage batching inserts.
> >>>>>
> >>>>> I've changed the To: list to the rsyslog users list.
> >>>>>
> >>>>> currently the stand-by queuing still handles messages one at a time. however a sponsor has been found to pay for changing the rsyslog internals to allow for multiple messages to be handled at once, which is what triggered some of this discussion.
> >>>>>
> >>>>> which version of rsyslog are you working with?
> >>>>>
> >>>>> when you modified rsyslog to do prepared statements (to avoid the escaping and parsing) did you hard-code the prepared statement? what other changes did you make?
> >>>>>
> >>>>> David Lang
> >>>>>
> >>>>>> Cheers,
> >>>>>> Ken
> >>>>>>
> >>>>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >>>>>>>
> >>>>>>>> Date: Tue, 21 Apr 2009 08:33:30 -0500
> >>>>>>>> From: Kenneth Marshall <[email protected]>
> >>>>>>>> To: Richard Huxton <[email protected]>
> >>>>>>>> Cc: [email protected], Stephen Frost <[email protected]>, Greg Smith <[email protected]>, [email protected]
> >>>>>>>> Subject: Re: [PERFORM] performance for high-volume log insertion
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I just finished reading this thread. We are currently working on setting up a central log system using rsyslog and PostgreSQL. It works well once we patched the memory leak. We also looked at what could be done to improve the efficiency of the DB interface.
> >>>>>>>> On the rsyslog side, moving to prepared queries allows you to remove the escaping that needs to be done currently before attempting to insert the data into the SQL backend, as well as removing the parsing and planning time from the insert. This is a big win for high insert rates, which is what we are talking about. The escaping process is also a big CPU user in rsyslog, which then hands the escaped string to the backend, which then has to undo everything that had been done and parse/plan the resulting query. This can use a surprising amount of additional CPU. Even if you cannot support a general prepared query interface, by specifying what the query should look like you can handle much of the low-hanging fruit query-wise.
> >>>>>>>>
> >>>>>>>> We are currently using a date-based trigger to use a new partition each day and keep 2 months of logs currently. This can be usefully managed on the backend database, but if rsyslog supported changing the insert to the new table on a time basis, the CPU used by the trigger to support this on the backend could be reclaimed. This would be a win for any DB backend. As you move to the new partition, issuing a truncate to clear the table would simplify the DB interfaces.
> >>>>>>>>
> >>>>>>>> Another performance enhancement already mentioned would be to allow certain extra fields in the DB to be automatically populated as a function of the log messages. For example, logging the mail queue id for messages from mail systems would make it much easier to locate particular mail transactions in large amounts of data.
> >>>>>>>>
> >>>>>>>> To sum up, eliminating the escaping in rsyslog through the use of prepared queries would reduce the CPU load on the DB backend. Batching the inserts will also net you a big performance increase. Some DB-based applications allow for the specification of several types of queries, one for single inserts and then a second to support multiple inserts (copy). Rsyslog already supports the queuing pieces to allow you to batch inserts. Just some ideas.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Ken
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
> >>>>>>>>> [email protected] wrote:
> >>>>>>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
> >>>>>>>>>>> * [email protected] ([email protected]) wrote:
> >>>>>>>>>>>> while I fully understand the 'benchmark your situation' need, this isn't that simple.
> >>>>>>>>>>>
> >>>>>>>>>>> It really is. You know your application, you know its primary use cases, and probably have some data to play with. You're certainly in a much better situation to at least *try* and benchmark it than we are.
> >>>>>>>>>> rsyslog is a syslog server. it replaces (or for debian and fedora, has replaced) your standard syslog daemon.
> >>>>>>>>>> it receives log messages from every app on your system (and possibly others), filters, manipulates them, and then stores them somewhere. among the places that it can store the logs are database servers (native support for MySQL, PostgreSQL, and Oracle, plus libdbi for others)
> >>>>>>>>>
> >>>>>>>>> Well, from a performance standpoint the obvious things to do are:
> >>>>>>>>> 1. Keep a connection open, do NOT reconnect for each log-statement
> >>>>>>>>> 2. Batch log statements together where possible
> >>>>>>>>> 3. Use prepared statements
> >>>>>>>>> 4. Partition the tables by day/week/month/year (configurable I suppose)
> >>>>>>>>>
> >>>>>>>>> The first two are vital, the third takes you a step further. The fourth is a long-term admin thing.
> >>>>>>>>>
> >>>>>>>>> And possibly
> >>>>>>>>> 5. Have two connections, one for fatal/error etc and one for info/debug level log statements (configurable split?). Then you can use the synchronous_commit setting on the less important ones. Might buy you some performance on a busy system.
> >>>>>>>>>
> >>>>>>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
> >>>>>>>>>
> >>>>>>>>>> other apps then search and report on the data after it is stored. what apps? I don't know either. pick your favorite reporting tool and you'll be a step ahead of me (I don't know a really good reporting tool)
> >>>>>>>>>> as for sample data, you have syslog messages, just like I do. so you have the same access to data that I have.
> >>>>>>>>>> how would you want to query them? how would people far less experienced than you want to query them?
> >>>>>>>>>> I can speculate that some people would do two columns (time, everything else), others will do three (time, server, everything else), and others will go further (I know some who would like to extract IP addresses embedded in a message into their own column). some people will index on the time and host, others will want to do full-text searches of everything.
> >>>>>>>>>
> >>>>>>>>> Well, assuming it looks much like traditional syslog, I would do something like: (timestamp, host, facility, priority, message). It's easy enough to stitch back together if people want that.
> >>>>>>>>>
> >>>>>>>>> PostgreSQL's full-text indexing is quite well suited to logfiles I'd have thought, since it knows about filenames, urls etc already.
> >>>>>>>>>
> >>>>>>>>> If you want to get fancy, add a msg_type column and one subsidiary table for each msg_type. So - you might have smtp_connect_from (hostname, ip_addr). A set of perl regexps can match and extract the fields for these extra tables, or you could do it with triggers inside the database. I think it makes sense to do it in the application.
> >>>>>>>>> Easier for users to contribute new patterns/extractions. Meanwhile, the core table is untouched so you don't *need* to know about these extra tables.
> >>>>>>>>>
> >>>>>>>>> If you have subsidiary tables, you'll want to partition those too and perhaps stick them in their own schema (logs200901, logs200902 etc).
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Richard Huxton
> >>>>>>>>> Archonet Ltd

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

