Hi David and Rainer,

I have read the thread below and caught up on all of the list traffic. Regarding the performance implications of the escaping, it is certainly not a concern at the low end of messages per second. I mentioned it because if you use PQexecParams or PQexecPrepared you can avoid the tedious and error-prone quoting and escaping process on both the rsyslog side and the DB backend.
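To make that concrete, here is a rough libpq sketch of what I mean. The table and column names (log_events, host, tag, message) and the connection string are made up for illustration and are not rsyslog's actual schema or template. It shows both points: binding the values as parameters, so no quoting or escaping is needed on either side, and (as I get to below) wrapping a batch of inserts in a single transaction.

/*
 * Rough libpq sketch -- table/column names ("log_events", host, tag,
 * message) and the conninfo string are made up for illustration, not
 * rsyslog's actual schema. Build with something like: cc demo.c -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=syslog");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Parse/plan once; values travel as bound parameters, never as
     * quoted SQL text, so no escaping is needed anywhere. */
    PGresult *res = PQprepare(conn, "ins_log",
        "INSERT INTO log_events (host, tag, message) VALUES ($1, $2, $3)",
        3, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
    PQclear(res);

    const char *batch[][3] = {
        { "web01", "postfix", "a message with 'quotes' needs no escaping" },
        { "web02", "sshd",    "another message" },
    };

    /* One transaction for the whole batch instead of one per message.
     * Each PQexecPrepared call is still its own round-trip; a multi-row
     * INSERT or COPY would reduce those further. */
    PQclear(PQexec(conn, "BEGIN"));
    for (size_t i = 0; i < sizeof(batch) / sizeof(batch[0]); i++) {
        res = PQexecPrepared(conn, "ins_log", 3, batch[i], NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
    PQclear(PQexec(conn, "COMMIT"));

    PQfinish(conn);
    return 0;
}

PQexecParams works the same way for one-shot statements; PQprepare/PQexecPrepared simply avoids re-parsing and re-planning for every message.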
Obviously, as this thread shows, using multiple inserts per transaction is much, much faster than using a single insert per transaction: http://archives.postgresql.org/pgsql-performance/2006-06/msg00381.php

It is also useful to remember that the number of round-trips from the application to the DB matters even when using multiple inserts per transaction, i.e. sending

  begin; insert xxx values yyy; insert xxx values zzz; ...; commit;  (one send)

will again be faster than:

  begin;                  (send)
  insert xxx values yyy;  (send)
  insert xxx values zzz;  (send)
  ...
  commit;                 (send)

I also agree with your assessment that you do not need a lot of granularity in the grouping. If you have light traffic, the current one insert per transaction is fine, and if you have heavy logging, grouping to a single larger size is sufficient and would also reduce the complexity of the implementation.

Regards,
Ken

On Tue, Apr 21, 2009 at 11:52:02PM -0700, [email protected] wrote:
> On Wed, 22 Apr 2009, Rainer Gerhards wrote:
>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
> >> Sent: Wednesday, April 22, 2009 8:26 AM
> >> To: rsyslog-users
> >> Subject: Re: [rsyslog] [PERFORM] performance for high-volume log insertion
> >>
> >> On Wed, 22 Apr 2009, Rainer Gerhards wrote:
> >>
> >>> Hi Ken,
> >>>
> >>> glad to have you here. I am a bit silent at the moment, because I am not a real database guy and so I am primarily listening to any information that is incoming. If you have a couple of minutes, it would be useful to review this thread here:
> >>>
> >>> http://lists.adiscon.net/pipermail/rsyslog/2009-April/002003.html
> >>>
> >>> ...one comment inline below...
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:[email protected]] On Behalf Of Kenneth Marshall
> >>>>
> >>>> David,
> >>>>
> >>>> Okay, I am now subscribed to the mailing list. We are currently using rsyslog-3.20.x. As far as implementing a prototype of the prepared statement, I was sidetracked by other duties and have not had a chance to do anything but an initial evaluation. As far as the rsyslog internal escaping, it looked simplest to create another template type like the current SQL and STDSQL that indicated that escaping was not needed and/or that prepared statements should be used.
> >>>
> >>> To disable escaping, simply do not use SQL or STDSQL. However, the db outputs currently require this option (easy to disable), because I cannot see how it will work (with the existing code) without escaping.
> >>>
> >>> Any idea is most welcome.
> >>
> >> when using prepared statements escaping is not needed.
> >
> > ... but that can only be on the non-text API level (e.g. by using libpq). On the SQL text level, I have no idea how to tell the sql engine to insert ' - if I don't say '''' but ''' how does the engine know what I say? With the C-level API, I bind a parameter and specify a buffer and then put my character into that buffer - no escaping needed for sure. But, again, I do not see how this would work on the text level...
>
> correct.
>
> >> according to Ken's message below he found that the overhead of doing the escaping was significant.
> >> I don't see why this should be the case, but if it requires making an extra copy of the string I guess it's possible
> >>
> >
> > I am surprised, too. Even if a copy is made (I think it is), this is a quick in-memory operation. Given the rest of the picture, I would expect that to have very low impact on the overall cost (just think about the need to copy buffers between different contexts, down to different layers, etc - so I'd expect to see ample copy operations before the data finally hits the disk). Anyhow, I may be totally wrong...
> >
> >> I plan to get a setup together in the next couple of days that will let me do some testing of the options on the database side.
> >>
> >
> > That would be great. I, for now, intend to look at the queue first. I have begun to think about steps on how to tackle the beast. So I will probably not do much more on the database level than throw in some thoughts (but not do any testing or coding).
>
> sounds good.
>
> David Lang
>
> > Rainer
> >
> >> David Lang
> >>
> >>> Rainer
> >>>>
> >>>> Regards,
> >>>> Ken
> >>>>
> >>>> On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote:
> >>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >>>>>
> >>>>>> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote:
> >>>>>>> Kenneth,
> >>>>>>> could you join the discussion on the rsyslog mailing list? rsyslog-users <[email protected]>
> >>>>>>>
> >>>>>>> I'm surprised to hear you say that rsyslog can already do batch inserts and am interested in how you did that.
> >>>>>>>
> >>>>>>> what sort of insert rate did you manage to get?
> >>>>>>>
> >>>>>>> David Lang
> >>>>>>>
> >>>>>> David,
> >>>>>>
> >>>>>> I would be happy to join the discussion. I did not mean to say that rsyslog currently supported batch inserts, just that the pieces that provide "stand-by queuing" could be used to manage batching inserts.
> >>>>>
> >>>>> I've changed the To: list to the rsyslog users list.
> >>>>>
> >>>>> currently the stand-by queuing still handles messages one at a time. however a sponsor has been found to pay for changing the rsyslog internals to allow for multiple messages to be handled at once, which is what triggered some of this discussion.
> >>>>>
> >>>>> which version of rsyslog are you working with?
> >>>>>
> >>>>> when you modified rsyslog to do prepared statements (to avoid the escaping and parsing) did you hard-code the prepared statement? what other changes did you make?
> >>>>>
> >>>>> David Lang
> >>>>>
> >>>>>> Cheers,
> >>>>>> Ken
> >>>>>>
> >>>>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >>>>>>>
> >>>>>>>> Date: Tue, 21 Apr 2009 08:33:30 -0500
> >>>>>>>> From: Kenneth Marshall <[email protected]>
> >>>>>>>> To: Richard Huxton <[email protected]>
> >>>>>>>> Cc: [email protected], Stephen Frost <[email protected]>, Greg Smith <[email protected]>, [email protected]
> >>>>>>>> Subject: Re: [PERFORM] performance for high-volume log insertion
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I just finished reading this thread. We are currently working on setting up a central log system using rsyslog and PostgreSQL. It works well once we patched the memory leak. We also looked at what could be done to improve the efficiency of the DB interface.
> >>>>>>>> On the rsyslog side, moving to prepared queries allows you to remove the escaping that needs to be done currently before attempting to insert the data into the SQL backend, as well as removing the parsing and planning time from the insert. This is a big win for high insert rates, which is what we are talking about. The escaping process is also a big CPU user in rsyslog, which then hands the escaped string to the backend, which then has to undo everything that had been done and parse/plan the resulting query. This can use a surprising amount of additional CPU. Even if you cannot support a general prepared query interface, by specifying what the query should look like you can handle much of the low-hanging fruit query-wise.
> >>>>>>>>
> >>>>>>>> We are currently using a date-based trigger to use a new partition each day and keep 2 months of logs currently. This can be usefully managed on the backend database, but if rsyslog supported changing the insert to the new table on a time basis, the CPU used by the trigger to support this on the backend could be reclaimed. This would be a win for any DB backend. As you move to the new partition, issuing a truncate to clear the table would simplify the DB interfaces.
> >>>>>>>>
> >>>>>>>> Another performance enhancement already mentioned would be to allow certain extra fields in the DB to be automatically populated as a function of the log messages. For example, logging the mail queue id for messages from mail systems would make it much easier to locate particular mail transactions in large amounts of data.
> >>>>>>>>
> >>>>>>>> To sum up, eliminating the escaping in rsyslog through the use of prepared queries would reduce the CPU load on the DB backend. Batching the inserts will also net you a big performance increase. Some DB-based applications allow for the specification of several types of queries, one for single inserts and then a second to support multiple inserts (copy). Rsyslog already supports the queuing pieces to allow you to batch inserts. Just some ideas.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Ken
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
> >>>>>>>>> [email protected] wrote:
> >>>>>>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
> >>>>>>>>>>> * [email protected] ([email protected]) wrote:
> >>>>>>>>>>>> while I fully understand the 'benchmark your situation' need, this isn't that simple.
> >>>>>>>>>>>
> >>>>>>>>>>> It really is. You know your application, you know its primary use cases, and probably have some data to play with. You're certainly in a much better situation to at least *try* and benchmark it than we are.
> >>>>>>>>>> rsyslog is a syslog server. it replaces (or for debian and fedora, has replaced) your standard syslog daemon.
> >>>>>>>>>> it receives log messages from every app on your system (and possibly others), filters, manipulates them, and then stores them somewhere. among the places that it can store the logs are database servers (native support for MySQL, PostgreSQL, and Oracle, plus libdbi for others)
> >>>>>>>>>
> >>>>>>>>> Well, from a performance standpoint the obvious things to do are:
> >>>>>>>>> 1. Keep a connection open, do NOT reconnect for each log-statement
> >>>>>>>>> 2. Batch log statements together where possible
> >>>>>>>>> 3. Use prepared statements
> >>>>>>>>> 4. Partition the tables by day/week/month/year (configurable I suppose)
> >>>>>>>>>
> >>>>>>>>> The first two are vital, the third takes you a step further. The fourth is a long-term admin thing.
> >>>>>>>>>
> >>>>>>>>> And possibly
> >>>>>>>>> 5. Have two connections, one for fatal/error etc and one for info/debug level log statements (configurable split?). Then you can use the synchronous_commit setting on the less important ones. Might buy you some performance on a busy system.
> >>>>>>>>>
> >>>>>>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
> >>>>>>>>>
> >>>>>>>>>> other apps then search and report on the data after it is stored. what apps? I don't know either. pick your favorite reporting tool and you'll be a step ahead of me (I don't know a really good reporting tool)
> >>>>>>>>>> as for sample data, you have syslog messages, just like I do. so you have the same access to data that I have.
> >>>>>>>>>> how would you want to query them? how would people far less experienced than you want to query them?
> >>>>>>>>>> I can speculate that some people would do two columns (time, everything else), others will do three (time, server, everything else), and others will go further (I know some who would like to extract IP addresses embedded in a message into their own column). some people will index on the time and host, others will want to do full-text searches of everything.
> >>>>>>>>>
> >>>>>>>>> Well, assuming it looks much like traditional syslog, I would do something like: (timestamp, host, facility, priority, message). It's easy enough to stitch back together if people want that.
> >>>>>>>>>
> >>>>>>>>> PostgreSQL's full-text indexing is quite well suited to logfiles I'd have thought, since it knows about filenames, urls etc already.
> >>>>>>>>>
> >>>>>>>>> If you want to get fancy, add a msg_type column and one subsidiary table for each msg_type. So - you might have smtp_connect_from (hostname, ip_addr). A set of perl regexps can match and extract the fields for these extra tables, or you could do it with triggers inside the database. I think it makes sense to do it in the application.
> >>>>>>>>> Easier for users to contribute new patterns/extractions. Meanwhile, the core table is untouched so you don't *need* to know about these extra tables.
> >>>>>>>>>
> >>>>>>>>> If you have subsidiary tables, you'll want to partition those too and perhaps stick them in their own schema (logs200901, logs200902 etc).
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Richard Huxton
> >>>>>>>>> Archonet Ltd

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

