Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Rainer Gerhards Tue, 21 Apr 2009 23:33:16 -0700

> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of [email protected]
> Sent: Wednesday, April 22, 2009 8:26 AM
> To: rsyslog-users
> Subject: Re: [rsyslog] [PERFORM] performance for high-volume log
> insertion
> 
> On Wed, 22 Apr 2009, Rainer Gerhards wrote:
> 
> > Hi Ken,
> >
> > glad to have you here. I am a bit silent at the moment, because I am
> not a
> > real database guy and so I am primarily listening to any information
> that is
> > incoming. If you have a couple of minutes, it would be useful to
> review this
> > thread here:
> >
> > http://lists.adiscon.net/pipermail/rsyslog/2009-April/002003.html
> >
> > ...one comment inline below...
> >
> >> -----Original Message-----
> >> From: [email protected] [mailto:rsyslog-
> >> [email protected]] On Behalf Of Kenneth Marshall
> >>
> >> David,
> >>
> >> Okay, I am now subscribed to the mailing list. We are currently
> using
> >> rsyslog-3.20.x. As far as implementing a prototype of the prepared
> >> statement, I was sidetracked by other duties and have not had a
> chance
> >> to do anything but an initial evaluation. As far as the rsyslog
> >> internal
> >> escaping, it looked simplest to create another template type like
> the
> >> current SQL and STDSQL that indicated that escaping was not needed
> >> and/or that prepared statements should be used.
> >
> > To disable escaping, simply do not use SQL or STDSQL. However, the db
> outputs
> > currently require this option (easy to disable), because I cannot see
> how it
> > will work (with the existing code) without escaping.
> >
> > Any idea is most welcome.
> 
> when using prepared statement escaping is not needed.


... but that can only be on the non-text API level (e.g. by using libpq). On
the SQL text level, I have no idea how to tell the sql engine to insert ' -
if I don't say '''' but ''' how does the engine know what I say? With the
C-level API, I bind a parameter and specify a buffer and then put my
character into that buffer - no escaping needed for sure. But, again, I do
not see how this would work on the text level...

> according to
> Ken's
> message below he found that the overhead of doing the escaping was
> significant. I don't see why this should be the case, but if it
> requires
> making an extra copy of the string I guess it's possible
> 

I am surprised, too. Even if a copy is made (I think it is), this is a quick
in-memory operation. Given the rest of the picture, I would expect that to
have very low impact on the overall cost (just think about the need to copy
buffers between different contexts, down to different layers, etc - so I'd
expect to see ample copy operations before the data finally hits the disk).
Anyhow, I may be totally wrong...

> I plan to get a setup togeather in the next couple of days that will
> let
> me do some testing of the options on the database side.
> 

That would be great. I, for now, intend to look at the queue first. I have
begun to thought about steps on how to tackle the beast. So I will probably
not do much more on the database level than throw in some thoughts (but not
do any testing or coding).

Rainer

> David Lang
> 
> > Rainer
> >>
> >> Regards,
> >> Ken
> >>
> >> On Tue, Apr 21, 2009 at 08:51:37AM -0700, [email protected] wrote:
> >>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >>>
> >>>> On Tue, Apr 21, 2009 at 08:37:54AM -0700, [email protected] wrote:
> >>>>> Kenneth,
> >>>>>   could you join the discussion on the rsyslog mailing list?
> >>>>> rsyslog-users <[email protected]>
> >>>>>
> >>>>> I'm surprised to hear you say that rsyslog can already do batch
> >> inserts
> >>>>> and
> >>>>> am interested in how you did that.
> >>>>>
> >>>>> what sort of insert rate did you mange to get?
> >>>>>
> >>>>> David Lang
> >>>>>
> >>>> David,
> >>>>
> >>>> I would be happy to join the discussion. I did not mean to say
> >>>> that rsyslog currently supported batch inserts, just that the
> >>>> pieces that provide "stand-by queuing" could be used to manage
> >>>> batching inserts.
> >>>
> >>> I've changed the to list to the rsyslog users list.
> >>>
> >>> currently the stand-by queuing still handles messages one at a
> time.
> >>> however a sponser has been found to pay to changing the rsyslog
> >> internals
> >>> to allow for multiple messages to be handled at once, which is what
> >>> triggered some of this discussion.
> >>>
> >>> which version of rsyslog are you working with?
> >>>
> >>> when you modified rsyslog to do prepared statement (to avoid the
> >> escaping
> >>> and parsing) did you hard-code the prepared statement? what other
> >> changes
> >>> did you make?
> >>>
> >>> David Lang
> >>>
> >>>> Cheers,
> >>>> Ken
> >>>>
> >>>>> On Tue, 21 Apr 2009, Kenneth Marshall wrote:
> >>>>>
> >>>>>> Date: Tue, 21 Apr 2009 08:33:30 -0500
> >>>>>> From: Kenneth Marshall <[email protected]>
> >>>>>> To: Richard Huxton <[email protected]>
> >>>>>> Cc: [email protected], Stephen Frost <[email protected]>,
> >>>>>>     Greg Smith <[email protected]>, pgsql-
> >> [email protected]
> >>>>>> Subject: Re: [PERFORM] performance for high-volume log insertion
> >>>>>> Hi,
> >>>>>>
> >>>>>> I just finished reading this thread. We are currently working on
> >>>>>> setting up a central log system using rsyslog and PostgreSQL. It
> >>>>>> works well once we patched the memory leak. We also looked at
> what
> >>>>>> could be done to improve the efficiency of the DB interface. On
> >> the
> >>>>>> rsyslog side, moving to prepared queries allows you to remove
> the
> >>>>>> escaping that needs to be done currently before attempting to
> >>>>>> insert the data into the SQL backend as well as removing the
> >> parsing
> >>>>>> and planning time from the insert. This is a big win for high
> >> insert
> >>>>>> rates, which is what we are talking about. The escaping process
> is
> >>>>>> also a big CPU user in rsyslog which then hands the escaped
> string
> >>>>>> to the backend which then has to undo everything that had been
> >> done
> >>>>>> and parse/plan the resulting query. This can use a surprising
> >> amount
> >>>>>> of additional CPU. Even if you cannot support a general prepared
> >>>>>> query interface, by specifying what the query should look like
> you
> >>>>>> can handle much of the low-hanging fruit query-wise.
> >>>>>>
> >>>>>> We are currently using a date based trigger to use a new
> partition
> >>>>>> each day and keep 2 months of logs currently. This can be
> usefully
> >>>>>> managed on the backend database, but if rsyslog supported
> changing
> >>>>>> the insert to the new table on a time basis, the CPU used by the
> >>>>>> trigger to support this on the backend could be reclaimed. This
> >>>>>> would be a win for any DB backend. As you move to the new
> >> partition,
> >>>>>> issuing a truncate to clear the table would simplify the DB
> >> interfaces.
> >>>>>>
> >>>>>> Another performance enhancement already mentioned, would be to
> >>>>>> allow certain extra fields in the DB to be automatically
> populated
> >>>>>> as a function of the log messages. For example, logging the mail
> >> queue
> >>>>>> id for messages from mail systems would make it much easier to
> >> locate
> >>>>>> particular mail transactions in large amounts of data.
> >>>>>>
> >>>>>> To sum up, eliminating the escaping in rsyslog through the use
> of
> >>>>>> prepared queries would reduce the CPU load on the DB backend.
> >> Batching
> >>>>>> the inserts will also net you a big performance increase. Some
> DB-
> >> based
> >>>>>> applications allow for the specification of several types of
> >> queries,
> >>>>>> one for single inserts and then a second to support multiple
> >> inserts
> >>>>>> (copy). Rsyslog already supports the queuing pieces to allow you
> >> to
> >>>>>> batch inserts. Just some ideas.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Ken
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
> >>>>>>> [email protected] wrote:
> >>>>>>>> On Tue, 21 Apr 2009, Stephen Frost wrote:
> >>>>>>>>> * [email protected] ([email protected]) wrote:
> >>>>>>>>>> while I fully understand the 'benchmark your situation'
> need,
> >> this
> >>>>>>>>>> isn't
> >>>>>>>>>> that simple.
> >>>>>>>>>
> >>>>>>>>> It really is.  You know your application, you know it's
> primary
> >> use
> >>>>>>>>> cases, and probably have some data to play with.  You're
> >> certainly in
> >>>>>>>>> a
> >>>>>>>>> much better situation to at least *try* and benchmark it than
> >> we are.
> >>>>>>>> rsyslog is a syslog server. it replaces (or for debian and
> >> fedora, has
> >>>>>>>> replaced) your standard syslog daemon. it recieves log
> messages
> >> from
> >>>>>>>> every
> >>>>>>>> app on your system (and possibly others), filters, maniulates
> >> them,
> >>>>>>>> and
> >>>>>>>> then stores them somewhere. among the places that it can store
> >> the
> >>>>>>>> logs
> >>>>>>>> are database servers (native support for MySQL, PostgreSQL,
> and
> >>>>>>>> Oracle.
> >>>>>>>> plus libdbi for others)
> >>>>>>>
> >>>>>>> Well, from a performance standpoint the obvious things to do
> are:
> >>>>>>> 1. Keep a connection open, do NOT reconnect for each log-
> >> statement
> >>>>>>> 2. Batch log statements together where possible
> >>>>>>> 3. Use prepared statements
> >>>>>>> 4. Partition the tables by day/week/month/year (configurable I
> >> suppose)
> >>>>>>>
> >>>>>>> The first two are vital, the third takes you a step further.
> The
> >> fourth
> >>>>>>> is
> >>>>>>> a long-term admin thing.
> >>>>>>>
> >>>>>>> And possibly
> >>>>>>> 5. Have two connections, one for fatal/error etc and one for
> >> info/debug
> >>>>>>> level log statements (configurable split?). Then you can use
> the
> >>>>>>> synchronous_commit setting on the less important ones. Might
> buy
> >> you
> >>>>>>> some
> >>>>>>> performance on a busy system.
> >>>>>>>
> >>>>>>> http://www.postgresql.org/docs/8.3/interactive/runtime-config-
> >> wal.html#RUNTIME-CONFIG-WAL-SETTINGS
> >>>>>>>
> >>>>>>>> other apps then search and report on the data after it is
> >> stored. what
> >>>>>>>> apps?, I don't know either. pick your favorite reporting tool
> >> and
> >>>>>>>> you'll
> >>>>>>>> be a step ahead of me (I don't know a really good reporting
> >> tool)
> >>>>>>>> as for sample data, you have syslog messages, just like I do.
> so
> >> you
> >>>>>>>> have
> >>>>>>>> the same access to data that I have.
> >>>>>>>> how would you want to query them? how would people far less
> >>>>>>>> experianced
> >>>>>>>> that you want to query them?
> >>>>>>>> I can speculate that some people would do two columns (time,
> >>>>>>>> everything
> >>>>>>>> else), others will do three (time, server, everything else),
> and
> >>>>>>>> others
> >>>>>>>> will go further (I know some who would like to extract IP
> >> addresses
> >>>>>>>> embedded in a message into their own column). some people will
> >> index
> >>>>>>>> on
> >>>>>>>> the time and host, others will want to do full-text searches
> of
> >>>>>>>> everything.
> >>>>>>>
> >>>>>>> Well, assuming it looks much like traditional syslog, I would
> do
> >>>>>>> something
> >>>>>>> like: (timestamp, host, facility, priority, message). It's easy
> >> enough
> >>>>>>> to
> >>>>>>> stitch back together if people want that.
> >>>>>>>
> >>>>>>> PostgreSQL's full-text indexing is quite well suited to
> logfiles
> >> I'd
> >>>>>>> have
> >>>>>>> thought, since it knows about filenames, urls etc already.
> >>>>>>>
> >>>>>>> If you want to get fancy, add a msg_type column and one
> >> subsidiary
> >>>>>>> table
> >>>>>>> for each msg_type. So - you might have smtp_connect_from
> >> (hostname,
> >>>>>>> ip_addr). A set of perl regexps can match and extract the
> fields
> >> for
> >>>>>>> these
> >>>>>>> extra tables, or you could do it with triggers inside the
> >> database. I
> >>>>>>> think
> >>>>>>> it makes sense to do it in the application. Easier for users to
> >>>>>>> contribute
> >>>>>>> new patterns/extractions. Meanwhile, the core table is
> untouched
> >> so you
> >>>>>>> don't *need* to know about these extra tables.
> >>>>>>>
> >>>>>>> If you have subsidiary tables, you'll want to partition those
> too
> >> and
> >>>>>>> perhaps stick them in their own schema (logs200901, logs200902
> >> etc).
> >>>>>>>
> >>>>>>> --
> >>>>>>>   Richard Huxton
> >>>>>>>   Archonet Ltd
> >>>>>>>
> >>>>>>> --
> >>>>>>> Sent via pgsql-performance mailing list
> >>>>>>> ([email protected])
> >>>>>>> To make changes to your subscription:
> >>>>>>> http://www.postgresql.org/mailpref/pgsql-performance
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Sent via pgsql-performance mailing list
> >>>>> ([email protected])
> >>>>> To make changes to your subscription:
> >>>>> http://www.postgresql.org/mailpref/pgsql-performance
> >>>>>
> >>>>
> >>>
> >> _______________________________________________
> >> rsyslog mailing list
> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> http://www.rsyslog.com
> > _______________________________________________
> > rsyslog mailing list
> > http://lists.adiscon.net/mailman/listinfo/rsyslog
> > http://www.rsyslog.com
> >
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

Re: [rsyslog] [PERFORM] performance for high-volume log insertion

Reply via email to