Re: [rsyslog] [PERFORM] performance for high-volume loginsertion(fwd)

david Thu, 23 Apr 2009 23:45:39 -0700

On Fri, 24 Apr 2009, Rainer Gerhards wrote:

-----Original Message-----
From: [email protected] [mailto:rsyslog-
[email protected]] On Behalf Of [email protected]
Sent: Friday, April 24, 2009 7:57 AM
To: rsyslog-users
Subject: Re: [rsyslog] [PERFORM] performance for high-volume
loginsertion(fwd)


On Fri, 24 Apr 2009, Rainer Gerhards wrote:

Another innocent question:

Let's say I used an exec() API exclusively. Now let me assume that I do, on
the *same* database connection, this calling sequence:

exec("begin transaction")
exec("insert ...")
exec("insert ...")
exec("insert ...")
exec("insert ...")
exec("insert ...")
exec("insert ...")   [Point A]
exec("commit")

Is it safe to assume that this will result in a performance benefit? (I know
that it causes more network traffic than necessary, but that's not my point;
I am only talking about speedup.) Will this speedup be considerable (on the
order of 20 vs. 3 seconds for a given sequence)?


Yes, this speedup would be considerable

from the message at the bottom it would be on the order of

separate inserts, no transaction: 21.21s
separate inserts, same transaction: 1.89s


I read this, just wanted some reconfirmation.

consider it confirmed. I've run my own tests in the past and you get a
HUGE benefit from this one step. depending on the particular database you
may be able to continue to see benefits well beyond 100 inserts in a batch
(I would start my testing at 100 and plan on going up to 1000)
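
A minimal sketch of that begin/insert/commit sequence. It uses Python's
built-in sqlite3 module purely as a stand-in so the example runs without a
database server (an assumption of this sketch, not what rsyslog or libpq
does); the table name logs and the helper insert_batch are hypothetical.

```python
import sqlite3


def insert_batch(conn, messages):
    """Insert all messages inside one explicit transaction.

    Hypothetical helper: mirrors exec("begin") / exec("insert") ... /
    exec("commit") on a single connection.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        for msg in messages:
            cur.execute("INSERT INTO logs (msg) VALUES (?)", (msg,))
        cur.execute("COMMIT")
    except Exception:
        # Any failure undoes every insert made since BEGIN.
        cur.execute("ROLLBACK")
        raise


# isolation_level=None disables sqlite3's implicit transactions, so
# BEGIN/COMMIT happen only where the code issues them.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE logs (msg TEXT)")  # assumed single-column table

insert_batch(conn, ["message %d" % i for i in range(100)])
count = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(count)
```

The batch size of 100 follows the starting point suggested above; the right
value for a given database would have to be found by testing.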


there is still another order of magnitude gain to be had by going to the
copy (and eliminating the extra round trips)

COPY (text): 0.10s


Definitely, but let's tackle the 90% issue first.


a copy looks something like

copy X from STDIN;
data
data
data
\.
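
A rough sketch of how such a COPY payload could be assembled in PostgreSQL's
text format (tab-separated fields, backslash escapes, \N for NULL, and a
final \. line as used when feeding psql). The helper names copy_escape and
copy_payload and the table name logs are hypothetical, and the table name is
inserted unquoted, so this is only suitable for trusted identifiers.

```python
def copy_escape(value):
    """Escape one field for COPY text format; None becomes \\N (NULL)."""
    if value is None:
        return "\\N"
    # Backslash must be escaped first so later escapes are not doubled.
    return (str(value)
            .replace("\\", "\\\\")
            .replace("\t", "\\t")
            .replace("\n", "\\n")
            .replace("\r", "\\r"))


def copy_payload(table, rows):
    """Build 'COPY <table> FROM STDIN;' followed by data and the end marker."""
    lines = ["COPY %s FROM STDIN;" % table]
    for row in rows:
        lines.append("\t".join(copy_escape(field) for field in row))
    lines.append("\\.")  # end-of-data marker
    return "\n".join(lines) + "\n"


payload = copy_payload("logs", [("host1", "disk full"),
                                ("host2", "tab\there")])
print(payload, end="")
```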

Also, even more importantly, does this really mean they are all in one
transaction?


yes.

In particular, what happens if the connection breaks at [Point A], e.g. by
the network connection going down for an extended period of time?

Is it safe to assume that then everything will be rolled back?


yes, every one of them would disappear.


So it looks like my three-call (beginBatch, pushData, EndBatch) calling
interface can probably work. I need to work on how non-transactional outputs
can convey what they have committed, but the basic interface looks rather good.

yes, although there is benefit in making these not be separate exec
statements but instead sending them to the database as you go along (I
don't know the library well enough to know how to do a non-blocking call
like this) or crafting one long string and sending it all at once. even if
the pieces are generated by separate write calls on the network
filehandle, with a TCP datastream (and a fast sender), the number of
round-trips may be far fewer than you think (what you create as separate
exec statements

my earlier 4-part proposal (start, mid, stop, data) is _slightly_ more
flexible in that it has the mid/join variable, allowing for something to
appear between points of data, but not at the end.


i.e.

insert into table X values (),();

your 3-part version would end up with an extra , at the end.

while this isn't critical it is an easy way to gain about another factor
of 10
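
A small sketch of the 4-part idea (start, mid, stop, data), showing why the
separate mid/join piece avoids the trailing comma. The function names and
the simplified quoting are assumptions of this sketch; real code should rely
on the database library's own escaping or parameter binding.

```python
def build_statement(start, mid, stop, data):
    """Join the data items with `mid` between them, but not after the last."""
    return start + mid.join(data) + stop


def sql_literal(text):
    """Simplified quoting for illustration only: doubles single quotes."""
    return "'" + text.replace("'", "''") + "'"


messages = ["first", "it's second", "third"]
rows = ["(%s)" % sql_literal(m) for m in messages]

# 4-part version: the separator appears only between rows.
statement = build_statement("insert into X values ", ",", ";", rows)

# 3-part version (start, per-row data with its own comma, stop):
# leaves an extra "," before the final ";".
three_part = "insert into X values " + "".join(r + "," for r in rows) + ";"

print(statement)
print(three_part)
```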


David Lang

Rainer

David Lang

Feedback is appreciated.

Rainer

-----Original Message-----
From: [email protected] [mailto:rsyslog-
[email protected]] On Behalf Of Rainer Gerhards
Sent: Thursday, April 23, 2009 4:38 PM
To: rsyslog-users
Subject: Re: [rsyslog] [PERFORM] performance for high-volume
loginsertion(fwd)

That's interesting. As a side-activity, I am thinking about a new output
module interface. Especially given the discussion on the postgres list, but
also some other thoughts about other modules (e.g. omtcp or the file output),
I tend to use an approach that permits both string-based as well as API-based
(API as in libpq) ways of doing things. I have not really designed anything,
but the rough idea is that each plugin needs three entry points:

- start batch
- process single message
- end batch

Then, the plugin can decide itself what it wants to do and when. Most
importantly, this calling interface works well for string-based transactions
as well as API-based ones.

For the output file writer, for example, I envision that over time it will
have its own write buffer (for various reasons, for example I am also
discussing zipped writing with some folks). With this interface, I can put
everything into the buffer, write out if needed but not if there is no
immediate need, but I can make sure that I write out when the "end batch"
entry point is called.
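
As a rough illustration of the three entry points applied to a buffering
file writer, here is a sketch in Python. The class name, method names and
the max_buffered parameter are hypothetical and do not reflect the actual
rsyslog plugin interface.

```python
import io


class BufferedFileOutput:
    """Hypothetical output plugin: start batch / process message / end batch."""

    def __init__(self, stream, max_buffered=64 * 1024):
        self.stream = stream
        self.max_buffered = max_buffered  # write out early past this size
        self.buffer = []
        self.buffered = 0

    def begin_batch(self):
        pass  # nothing to prepare for a plain file

    def process_message(self, msg):
        line = msg + "\n"
        self.buffer.append(line)
        self.buffered += len(line)
        if self.buffered >= self.max_buffered:
            self._flush()  # write out only when there is a need

    def end_batch(self):
        self._flush()  # always write out at the end of the batch

    def _flush(self):
        if self.buffer:
            self.stream.write("".join(self.buffer))
            self.buffer = []
            self.buffered = 0
        self.stream.flush()


out = io.StringIO()  # stands in for the real output file
plugin = BufferedFileOutput(out, max_buffered=1024)
plugin.begin_batch()
plugin.process_message("msg one")
plugin.process_message("msg two")
pending = out.getvalue()   # nothing written yet: still below the limit
plugin.end_batch()
written = out.getvalue()   # both lines written at end of batch
```

A database plugin could use the same three calls to issue begin, insert and
commit, which is what makes the interface work for both styles.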

As I said, it is not really thought out yet, but maybe a starting point. So
feedback is appreciated.

Rainer

-----Original Message-----
From: [email protected] [mailto:rsyslog-
[email protected]] On Behalf Of [email protected]
Sent: Wednesday, April 22, 2009 10:11 PM
To: rsyslog-users
Subject: Re: [rsyslog] [PERFORM] performance for high-volume log
insertion(fwd)

from the postgres performance mailing list, relative speeds of different
ways of inserting data.

I've asked if the 'separate inserts' mode is separate round trips or many
inserts in one round trip.

based on this it looks like prepared statements make a difference, but not
so much that other techniques (either a single statement or a copy) aren't
comparable (or better) options.

David Lang

---------- Forwarded message ----------
Date: Wed, 22 Apr 2009 15:33:21 -0400
From: Glenn Maynard <[email protected]>
To: [email protected]
Subject: Re: [PERFORM] performance for high-volume log insertion

On Wed, Apr 22, 2009 at 8:19 AM, Stephen Frost <[email protected]>
wrote:

Yes, as I believe was mentioned already, planning time for inserts is
really small.  Parsing time for inserts when there's little parsing that
has to happen also isn't all *that* expensive and the same goes for
conversions from textual representations of data to binary.

We're starting to re-hash things, in my view.  The low-hanging fruit is
doing multiple things in a single transaction, either by using COPY,
multi-value INSERTs, or just multiple INSERTs in a single transaction.
That's absolutely step one.


This is all well-known, covered information, but perhaps some numbers
will help drive this home.  40000 inserts into a single-column,
unindexed table; with predictable results:

separate inserts, no transaction: 21.21s
separate inserts, same transaction: 1.89s
40 inserts, 100 rows/insert: 0.18s
one 40000-value insert: 0.16s
40 prepared inserts, 100 rows/insert: 0.15s
COPY (text): 0.10s
COPY (binary): 0.10s

Of course, real workloads will change the weights, but this is more or
less the magnitude of difference I always see--batch your inserts into
single statements, and if that's not enough, skip to COPY.
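
To put the timings above in relative terms, a short sketch that computes the
speedup of each method over separate inserts with no transaction. The
numbers are the ones quoted in this message; the variable names are only
for illustration.

```python
# Timings in seconds for 40000 inserts, as quoted above.
timings = {
    "separate inserts, no transaction": 21.21,
    "separate inserts, same transaction": 1.89,
    "40 inserts, 100 rows/insert": 0.18,
    "one 40000-value insert": 0.16,
    "40 prepared inserts, 100 rows/insert": 0.15,
    "COPY (text)": 0.10,
    "COPY (binary)": 0.10,
}

baseline = timings["separate inserts, no transaction"]
speedups = {name: baseline / secs for name, secs in timings.items()}

for name, factor in speedups.items():
    print("%-38s %7.1fx" % (name, factor))
```

On these figures a single transaction is roughly 11 times faster than the
baseline, and COPY roughly 212 times faster.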

--
Glenn Maynard

--
Sent via pgsql-performance mailing list (pgsql-
[email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

