Re: How batch processing works

Lok P Sat, 21 Sep 2024 08:25:49 -0700

On Sat, Sep 21, 2024 at 8:07 PM Peter J. Holzer <hjp-pg...@hjp.at> wrote:


> On 2024-09-21 16:44:08 +0530, Lok P wrote:
> > But wondering why we don't see any difference in performance between
> method-2
> > and method-3 above.
>
> The code runs completely inside the database. So there isn't much
> difference between a single statement which inserts 50 rows and 50
> statements which insert 1 row each. The work to be done is (almost) the
> same.
>
> This changes once you consider an application which runs outside of the
> database (maybe even on a different host). Such an application has to
> wait for the result of each statement before it can send the next one.
> Now it makes a difference whether you are waiting 50 times for a
> statement which does very little or just once for a statement which does
> more work.
>
> > So does it mean that,I am testing this in a wrong way or
>
> That depends on what you want to test. If you are interested in the
> behaviour of stored procedures, the test is correct. If you want to know
> about the performance of a database client (whether its written in Java,
> Python, Go or whatever), this is the wrong test. You have to write the
> test in your target language and run it on the client system to get
> realistic results (for example, the round-trip times will be a lot
> shorter if the client and database are on the same computer than when
> one is in Europe and the other in America).
>
> For example, here are the three methods as Python scripts:
>
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import time
> import psycopg2
>
> num_inserts = 10_000
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
>
> start_time = time.monotonic()
> for i in range(1, num_inserts+1):
>     csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
>     db.commit()
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 1: Individual Inserts with Commit after every Row:
> {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import time
> import psycopg2
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
> db.commit()
>
> start_time = time.monotonic()
> for i in range(1, num_inserts+1):
>     csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
>     if i % batch_size == 0:
>         db.commit()
> db.commit()
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 2: Individual Inserts with Commit after {batch_size}  Rows:
> {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import itertools
> import time
> import psycopg2
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
> db.commit()
>
> start_time = time.monotonic()
> batch = []
> for i in range(1, num_inserts+1):
>     batch.append((i, 'a'))
>     if i % batch_size == 0:
>         q = "insert into parent_table values" + ",".join(["(%s, %s)"] *
> len(batch))
>         params = list(itertools.chain.from_iterable(batch))
>         csr.execute(q, params)
>         db.commit()
>         batch = []
> if batch:
>     q = "insert into parent_table values" + ",".join(["(%s, %s)"] *
> len(batch))
>     csr.execute(q, list(itertools.chain(batch)))
>     db.commit()
>     batch = []
>
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 3: Batch Inserts ({batch_size})  with Commit after each
> batch: {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
>
> On my laptop, method2 is about twice as fast as method3. But if I
> connect to a database on the other side of the city, method2 is now more
> than 16 times faster than method3 . Simply because the delay in
> communication is now large compared to the time it takes to insert those
> rows.
>
>
Thank you so much.
I was expecting method-3(batch insert) to be the fastest or atleast as you
said perform with similar speed as method-2 (row by row insert with batch
commit) if we do it within the procedure inside the database. But because
the context switching will be minimal in method-3 as it will prepare the
insert and submit to the database in one shot in one DB call, so it should
be a bit fast. But from your figures , it appears to be the opposite ,
i.e.method-2 is faster than method-3. Not able to understand the reason
though. So in this case then ,it appears we can follow method-2 as that is
cheaper in regards to less code change , i.e. just shifting the commit
points without any changes for doing the batch insert.

Btw,Do you have any thoughts,  why method-2 is faster as compared to
method-3 in your test?

Re: How batch processing works

Reply via email to