On 10/12/2010 6:01 PM, Lawrence D'Oliveiro wrote:
In message <4cb4ba4e$0$1641$742ec...@news.sonic.net>, John Nagle wrote:

In general, if you find yourself making millions of
SQL database requests in a loop, you're doing it wrong.

I’ve done this. Not millions, but certainly on the order of tens of
thousands.

   It's a scaling issue.  Millions of INSERT or UPDATE requests can
take hours.  That's when you need the speedup of bulk loading.

      Big database loads are usually done by creating a text file
with the desired data, then using a LOAD DATA INFILE command.
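A minimal sketch of that pattern (not from the original post; the table and column layout are hypothetical, and the statement would be executed through whatever MySQL driver you use, e.g. MySQLdb):

```python
import csv

def dump_rows(rows, path):
    # Write rows as a tab-delimited text file, which is the
    # default line format expected by LOAD DATA INFILE.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)

def load_statement(table, path):
    # Build the bulk-load statement; run it with your cursor,
    # e.g. cursor.execute(load_statement("urls", "/tmp/urls.txt")).
    return "LOAD DATA INFILE '%s' INTO TABLE %s" % (path, table)
```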

May not always be flexible enough.

   True; you can't do anything with LOAD DATA INFILE but load data.
If you need selects or joins within inserts, you may have to do it
the long way.

This (in MySQL) is tens to hundreds of times faster than doing individual
INSERT or UPDATE commands.

Why should that be? The database cannot read a text file any faster than I
can.

    Because the indices are updated in one big update, rather than
after each change to the database.

    Also note that there are some issues with doing a huge volume of
updates in one MySQL InnoDB transaction.  The system has to keep the
data needed to undo the updates, and there's a limit on the amount of
pending transaction history that can be stored.
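The usual workaround is to commit in batches so no single transaction has to hold millions of pending rows. A sketch of that, using SQLite here only so the example is runnable as-is (with MySQLdb the paramstyle is `%s` and the table is hypothetical):

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=10000):
    # Commit every batch_size rows so the pending transaction
    # history stays bounded instead of growing with the whole load.
    cur = conn.cursor()
    count = 0
    for row in rows:
        cur.execute("INSERT INTO items (name) VALUES (?)", row)
        count += 1
        if count % batch_size == 0:
            conn.commit()
    conn.commit()  # flush the final partial batch
    return count
```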

    It's common to load data into a new, empty table, then, once
the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
Rename of multiple tables is atomic and interlocked with other
operations, so you can replace an entire table on a live server.
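The swap itself is one statement. A sketch of building it (table names are hypothetical; in MySQL the whole multi-table RENAME TABLE happens atomically):

```python
def swap_statement(current, old, new):
    # RENAME TABLE renames several tables in one atomic, interlocked
    # operation, so readers never observe a missing table.
    return ("RENAME TABLE %s TO %s, %s TO %s"
            % (current, old, new, current))
```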

    I have some bulk databases which are updated from external
sources.  The PhishTank database is updated with UPDATE statements
every three hours. But the Open Directory database is updated by
downloading a big gzipped file of XML, creating a new database
table, then renaming.  That load takes hours, once a week.

    (All this applies to MySQL, and to some extent, Postgres.
If you're using SQLite, it's different.  But a million records
is big for SQLite, which is, after all, a "lite" database.
At 10,000 records, you don't have to worry about any of this
stuff.  At 1,000,000 records, you do.)
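Even on SQLite, the one lesson that still applies is to batch the inserts into a single transaction rather than letting each row autocommit. A runnable sketch with the standard library:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
rows = [(i,) for i in range(10000)]
# One executemany inside a single transaction: SQLite commits
# once instead of 10,000 times (on disk, that means one fsync).
with conn:
    conn.executemany("INSERT INTO t (n) VALUES (?)", rows)
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 10000
```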

                                        John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
