Nick Arnett [EMAIL PROTECTED] wrote on 02/07/2005 02:06:30 PM:

I'm considering the best strategy for updating large tables that I'm indexing with Lucene. I'm indexing in chunks of 5000 items at a time, which I select with LIMIT statements and an incrementing offset (SELECT blah FROM foo LIMIT 0, 5000; SELECT blah FROM foo LIMIT 5000, 5000; etc.). After a group is added to the Lucene index, I want to update a column that is a flag to show that the record is done (TINYINT(1) UNSIGNED NOT NULL). This is to support incremental indexing (and so that if an indexing process fails or is interrupted, it will resume with minimal redundancy).

The best approach I've imagined is to create a HEAP table and, for each cycle, fill it with the next 5000 primary keys from the source table, SELECT using a join against the HEAP table, then UPDATE on the same join when the Lucene indexing is successful.

Can anybody suggest anything else? I looked at using LOCK IN SHARE MODE, but I don't see how to be sure that a SELECT ... LIMIT ... statement will select the same messages as an UPDATE ... LIMIT ... unless I lock the whole table, which seems needless.

Nick
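Nick's staging-table cycle can be sketched as follows. This is a minimal illustration using Python's sqlite3 module as a stand-in for MySQL (SQLite has no HEAP engine, so a TEMP table plays that role), and the table and column names (messages, id, body, indexed_flag) as well as the index_batch() function are hypothetical placeholders, not anything from the original post:

```python
import sqlite3

def index_batch(rows):
    """Placeholder for the real Lucene indexing step."""
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT, "
             "indexed_flag INTEGER NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO messages (body) VALUES (?)",
                 [("message %d" % i,) for i in range(12)])

BATCH = 5  # 5000 in the original post
conn.execute("CREATE TEMP TABLE batch_keys (id INTEGER PRIMARY KEY)")
offset = 0
while True:
    # Refill the staging table with the next chunk of primary keys.
    conn.execute("DELETE FROM batch_keys")
    conn.execute("INSERT INTO batch_keys "
                 "SELECT id FROM messages LIMIT ? OFFSET ?", (BATCH, offset))
    rows = conn.execute("SELECT m.id, m.body FROM messages m "
                        "JOIN batch_keys b ON m.id = b.id").fetchall()
    if not rows:
        break
    if index_batch(rows):
        # UPDATE against the same set of keys once indexing succeeded.
        conn.execute("UPDATE messages SET indexed_flag = 1 "
                     "WHERE id IN (SELECT id FROM batch_keys)")
        conn.commit()
    offset += BATCH

print(conn.execute("SELECT COUNT(*) FROM messages "
                   "WHERE indexed_flag = 1").fetchone()[0])  # prints 12
```

Because the flag is only set after index_batch() reports success, an interrupted run leaves unflagged rows behind to be picked up on the next pass, which is the resumability Nick is after.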
Your solution is as good as any I can think of. If your source tables are InnoDB, you could wrap your SELECT/UPDATE processing in a transaction to help make it even more robust. It might speed things up if you omit the offset from your LIMIT clause and just do:

SELECT pkfield
FROM source_table
WHERE index_flag=0
LIMIT 5000;

That way you always find the first 5000 un-indexed records, regardless of physical position. Record order is not guaranteed unless you use an ORDER BY clause, and that would just slow you down, wouldn't it?
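The offset-free loop Shawn suggests could look like the sketch below, again using sqlite3 as a stand-in for MySQL; source_table, pkfield, and index_flag follow Shawn's example, while the body column and index_batch() are hypothetical:

```python
import sqlite3

def index_batch(rows):
    """Placeholder for the real Lucene indexing step."""
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (pkfield INTEGER PRIMARY KEY, "
             "body TEXT, index_flag INTEGER NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO source_table (body) VALUES (?)",
                 [("row %d" % i,) for i in range(12)])

BATCH = 5  # 5000 in the original post
while True:
    # Always grab the first BATCH un-indexed rows: no offset bookkeeping,
    # and an interrupted run simply resumes where the flags left off.
    rows = conn.execute("SELECT pkfield, body FROM source_table "
                        "WHERE index_flag = 0 LIMIT ?", (BATCH,)).fetchall()
    if not rows:
        break
    if index_batch(rows):
        conn.executemany("UPDATE source_table SET index_flag = 1 "
                         "WHERE pkfield = ?", [(pk,) for pk, _ in rows])
        conn.commit()

print(conn.execute("SELECT COUNT(*) FROM source_table "
                   "WHERE index_flag = 0").fetchone()[0])  # prints 0
```

Note that in real MySQL use an index on index_flag would keep the WHERE index_flag=0 scan cheap, and with InnoDB the SELECT, indexing, and UPDATE for one batch could be wrapped in a single transaction as suggested above.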
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine