Nick Arnett <[EMAIL PROTECTED]> wrote on 02/07/2005 02:06:30 PM:

> I'm considering the best strategy for updating large tables that I'm
> indexing with Lucene.  I'm indexing in chunks of 5000 items at a time,
> which I select with LIMIT statements and an incremental offset (SELECT
> blah FROM foo LIMIT 0, 5000; SELECT blah FROM foo LIMIT 5000, 5000;
> etc.).
> 
> After a group is added to the Lucene index, I want to update a column
> that is a flag to show that the record is done (TINYINT(1) UNSIGNED
> NOT NULL).  This is to support incremental indexing (and so that if an
> indexing process fails or is interrupted, it will resume with minimal
> redundancy).
> 
> The best approach I've imagined is to create a HEAP table and, for
> each cycle, fill it with the next 5000 primary keys from the source
> table, SELECT using a join against the HEAP table, then UPDATE on the
> same join when the Lucene indexing is successful.
> 
> Can anybody suggest anything else?  I looked at using LOCK IN SHARE
> MODE, but I don't see how to be sure that a SELECT ... LIMIT ...
> statement will select the same messages as an UPDATE ... LIMIT ...
> unless I lock the whole table, which seems needless.
> 
> Nick
> 
> 
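
If I follow the HEAP-table cycle you describe, each pass would look 
roughly like the sketch below (batch_keys is a name I made up; 
source_table, pkfield, and index_flag match the example further down):

CREATE TABLE batch_keys (pkfield INT UNSIGNED NOT NULL PRIMARY KEY)
TYPE=HEAP;

# each cycle, load the next batch of keys (offset grows by 5000):
DELETE FROM batch_keys;
INSERT INTO batch_keys
SELECT pkfield FROM source_table LIMIT 5000, 5000;

# pull the rows for Lucene through the join:
SELECT s.*
FROM source_table s
INNER JOIN batch_keys b ON b.pkfield = s.pkfield;

# after Lucene reports success, flag the same rows as done:
UPDATE source_table s
INNER JOIN batch_keys b ON b.pkfield = s.pkfield
SET s.index_flag = 1;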

Your solution is as good as any I can think of. If your source tables 
are InnoDB, you could wrap the SELECT/UPDATE processing in a transaction 
to make it even more robust. It might also speed things up to omit the 
offset from your LIMIT clause and just do: 

SELECT pkfield
FROM source_table
WHERE index_flag=0
LIMIT 5000;

That way you always find the "first" 5000 un-indexed records, regardless 
of physical position. Record order is not guaranteed unless you use an 
ORDER BY clause, and that would just slow you down, wouldn't it?
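
Combining the flag test with a transaction, a whole pass might look 
something like this sketch (assuming InnoDB; FOR UPDATE takes exclusive 
row locks on just the rows it selects, so you avoid locking the whole 
table; your application fills in the IN list from the keys it just 
selected):

BEGIN;

SELECT pkfield
FROM source_table
WHERE index_flag=0
LIMIT 5000
FOR UPDATE;

# hand those rows to Lucene; if the indexer fails, ROLLBACK and
# nothing gets marked done

UPDATE source_table
SET index_flag=1
WHERE pkfield IN ( ...the keys just selected... );

COMMIT;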


Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
