Re: SELECT and UPDATE together, with LIMIT? (incremental external indexing)

2005-02-07 Thread SGreen
Nick Arnett [EMAIL PROTECTED] wrote on 02/07/2005 02:06:30 PM:

 I'm considering the best strategy for updating large tables that I'm
 indexing with Lucene. I'm indexing in chunks of 5000 items at a time,
 which I select with LIMIT statements and an incremental multiplier
 (SELECT blah FROM foo LIMIT 0, 5000; SELECT blah FROM foo LIMIT 5000,
 5000; etc.).
 
 After a group is added to the Lucene index, I want to update a column
 that is a flag to show that the record is done (TINYINT(1) UNSIGNED
 NOT NULL). This is to support incremental indexing (and so that if an
 indexing process fails or is interrupted, it will resume with minimal
 redundancy).
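
A sketch of that flag column, assuming the source table is named foo and
borrowing the column name index_flag from the reply below (DEFAULT 0,
meaning "not yet indexed", is an assumption):

ALTER TABLE foo
    ADD COLUMN index_flag TINYINT(1) UNSIGNED NOT NULL DEFAULT 0;
-- DEFAULT 0 marks rows as not yet indexed; the indexer sets the
-- flag to 1 once a row has been added to the Lucene index.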
 
 The best approach I've imagined is to create a HEAP table and, for
 each cycle, fill it with the next 5000 primary keys from the source
 table, SELECT using a join against the HEAP table, then UPDATE on the
 same join when the Lucene indexing is successful.
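
A minimal sketch of one such cycle, assuming foo's primary key is a
hypothetical column msg_id and using the flag column sketched above:

CREATE TEMPORARY TABLE batch (
    msg_id INT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=HEAP;    -- TYPE=HEAP on older 4.0 servers

-- Park the next 5000 primary keys; the offset advances by 5000 each
-- cycle (0, 5000, 10000, ...), as in the LIMIT examples above.
INSERT INTO batch
SELECT msg_id FROM foo LIMIT 0, 5000;

-- Feed the batch to Lucene via a join against the key table.
SELECT f.*
FROM foo f
JOIN batch b ON b.msg_id = f.msg_id;

-- Once the Lucene add succeeds, flag the batch as done...
UPDATE foo f
JOIN batch b ON b.msg_id = f.msg_id
SET f.index_flag = 1;

-- ...and empty the key table for the next cycle.
DELETE FROM batch;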
 
 Can anybody suggest anything else? I looked at using LOCK IN SHARE
 MODE, but I don't see how to be sure that a SELECT ... LIMIT ...
 statement will select the same messages as an UPDATE ... LIMIT ...
 unless I lock the whole table, which seems needless.
 
 Nick
 
 

Your solution is as good as any I can think of. If your source tables are
InnoDB, you could wrap your SELECT/UPDATE processing in a transaction to
help make it even more robust. It might speed things up if you omit the
offset from your LIMIT clause and just do:

SELECT pkfield
FROM source_table
WHERE index_flag=0
LIMIT 5000;

That way you always find the first 5000 un-indexed records, regardless
of physical position. Record order is not guaranteed unless you use an
ORDER BY clause, and that would just slow you down, wouldn't it?
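
Combined with the transaction suggestion, one cycle might then look like
this sketch, using the query's own names this time (source_table, pkfield,
index_flag) and a key table batch like the one in the first message:

BEGIN;

-- Pick the first 5000 un-indexed keys, wherever they happen to sit.
INSERT INTO batch
SELECT pkfield
FROM source_table
WHERE index_flag = 0
LIMIT 5000;

-- ... run the Lucene pass on a SELECT joined against batch ...

-- Flag the batch as done only after the Lucene add succeeds.
UPDATE source_table s
JOIN batch b ON b.pkfield = s.pkfield
SET s.index_flag = 1;

COMMIT;

DELETE FROM batch;

The flag-based pick also makes restart cheap: a failed run leaves
index_flag = 0 on the unfinished rows, so the next cycle picks them up
automatically.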


Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



Re: SELECT and UPDATE together, with LIMIT? (incremental external indexing)

2005-02-07 Thread Nick Arnett
[EMAIL PROTECTED] wrote:
Your solution is as good as any I can think of. If your source tables 
are InnoDB you could wrap your SELECT/UPDATE processing in a transaction 
to help make it even more robust. It might speed things up if you omit
the offset from your LIMIT clause and just do:
Yes, they are InnoDB, and I now routinely do such things in transactions.
And I did remove the LIMIT offset, which was left over from the first
tests.

That way you always find the first 5000 un-indexed records, regardless
of physical position. Record order is not guaranteed unless you use an
ORDER BY clause, and that would just slow you down, wouldn't it?
Indeed, which was one reason I went with the HEAP table... which is also
a TEMPORARY table, so that two clients can work concurrently.
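
On the concurrency point: a TEMPORARY table is private to the connection
that creates it, so each indexing client can create the same hypothetical
batch table from the earlier sketch without conflict:

-- Each client session runs this independently; the two batch tables
-- never collide, and each vanishes when its connection closes.
CREATE TEMPORARY TABLE batch (
    msg_id INT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=HEAP;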

Thanks.
Nick
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]