On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrews...@alaska.edu> wrote:
> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is
> straightforward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update
> (one bad value in row 45 million, lost connection, etc.), all the work
> that has been done already will be lost due to the transactional nature
> of such commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row,
> I have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?
>
> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number
> of logical segregations I can apply), I’m just wondering if there is a
> more automatic option.

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the
primary keys into a side table, and grab tuples from the side table to
process in more bite-sized batches, say, of a few thousand tuples per
batch:

    create table just_keys as
      select pk_column from big_historical_table;
    alter table just_keys add column processed boolean;
    create index jkpk on just_keys (pk_column) where (processed is null);

Then loop repeatedly along these lines, committing after each pass so
that a failure costs you at most the current batch:

    -- grab the next slice of keys still needing work
    create temp table iteration as
      select pk_column from just_keys
       where processed is null
       limit 1000;

    -- run the real update against just that slice
    update big_historical_table
       set changed_field = new_value()
     where pk_column in (select pk_column from iteration);

    -- mark the slice as done in the side table
    update just_keys
       set processed = true
     where pk_column in (select pk_column from iteration);

    drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8
processes, then use a cycling sequence on the side table to spread
tuples across the 8 processes, so that they can grab their own tuples
and not block one another. In that case, it is more like:

    create temp sequence seq_procs start with 1 maxvalue 8 cycle;

    -- the side table needs to be a regular table here, not a temp one,
    -- since all 8 worker sessions have to be able to see it
    create table just_keys as
      select pk_column,
             false::boolean as processed,
             nextval('seq_procs') as batch_id
        from big_historical_table;

The individual iterations then look for values in just_keys
corresponding to their assigned batch number.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
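
PS: to tie the pieces together, here is a sketch of what one worker's
driving loop could look like. Treat it as a sketch rather than tested
code: it assumes PostgreSQL 11 or newer (so COMMIT works inside a DO
block run from top-level psql, outside an explicit transaction),
"batch_id = 3" stands in for whichever number that worker was assigned,
changed_field and new_value() are the names from the original question,
and the FOR UPDATE SKIP LOCKED is just belt and braces in case two
workers ever end up sharing a batch number:

    do $$
    declare
      n bigint;
    begin
      loop
        -- claim the next slice of this worker's keys, update the big
        -- table for that slice, and flag the slice as processed
        with batch as (
          select pk_column
            from just_keys
           where not processed
             and batch_id = 3          -- this worker's assigned number
           limit 1000
             for update skip locked
        ), upd as (
          update big_historical_table t
             set changed_field = new_value()
            from batch b
           where t.pk_column = b.pk_column
        )
        update just_keys jk
           set processed = true
          from batch b
         where jk.pk_column = b.pk_column;

        -- count the rows flagged this pass; none left means we're done
        get diagnostics n = row_count;
        exit when n = 0;

        commit;  -- everything up to here survives a later failure
      end loop;
    end
    $$;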