Hi Julian,

Breaking up the queries this way, I am able to update about 1,500 rows
per minute, which will take over 100 days to complete, so I need to figure
out why this is slow and whether there is a faster way.

UPDATE table1 SET new_column = table2.new_column FROM table2
WHERE table1.row_id = table2.row_id
  AND table2.row_id >= $1 AND table2.row_id < $2;

Here is the explain from that:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..39.78 rows=1 width=121)
   ->  Index Scan using table2_pkey on table2  (cost=0.00..19.88 rows=1 width=12)
         Index Cond: ((row_id >= $1) AND (row_id < $2))
   ->  Index Scan using table1_pkey on table1  (cost=0.00..19.90 rows=1 width=113)
         Index Cond: (table1.row_id = table2.row_id)
(5 rows)
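For completeness, a small driver for these windowed updates might look like the following. This is a minimal sketch, not part of the original thread: the 1,000,000-row window size is an assumption, and the %s placeholders stand in for the $1/$2 parameters above.

```python
# Sketch of a batch driver for the windowed UPDATE discussed above.
# The table and column names come from this thread; the window size
# (step) and the windows() helper are assumptions for illustration.

SQL = (
    "UPDATE table1 SET new_column = table2.new_column FROM table2 "
    "WHERE table1.row_id = table2.row_id "
    "AND table2.row_id >= %s AND table2.row_id < %s"
)

def windows(max_row_id, step=1_000_000):
    """Yield half-open (lo, hi) row_id ranges covering [0, max_row_id)."""
    for lo in range(0, max_row_id, step):
        yield lo, min(lo + step, max_row_id)

# With a psycopg2-style connection this would run roughly as:
#   for lo, hi in windows(300_000_000):
#       cur.execute(SQL, (lo, hi))
#       conn.commit()  # commit per window to keep transactions short
```

Committing per window keeps each transaction short, so a failure partway through loses only the current window rather than the whole run.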


Thanks,
--
Joshua Rubin


On Fri, Jul 16, 2010 at 1:05 PM, Joshua Rubin <jru...@esoft.com> wrote:

> Hi Julian,
>
> Sorry for the slow response. I think I will need to chop up the query
> somehow, but have not yet found an efficient way to do that. row_id is the
> primary key in both tables, so that might work.
>
> Here is the explain:
> urls_jrubin_merged=# EXPLAIN UPDATE table1 SET new_column = table2.new_column
> FROM table2 WHERE table1.row_id = table2.row_id;
>                                       QUERY PLAN
> --------------------------------------------------------------------------------------
>  Merge Join  (cost=57257969.62..12983795937.97 rows=4308749788074 width=121)
>    Merge Cond: (table2.row_id = table1.row_id)
>    ->  Sort  (cost=15885110.79..16029412.85 rows=288604128 width=8)
>          Sort Key: table2.row_id
>          ->  Seq Scan on table2  (cost=0.00..2137231.26 rows=288604128 width=8)
>    ->  Materialize  (cost=41372858.83..42105903.14 rows=293217725 width=121)
>          ->  Sort  (cost=41372858.83..41519467.69 rows=293217725 width=121)
>                Sort Key: table1.row_id
>                ->  Seq Scan on table1  (cost=0.00..5922587.45 rows=293217725 width=121)
> (9 rows)
>
>
> Thanks,
> --
> Joshua Rubin
>
>
>
> On Tue, Jul 13, 2010 at 5:08 PM, Julian Mehnle <jul...@mehnle.net> wrote:
>
>> Joshua Rubin wrote:
>>
>> > I have two tables each with nearly 300M rows. There is a 1:1
>> > relationship between the two tables and they are almost always joined
>> > together in queries. The first table has many columns, the second has
>> > a foreign key to the primary key of the first table and one more
>> > column. It is expected that for every row in table1, there is a
>> > corresponding row in table2. We would like to just add the one column
>> > to the first table and drop the second table to allow us to index this
>> > extra column.
>> >
>> > This query would work after adding the column to the first table:
>> > UPDATE table1 SET new_column = table2.new_column FROM table2 WHERE
>> > table1.row_id = table2.row_id;
>> >
>> > However, this will take much too long, I have not successfully
>> > completed this on our staging server after running it for 3+ days.
>>
>> Can you get the query plan (EXPLAIN) of the update query?  My guess is the
>> join cost scales superlinearly.
>>
>> You might be able to chop this up into smaller UPDATEs by limiting the
>> rows to be updated in each round by the primary key.
>>
>> E.g.:
>>
>>  UPDATE table1 SET new_column = table2.new_column FROM table2
>>    WHERE
>>      table1.row_id = table2.row_id and
>>      table1.row_id >= 0e6 and table1.row_id < 1e6 and
>>      table2.row_id >= 0e6 and table2.row_id < 1e6;
>>
>> for a moving row_id window.
>>
>> This has helped me in the past with a similar scenario (where both tables
>> were partitioned by the PK, but it would presumably still work in the
>> unpartitioned case).
>>
>> -Julian
>>
>
>
