On Sat, Feb 3, 2024 at 8:55 AM Lok P <loknath...@gmail.com> wrote:

> Apologies, one correction: the query is as below, i.e. the filter will be
> on ctid, which I believe is the equivalent of rowid in Oracle, so we will
> not need the index on the ID column.
>
>  But it still runs long, so is there any other way to make the duplicate
> removal faster?
>
> Also wondering: the index creation took ~2.5hrs+. Could it have been made
> faster in any way, by allowing more db resources through some
> session-level db parameter settings?
>
> create table TAB1_New
> as
> SELECT  * from TAB1 A
> where CTID in
>       (select min(CTID) from TAB1
>       group by ID having count(ID)>=1 );
>
>
> On Sat, Feb 3, 2024 at 5:50 PM Lok P <loknath...@gmail.com> wrote:
>
>> Hello All,
>> A non-partitioned table has ~4.8 billion rows, a data size of ~1.4TB, a
>> row size of ~313 bytes, and ~127 columns. Approximately 1 billion+
>> duplicate rows have been inserted into it, and we want to remove the
>> duplicate data and then create a PK/unique constraint so that duplicate
>> values cannot occur in future. We are struggling to do this.
>>
>> Teammates suggested doing this using the CTAS method, i.e. create a new
>> table with the unique record set and then drop the main table. Something
>> like below:
>>
>> create table TAB1_New
>> as
>> SELECT  * from TAB1 A
>> where ID in
>>       (select min(ID) from TAB1
>>       group by ID having count(ID)>=1 );
>>
>> But for the above to work faster, they suggested creating an index on the
>> column used for the duplicate check, i.e. the ID column. Creating the
>> index itself took ~2hrs+, and the index size now shows as ~116GB.
>>
>>
>> Create index idx1 on TAB1(ID)
>> We then ran the SELECT subquery part of the CTAS statement to check
>> whether it returns the correct unique record count. It ran for 2.5 hrs
>> and we killed it. Below is the plan for it.
>>
>> explain
>> select min(ID) from TAB1 A group by ID having count(ID)>=1
>>
>> GroupAggregate  (cost=0.71..6025790113.87 rows=29432861 width=46)
>>   Group Key: ID
>>   Filter: (count(ID) >= 1)
>>   ->  Index Only Scan using idx1 on TAB1 a  (cost=0.71..5988060903.17 rows=4883397120 width=14)
>>
>> I want to understand whether this can be done faster in any way. I am
>> also worried that recreating the PK constraint/index after deleting the
>> duplicates will run forever. Is there any way to make these heavy
>> operations faster on Postgres by providing more database resources
>> through some parameter setup, like a parallel hint etc.? We have the
>> pg_hint_plan extension installed, but the parallel hint is not being
>> enforced when we add it to the query.
>>
>> In Oracle we have parallel hints, direct-path read/write for faster
>> read/write operations, parallel index scans, etc. Is anything similar
>> available in Aurora Postgres to provide more horsepower and speed up
>> these batch operations? And how can we monitor the progress of a running
>> query?
>>  Just to note - it's a db.r7g.8xlarge Aurora Postgres instance, 32 vCPU,
>> 256GB RAM, PG version 15.4.
>>
>
 Aurora is not PostgreSQL, so these configurations might not work.  Having
said that...
https://www.postgresql.org/docs/15/how-parallel-query-works.html
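For example, a minimal session-level sketch to encourage a parallel plan
(these are stock PostgreSQL settings; whether Aurora honors them, and the
worker count of 8, are assumptions, not recommendations):

```sql
-- Session-level settings to encourage a parallel plan for the big scan.
-- (Assumption: Aurora accepts these stock PostgreSQL GUCs; it may clamp
-- or ignore them.)
SET max_parallel_workers_per_gather = 8;  -- workers per Gather node
SET parallel_setup_cost = 0;              -- remove the planner's bias
SET parallel_tuple_cost = 0;              -- against parallel plans
SET min_parallel_table_scan_size = 0;

-- Then re-check the plan: look for Gather / Parallel Seq Scan nodes.
EXPLAIN
SELECT min(ctid) FROM tab1 GROUP BY id;
```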

And have you analyzed the table lately?  Also, what's your work_mem
and maintenance_work_mem?
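As a rough sketch of what I would check first (again stock PostgreSQL;
Aurora may restrict some of these, and the sizes below are illustrative
only):

```sql
-- Current settings
SHOW work_mem;
SHOW maintenance_work_mem;

-- Session-level bumps for a one-off rebuild (illustrative values only).
SET maintenance_work_mem = '8GB';          -- used by CREATE INDEX sorts
SET max_parallel_maintenance_workers = 8;  -- parallel btree index builds

-- Refresh planner statistics before re-running EXPLAIN.
ANALYZE tab1;

-- Progress monitoring (PostgreSQL 12+):
SELECT * FROM pg_stat_progress_create_index;    -- running index builds
SELECT pid, now() - query_start AS runtime, state, left(query, 60)
FROM pg_stat_activity
WHERE state <> 'idle';                          -- long-running queries
```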
