Re: Moving delta data faster

Adrian Klaver Sat, 06 Apr 2024 13:55:41 -0700

On 4/6/24 13:04, yudhi s wrote:

On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver wrote:
    Your original problem description was:

    "Then subsequently these rows will be inserted/updated based on the
    delta number of rows that got inserted/updated in the source database.
    In some cases these changed data can flow multiple times per day to the
    downstream i.e. postgres database and in other cases once daily."

    If the above is not a hard rule, then yes up to some point just
    replacing the data in mass would be the simplest/fastest method. You
    could cut a step out by doing something like TRUNCATE target_tab and
    then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO
    source_tab.
Yes, actually I didn't realize that truncate table is transactional/online here in Postgres. In other databases like Oracle it means downtime for the read queries on the target table, as the data vanishes from the target table after the truncate (until the data load happens), and those are auto-commit. Thanks Veem for sharing that option.
I also think that truncate will be faster if the changes/delta is large, but if it's a handful of rows, like <5% of the rows in the table, then upsert/merge will perform better. The downside of the truncate option is that it requires exporting all the data from the source to the S3 file, which may take longer than bringing just the delta records. Correct me if I'm wrong.
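
For reference, the truncate-and-reload approach discussed above could look roughly like the sketch below. The table name target_tab comes from the earlier message; the file path, the CSV format and the header option are assumptions, since the thread does not say how the S3 data reaches the server.

```sql
-- Minimal sketch: replace the whole table inside one transaction.
-- '/path/to/source.csv' is a placeholder, not a path from this thread.
BEGIN;

-- TRUNCATE is transactional in Postgres: a ROLLBACK restores the old rows.
TRUNCATE target_tab;

-- Server-side COPY reads a file that lives on the database server.
-- From a client machine, psql's \copy is the usual alternative.
COPY target_tab FROM '/path/to/source.csv' WITH (FORMAT csv, HEADER true);

COMMIT;
```

TRUNCATE takes an ACCESS EXCLUSIVE lock, so queries that arrive during the load wait for the COMMIT instead of reading an empty table.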

Since you still have not specified how the data is stored in S3 and how you propose to move it into Postgres, I can't really answer.

However, I am still not able to understand why the upsert is less performant than merge. Could you throw some light on this please?


I have no idea how this works in the code, but my suspicion is it is due
to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or exclusion constraint violation error. For each individual row proposed for insertion, either the insertion proceeds, or, if an arbiter constraint or index specified by conflict_target is violated, the alternative conflict_action is taken. ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative action. ON CONFLICT DO UPDATE updates the existing row that conflicts with the row proposed for insertion as its alternative action."


vs this:

"First, the MERGE command performs a join from data_source to target_table_name producing zero or more candidate change rows. For each candidate change row, the status of MATCHED or NOT MATCHED is set just once, after which WHEN clauses are evaluated in the order specified. For each candidate change row, the first clause to evaluate as true is executed. No more than one WHEN clause is executed for any candidate change row."

ON CONFLICT attempts the INSERT and then, on failure, does the UPDATE for the ON CONFLICT DO UPDATE case. MERGE, on the other hand, evaluates based on the join condition (ON tbl1.fld = tbl2.fld) and then, based on MATCHED/NOT MATCHED, takes the appropriate action for the first WHEN match. In other words, it goes directly to the appropriate action.
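
To make the comparison concrete, here is a sketch of the same delta load written both ways. The schema is hypothetical: target_tab(id, val) with a primary key on id, and a staging table delta_tab(id, val) holding the changed rows. Neither name nor column comes from the original problem description, and this only illustrates the two code paths, not their relative speed.

```sql
-- Hypothetical tables: target_tab(id PRIMARY KEY, val), delta_tab(id, val).

-- Upsert: every row is proposed for insertion; a conflict on id
-- switches that row to the UPDATE branch.
INSERT INTO target_tab (id, val)
SELECT id, val
FROM delta_tab
ON CONFLICT (id) DO UPDATE
    SET val = EXCLUDED.val;

-- MERGE (PostgreSQL 15 and later): join first, then run the first
-- WHEN clause that applies to each candidate row.
MERGE INTO target_tab AS t
USING delta_tab AS d
    ON t.id = d.id
WHEN MATCHED THEN
    UPDATE SET val = d.val
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (d.id, d.val);
```

Both forms assume delta_tab holds at most one row per id. Running each under EXPLAIN ANALYZE against your own data would show whether the difference matters in your case.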


--
Adrian Klaver
