Re: [I] Apache Iceberg -- Row level duplicates elimination without any data loss similar to df.dropDuplicates() method [iceberg]

via GitHub Thu, 27 Mar 2025 02:00:49 -0700


Muskan-m commented on issue #12659:
URL: https://github.com/apache/iceberg/issues/12659#issuecomment-2757242828


   
   
   As a solution I am currently achieving it in below fashion
   
   1. create branch on iceberg table
   2. create data frame just selecting data from branch and apply window 
function Row_NUMBER() and select only records where rowNo is 2
   3. Execute MERGE INTO to delete the records from main table using data frame 
-- Upon it's completion execute step 4
   4. Execute MERGE INTO to insert distinct records from branch -- Upon it's 
completion execute step 5
   5. Drop the branch
   
   `spark.sql(
             """CREATE TABLE IF NOT EXISTS db.dedup_demo_part_drop
                     (id BIGINT, name STRING, role STRING, salary 
double,joining_date STRING) USING iceberg PARTITIONED BY (joining_date)"""
         )
         
         
   spark.sql(""" INSERT INTO db.dedup_demo values (1, 'Harry', 'Software 
Engineer', 25000,"2025-03-01"), (2, 'John', 'Marketing Ops', 
17000,"2025-03-01")""")    
   
   
   spark.sql("ALTER TABLE db.dedup_demo CREATE BRANCH duplicationTest")
   
   
   spark.sql(""" describe db.dedup_demo """).show(false)
   
   
   val df1 = spark.sql(""" select * from (SELECT  id, name, role, salary,
                     ROW_NUMBER() OVER (PARTITION BY id, name, role, salary 
ORDER BY id, name, role, salary DESC) AS rowNo
                   FROM  db.dedup_demo VERSION AS OF 'duplicationTest') where 
rowNo = 2 """)
                   
   df1.createOrReplaceTempView("source_deduplicate")               
   
   
   
   spark.sql(""" MERGE INTO db.dedup_demo AS target
         USING source_deduplicate AS source
         ON target.id = source.id
            AND target.name = source.name
            AND target.role = source.role
            AND target.salary = source.salary
         WHEN MATCHED  THEN
             DELETE
         """)
         
         
   spark.sql("SELECT * FROM db.dedup_demo VERSION AS OF 
'duplicationTest'").show(false)      
   
   
   
   spark.sql(""" MERGE INTO db.dedup_demo AS target
         USING (select distinct * from  db.dedup_demo VERSION AS OF 
'duplicationTest') AS source
         ON target.id = source.id
            AND target.name = source.name
            AND target.role = source.role
            AND target.salary = source.salary
         WHEN NOT MATCHED  THEN
             INSERT *
         """)
   
   spark.sql(s"""ALTER TABLE ${tblName} DROP BRANCH ${branchName}""")`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Apache Iceberg -- Row level duplicates elimination without any data loss similar to df.dropDuplicates() method [iceberg]

Reply via email to