Muskan-m commented on issue #12659:
URL: https://github.com/apache/iceberg/issues/12659#issuecomment-2757242828
As a solution I am currently achieving it in below fashion
1. create branch on iceberg table
2. create data frame just selecting data from branch and apply window
function Row_NUMBER() and select only records where rowNo is 2
3. Execute MERGE INTO to delete the records from main table using data frame
-- Upon it's completion execute step 4
4. Execute MERGE INTO to insert distinct records from branch -- Upon it's
completion execute step 5
5. Drop the branch
`spark.sql(
"""CREATE TABLE IF NOT EXISTS db.dedup_demo_part_drop
(id BIGINT, name STRING, role STRING, salary
double,joining_date STRING) USING iceberg PARTITIONED BY (joining_date)"""
)
spark.sql(""" INSERT INTO db.dedup_demo values (1, 'Harry', 'Software
Engineer', 25000,"2025-03-01"), (2, 'John', 'Marketing Ops',
17000,"2025-03-01")""")
spark.sql("ALTER TABLE db.dedup_demo CREATE BRANCH duplicationTest")
spark.sql(""" describe db.dedup_demo """).show(false)
val df1 = spark.sql(""" select * from (SELECT id, name, role, salary,
ROW_NUMBER() OVER (PARTITION BY id, name, role, salary
ORDER BY id, name, role, salary DESC) AS rowNo
FROM db.dedup_demo VERSION AS OF 'duplicationTest') where
rowNo = 2 """)
df1.createOrReplaceTempView("source_deduplicate")
spark.sql(""" MERGE INTO db.dedup_demo AS target
USING source_deduplicate AS source
ON target.id = source.id
AND target.name = source.name
AND target.role = source.role
AND target.salary = source.salary
WHEN MATCHED THEN
DELETE
""")
spark.sql("SELECT * FROM db.dedup_demo VERSION AS OF
'duplicationTest'").show(false)
spark.sql(""" MERGE INTO db.dedup_demo AS target
USING (select distinct * from db.dedup_demo VERSION AS OF
'duplicationTest') AS source
ON target.id = source.id
AND target.name = source.name
AND target.role = source.role
AND target.salary = source.salary
WHEN NOT MATCHED THEN
INSERT *
""")
spark.sql(s"""ALTER TABLE ${tblName} DROP BRANCH ${branchName}""")`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]