paultipper opened a new issue, #11953:
URL: https://github.com/apache/iceberg/issues/11953
### Apache Iceberg version
1.6.1
### Query engine
Spark
### Please describe the bug 🐞
I'm trying to use the Apache Spark MERGE INTO command to add/update some
data from a source data frame into an Apache Iceberg table within an AWS Glue
table using an AWS Glue job running Spark 3.5. If the source data frame is
empty, then all of the existing data in the target table is deleted.
Here is a sample of the Python code I'm using to do this:
```
# df is a data frame of the source data, and is passed into this code block
df.createOrReplaceTempView("source_data")
# Get start year, month and day from start_date, which is a datetime object
# passed into this code block
year = start_date.year
month = start_date.month
day = start_date.day
print(f"start_date: {start_date}, year: {year}, month: {month}, day: {day}")
# Generate the WHERE part of the statement
where_clause = (
    f"WHERE year >= {year} AND (year > {year} OR month >= {month}) "
    f"AND (year > {year} OR month > {month} OR day >= {day})"
)
selected_df = spark.sql(f"SELECT * FROM source_data {where_clause}")
logger.info(f"New CSV rows selected for merging: {selected_df.count()}")
selected_df.createOrReplaceTempView("new_data")
# Merge the selected rows into the target table
spark.sql("""
    MERGE INTO iceberg_catalog.db.target_table t
    USING new_data AS s
    ON (t.surrogate_key = s.surrogate_key)
    WHEN MATCHED THEN
        UPDATE SET *
    WHEN NOT MATCHED THEN
        INSERT *
""")
```
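As a sanity check on the date filter in the snippet above, the following is a minimal pure-Python sketch of the same predicate. It suggests that the filter keeps exactly the rows whose (year, month, day) falls on or after `start_date`, so an empty `selected_df` is a legitimate outcome whenever every source row predates `start_date`. The helper names `build_where_clause` and `row_selected` and the example date are hypothetical and are not part of the original job.

```python
from datetime import date


def build_where_clause(start: date) -> str:
    """Hypothetical helper: builds the same predicate string as the job."""
    y, m, d = start.year, start.month, start.day
    return (
        f"WHERE year >= {y} AND (year > {y} OR month >= {m}) "
        f"AND (year > {y} OR month > {m} OR day >= {d})"
    )


def row_selected(row: tuple, start: date) -> bool:
    """Pure-Python mirror of the SQL predicate for one (year, month, day) row."""
    ry, rm, rd = row
    y, m, d = start.year, start.month, start.day
    return ry >= y and (ry > y or rm >= m) and (ry > y or rm > m or rd >= d)


start = date(2024, 12, 15)  # example value, not taken from the report
print(build_where_clause(start))
print(row_selected((2024, 12, 14), start))  # row one day before start_date
print(row_selected((2024, 12, 15), start))  # row exactly on start_date
```

The predicate behaves like the tuple comparison `(year, month, day) >= (start.year, start.month, start.day)`, which may be an easier form to reason about when checking why no rows were selected.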
Before the MERGE INTO operation, the target table contained 8246 rows, and
I've established that the `selected_df` data frame contained 0 rows. My
expectation is that merging `selected_df` into the target table should leave
the target table unchanged, but I found that after the MERGE INTO operation
the target table was empty. My assumption is that the MERGE INTO command will
insert any rows in `selected_df` that do not already exist in the target
table, update any rows that do exist, and leave in place any rows in the
target table that are not in `selected_df`. Is my assumption incorrect?
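To make the expected semantics concrete, here is a minimal pure-Python model of a merge with `WHEN MATCHED THEN UPDATE SET *` and `WHEN NOT MATCHED THEN INSERT *`. It is only a sketch of the behavior described above, not Iceberg's or Spark's implementation. The function name `merge_upsert` and the sample rows are hypothetical, and it assumes `surrogate_key` is unique within both the target and the source.

```python
def merge_upsert(target: list, source: list, key: str) -> list:
    """Model of the expected MERGE result (assumes unique keys on both sides)."""
    # Start from every existing target row, indexed by its key.
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: the source row replaces the target row (UPDATE SET *).
        # Unmatched key: the source row is added (INSERT *).
        merged[row[key]] = dict(row)
    # Target rows with no match in the source are never touched.
    return list(merged.values())


# Sample data for illustration only, not taken from the report.
target = [
    {"surrogate_key": 1, "value": "a"},
    {"surrogate_key": 2, "value": "b"},
]

unchanged = merge_upsert(target, [], "surrogate_key")
print(len(unchanged))  # an empty source should leave the row count as it was
```

Under this model an empty source is a no-op, which is the behavior the report expects from the real MERGE INTO.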
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [X] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]