mattmartin14 commented on PR #1534:
URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2635043790
> @mattmartin14 more useful:
>
> ```
> df_1 = pd.DataFrame({'name': ["tom", "matt"]})
> tbl_1 = pa.Table.from_pandas(df_1)
>
> df_2 = pd.DataFrame({'name': ["tom", "harry"]})
> tbl_2 = pa.Table.from_pandas(df_2)
>
> mask = pa.compute.is_in(tbl_1["name"], value_set=tbl_2["name"])
> tbl_1.filter(mask)
> ```
Thanks @tscottcoombes1 and @Fokko - got the pyarrow filter to work for
identifying rows to insert (no datafusion needed). Code for that is below. Now
i'm moving onto figuring out the filter for rows to update, which will be a
little more tricky:
```python
def get_rows_to_insert(source_table: pa.Table, target_table: pa.Table,
join_cols: list) -> pa.Table:
source_filter_expr = None
for col in join_cols:
target_values = target_table.column(col).to_pylist()
expr = pc.field(col).isin(target_values)
if source_filter_expr is None:
source_filter_expr = expr
else:
source_filter_expr = source_filter_expr & expr
non_matching_expr = ~source_filter_expr
source_columns = set(source_table.column_names)
target_columns = set(target_table.column_names)
common_columns = source_columns.intersection(target_columns)
non_matching_rows =
source_table.filter(non_matching_expr).select(common_columns)
return non_matching_rows
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]