kosiew opened a new pull request, #1185: URL: https://github.com/apache/datafusion-python/pull/1185
## Which issue does this PR close?

- Closes #1173

## Rationale for this change

This change improves usability and interoperability by adding an optional `deduplicate=True` flag to DataFrame joins. In many real-world datasets, join keys exist in both tables with the same name. Prior to this PR, this resulted in conflicting column names in the output and required manual disambiguation or column renaming.

This feature aligns with the behavior in other DataFrame libraries like PySpark, making joins easier and more intuitive for users by optionally removing duplicate join columns from the right-hand side.

## What changes are included in this PR?

- Introduced a `deduplicate` argument to the `DataFrame.join()` method.
- Refactored join key resolution into a `_prepare_join` helper method.
- Implemented `_deduplicate_right` logic to rename right-hand join keys with unique aliases.
- Automatically drops aliased join columns after use.
- Applies `coalesce` to resolve `null` values in outer/right joins.
- Updated the user guide with documentation and examples of disambiguation and deduplication.
- Added comprehensive tests for:
  - Basic deduplication
  - Multi-column joins with deduplication
  - Select behavior post-join
  - All supported join types (inner, left, right, full)

## Are these changes tested?

✅ Yes. Tests are included to verify:

- Join outputs match expectations with and without deduplication.
- Correct schema after deduplication.
- Select operations behave as expected post-join.
- All supported join types are covered.

## Are there any user-facing changes?

✅ Yes. A new `deduplicate` keyword argument is available for `DataFrame.join()`. When enabled, it automatically removes duplicate join columns from the right DataFrame, simplifying common workflows and avoiding column naming conflicts. This feature is backward-compatible and opt-in.

📘 User documentation has been updated with detailed usage examples, best practices, and behavior notes.
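To illustrate the approach described in this PR (alias the right-hand join keys, join, coalesce the key values, then drop the aliases), here is a minimal pure-Python sketch over lists of dicts. It is not the code from this PR and does not use the datafusion API: the function `join_dedup`, the helper `coalesce`, and the `__right_` alias prefix are all hypothetical names chosen for illustration. It assumes both inputs are non-empty, every row in a table has the same columns, and only the join keys share names across the two tables.

```python
from typing import Any

Row = dict[str, Any]


def coalesce(*values: Any) -> Any:
    """Return the first value that is not None (like SQL COALESCE)."""
    for value in values:
        if value is not None:
            return value
    return None


def join_dedup(left: list[Row], right: list[Row], on: list[str],
               how: str = "inner") -> list[Row]:
    """Join two row lists and keep a single copy of each join key.

    Hypothetical illustration only. Unlike SQL, a None key here compares
    equal to another None key.
    """
    if how not in {"inner", "left", "right", "full"}:
        raise ValueError(f"unsupported join type: {how}")

    # Step 1: rename right-hand join keys to unique aliases.
    alias = {key: f"__right_{key}" for key in on}
    right_aliased = [{alias.get(col, col): val for col, val in row.items()}
                     for row in right]
    left_cols = list(left[0])
    right_cols = list(right_aliased[0])

    # Step 2: nested-loop join on left key == aliased right key.
    joined: list[Row] = []
    matched_right: set[int] = set()
    for lrow in left:
        hit = False
        for i, rrow in enumerate(right_aliased):
            if all(lrow[key] == rrow[alias[key]] for key in on):
                joined.append({**lrow, **rrow})
                matched_right.add(i)
                hit = True
        if not hit and how in {"left", "full"}:
            # Unmatched left row: right-hand columns are null.
            joined.append({**lrow, **dict.fromkeys(right_cols)})
    if how in {"right", "full"}:
        for i, rrow in enumerate(right_aliased):
            if i not in matched_right:
                # Unmatched right row: left-hand columns are null.
                joined.append({**dict.fromkeys(left_cols), **rrow})

    # Step 3: coalesce each key with its alias, then drop the alias column.
    aliased_names = set(alias.values())
    result: list[Row] = []
    for row in joined:
        out: Row = {}
        for col, val in row.items():
            if col in aliased_names:
                continue
            out[col] = coalesce(val, row[alias[col]]) if col in alias else val
        result.append(out)
    return result


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "score": 20}, {"id": 3, "score": 30}]
for row in join_dedup(left, right, on=["id"], how="full"):
    print(row)
```

The coalesce step matters for right and full joins: without it, a row that exists only on the right would end up with a null `id` once the aliased column is dropped.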