kosiew opened a new pull request, #1185:
URL: https://github.com/apache/datafusion-python/pull/1185

   ## Which issue does this PR close?
   
   - Closes #1173
   
   ## Rationale for this change
   
   This change improves usability and interoperability by adding an optional 
`deduplicate` flag to `DataFrame.join()`. In many real-world datasets, the join 
keys have the same name in both tables. Before this PR, joining such tables 
produced conflicting column names in the output, which callers had to resolve 
through manual disambiguation or column renaming.
   
   This feature aligns with the behavior in other DataFrame libraries like 
PySpark, making joins easier and more intuitive for users by optionally 
removing duplicate join columns from the right-hand side.
   
   ## What changes are included in this PR?
   
   - Introduced a `deduplicate` argument to the `DataFrame.join()` method.
   - Refactored join key resolution into a `_prepare_join` helper method.
   - Implemented `_deduplicate_right` logic to rename right-hand join keys with 
unique aliases.
   - Dropped the aliased join columns automatically once the join has used them.
   - Applied `coalesce` to fill `null` join-key values in right and full outer 
joins.
   - Updated the user guide with documentation and examples of disambiguation 
and deduplication.
   - Added comprehensive tests for:
     - Basic deduplication
     - Multi-column joins with deduplication
     - Select behavior post-join
     - All supported join types (inner, left, right, full)
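
The alias, coalesce, and drop steps listed above can be sketched in plain Python over lists of row dictionaries. This is only an illustration of the technique under stated assumptions, not the code in this PR and not the DataFusion API: the function name `join_deduplicated` and the `__right_` alias prefix are hypothetical, and the sketch assumes that only the join keys share names between the two inputs.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_deduplicated(
    left: List[Row], right: List[Row], on: List[str], how: str = "inner"
) -> List[Row]:
    """Join two lists of row dicts and keep one copy of each join key.

    Hypothetical helper for illustration; column sets are taken from the
    first row of each input, so empty inputs contribute no columns.
    """
    if how not in {"inner", "left", "right", "full"}:
        raise ValueError(f"unsupported join type: {how}")

    # Step 1: rename the right-hand join keys to aliases that cannot collide.
    alias = {key: f"__right_{key}" for key in on}
    right_aliased = [
        {alias.get(name, name): value for name, value in row.items()}
        for row in right
    ]
    left_cols = list(left[0]) if left else []
    right_cols = list(right_aliased[0]) if right_aliased else []

    # Step 2: nested-loop join; null keys never match, as in SQL.
    joined: List[Row] = []
    matched_right = set()
    for l_row in left:
        hit = False
        for i, r_row in enumerate(right_aliased):
            if all(
                l_row[k] is not None and l_row[k] == r_row[alias[k]]
                for k in on
            ):
                joined.append({**l_row, **r_row})
                matched_right.add(i)
                hit = True
        if not hit and how in {"left", "full"}:
            joined.append({**l_row, **{c: None for c in right_cols}})
    if how in {"right", "full"}:
        for i, r_row in enumerate(right_aliased):
            if i not in matched_right:
                joined.append({**{c: None for c in left_cols}, **r_row})

    # Step 3: coalesce each key with its alias, then drop the alias column.
    result: List[Row] = []
    for row in joined:
        merged = dict(row)
        for key in on:
            right_value = merged.pop(alias[key])
            if merged.get(key) is None:
                merged[key] = right_value
        result.append(merged)
    return result


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "score": 20}, {"id": 3, "score": 30}]
full = join_deduplicated(left, right, on=["id"], how="full")
inner = join_deduplicated(left, right, on=["id"], how="inner")
```

In the full join, the row that exists only on the right keeps `id == 3` because the coalesce step falls back to the aliased right-hand key. Without that step the single remaining `id` column would be `None` for that row.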
   
   ## Are these changes tested?
   
   ✅ Yes. Tests are included to verify:
   - Join outputs match expectations with and without deduplication.
   - Correct schema after deduplication.
   - Select operations behave as expected post-join.
   - All supported join types are covered.
   
   ## Are there any user-facing changes?
   
   ✅ Yes. A new `deduplicate` keyword argument is available for 
`DataFrame.join()`. When enabled, it automatically removes duplicate join 
columns from the right DataFrame, simplifying common workflows and avoiding 
column naming conflicts.
   
   This feature is backward-compatible and opt-in.
   
   📘 User documentation has been updated with detailed usage examples, best 
practices, and behavior notes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

