westonpace commented on code in PR #14482:
URL: https://github.com/apache/arrow/pull/14482#discussion_r1003583810


##########
python/pyarrow/table.pxi:
##########
@@ -5282,6 +5282,7 @@ class TableGroupBy:
 list[tuple(str, str, FunctionOptions)]
             List of tuples made of aggregation column names followed
             by function names and optionally aggregation function options.
+            Pass empty list to imitate drop_duplicates pandas function.

Review Comment:
   It's not quite the same, though.  Pandas `drop_duplicates` keeps the columns
that are not key columns.  By default it keeps the first row in each group,
though this is configurable.  With an empty aggregation list, only the key
column survives.  For example:
   
   ```
   >>> tab = pa.Table.from_pydict({"x": [1, 1, 1, 2, 2], "y": ["a", "b", "c", "d", "e"]})
   >>> pa.TableGroupBy(tab, "x").aggregate([])
   pyarrow.Table
   x: int64
   ----
   x: [[1,2]]
   ```
   
   With `drop_duplicates` you would also get `y: [["a", "d"]]`.  You can
approximate this with the `one` function, which picks an arbitrary value from
a non-key column ("first" and "last" are difficult concepts within datasets at
the moment), so the values it returns are not guaranteed to be the first ones.
   
   ```
   >>> pa.TableGroupBy(tab, "x").aggregate([("y", "one")])
   pyarrow.Table
   y_one: string
   x: int64
   ----
   y_one: [["a","d"]]
   x: [[1,2]]
   ```
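
For comparison, the pandas behaviour described above can be sketched roughly
as follows.  This is a minimal illustration, assuming pandas is installed; the
names `df`, `first`, and `last` are made up for this example, and the data
mirrors `tab`.

```python
import pandas as pd

# Same data as `tab`, as a pandas DataFrame (names are illustrative).
df = pd.DataFrame({"x": [1, 1, 1, 2, 2], "y": ["a", "b", "c", "d", "e"]})

# Deduplicate on the key column only; the non-key column `y` is kept.
# The default keep="first" retains the first row of each group.
first = df.drop_duplicates(subset="x")
print(first["y"].tolist())  # ['a', 'd']

# keep="last" retains the last row of each group instead.
last = df.drop_duplicates(subset="x", keep="last")
print(last["y"].tolist())  # ['c', 'e']
```

Unlike `one`, the row pandas keeps here is determined by row order, which is
why an empty aggregation list is not a drop-in replacement.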
   
   Either way, maybe this should be:
   
   ```suggestion
               Pass empty list to get a single row for each group.
   ```


