[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483481#comment-17483481 ]
Lance Dacey commented on ARROW-15474:
-------------------------------------

Ahh, that would be great. Random is a bit risky for my use case since I generally care about the latest version.

I found [this repository|https://github.com/TomScheffers/pyarrow_ops/tree/main/pyarrow_ops] which has a method to drop duplicates that I might be able to adopt in the meantime. I would need to digest exactly what the code below is doing a bit more, but I think there are some compute functions like `pc.sort_indices`, `pc.unique`, etc. that could probably be used to replace some of the numpy code.

{code:python}
import numpy as np


def drop_duplicates(table, on=[], keep='first'):
    # Gather columns to arr
    arr = columns_to_array(table, (on if on else table.column_names))

    # Groupify
    dic, counts, sort_idxs, bgn_idxs = groupify_array(arr)

    # Gather idxs
    if keep == 'last':
        idxs = (np.array(bgn_idxs) - 1)[1:].tolist() + [len(sort_idxs) - 1]
    elif keep == 'first':
        idxs = bgn_idxs
    elif keep == 'drop':
        idxs = [i for i, c in zip(bgn_idxs, counts) if c == 1]
    return table.take(sort_idxs[idxs])


def groupify_array(arr):
    # Input: Pyarrow/Numpy array
    # Output:
    #   - 1. Unique values
    #   - 2. Count per unique
    #   - 3. Sort index
    #   - 4. Begin index per unique
    dic, counts = np.unique(arr, return_counts=True)
    sort_idx = np.argsort(arr)
    return dic, counts, sort_idx, [0] + np.cumsum(counts)[:-1].tolist()


def combine_column(table, name):
    return table.column(name).combine_chunks()


f = np.vectorize(hash)


def columns_to_array(table, columns):
    columns = ([columns] if isinstance(columns, str) else list(set(columns)))
    if len(columns) == 1:
        return f(combine_column(table, columns[0]).to_numpy(zero_copy_only=False))
    else:
        values = [c.to_numpy() for c in table.select(columns).itercolumns()]
        return np.array(list(map(hash, zip(*values))))
{code}

> [Python] Possibility of a table.drop_duplicates() function?
> -----------------------------------------------------------
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
> Issue Type: Wish
> Affects Versions: 6.0.1
> Reporter: Lance Dacey
> Priority: Major
> Fix For: 8.0.0
>
>
> I noticed that there are group_by() and sort_by() functions in the 7.0.0
> branch. Is it possible to include a drop_duplicates() function as well?
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this, which would return a table without the second row in the
> example above, would be great.
> I am usually reading an append-only dataset and then need to report on the
> latest version of each row. To drop duplicates, I am temporarily converting
> the append-only table to a pandas DataFrame, and then I convert it back to a
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at",
> "ascending")]).drop_duplicates(subset=["id"], keep="last")
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
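
The index selection done by the quoted `drop_duplicates` helper (sort the keys, find each run of equal keys, then keep the first row, the last row, or only singleton rows) can be illustrated without numpy or pyarrow. The following is a minimal pure-Python sketch, not the repository's or Arrow's implementation; the function name `dedup_indices` is hypothetical, and the sample rows are the ones from the issue description.

```python
from itertools import groupby


def dedup_indices(keys, keep="first"):
    """Return the row positions to keep for a list of key values.

    Hypothetical helper: keep is 'first', 'last', or 'drop'
    ('drop' discards every key that occurs more than once).
    """
    if keep not in ("first", "last", "drop"):
        raise ValueError(f"unsupported keep value: {keep!r}")

    # Row positions ordered by key. sorted() is stable, so rows that share
    # a key stay in their original input order.
    order = sorted(range(len(keys)), key=keys.__getitem__)

    kept = []
    # Each run of equal keys in the sorted order is one group of duplicates.
    for _, run in groupby(order, key=keys.__getitem__):
        run = list(run)
        if keep == "first":
            kept.append(run[0])
        elif keep == "last":
            kept.append(run[-1])
        elif len(run) == 1:  # keep == "drop": only keys that are unique
            kept.append(run[0])
    return kept


# Rows from the issue description, already ordered by updated_at within each id.
ids = [1, 2, 2]
updated_at = ["2022-01-01 04:23:57", "2022-01-01 07:19:21", "2022-01-10 22:14:01"]

latest = dedup_indices(ids, keep="last")
rows = [(ids[i], updated_at[i]) for i in latest]
```

With a table, the resulting positions would be passed to `table.take(...)`, as the quoted helper does. One point to check when adapting the numpy version: `np.argsort` does not guarantee a stable order unless `kind="stable"` is passed, so "last" within a group only means "latest" if the sort preserves the input order of equal keys.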