[ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483481#comment-17483481
 ] 

Lance Dacey commented on ARROW-15474:
-------------------------------------

Ahh, that would be great. Random is a bit risky for my use case since I 
generally care about the latest version.

I found [this 
repository|https://github.com/TomScheffers/pyarrow_ops/tree/main/pyarrow_ops] 
which has a method to drop duplicates that I might be able to adopt in the 
meantime. I would need to digest exactly what is happening below a bit more, 
but I think there are compute functions like `pc.sort_indices`, `pc.unique`, 
etc. that could probably be used to replace some of the NumPy code.

{code:python}
import numpy as np

def drop_duplicates(table, on=[], keep='first'):
    # Gather columns to arr
    arr = columns_to_array(table, (on if on else table.column_names))

    # Groupify
    dic, counts, sort_idxs, bgn_idxs = groupify_array(arr)

    # Gather idxs
    if keep == 'last':
        idxs = (np.array(bgn_idxs) - 1)[1:].tolist() + [len(sort_idxs) - 1]
    elif keep == 'first':
        idxs = bgn_idxs
    elif keep == 'drop':
        idxs = [i for i, c in zip(bgn_idxs, counts) if c == 1]
    else:
        raise ValueError("keep must be 'first', 'last', or 'drop'")
    return table.take(sort_idxs[idxs])

def groupify_array(arr):
    # Input: Pyarrow/Numpy array
    # Output:
    #   - 1. Unique values
    #   - 2. Count per unique
    #   - 3. Sort index
    #   - 4. Begin index per unique
    dic, counts = np.unique(arr, return_counts=True)
    sort_idx = np.argsort(arr)
    return dic, counts, sort_idx, [0] + np.cumsum(counts)[:-1].tolist()

def combine_column(table, name):
    return table.column(name).combine_chunks()

f = np.vectorize(hash)
def columns_to_array(table, columns):
    columns = ([columns] if isinstance(columns, str) else list(set(columns)))
    if len(columns) == 1:
        return f(combine_column(table, columns[0]).to_numpy(zero_copy_only=False))
    else:
        values = [c.to_numpy() for c in table.select(columns).itercolumns()]
        return np.array(list(map(hash, zip(*values))))
{code}
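As a rough sketch of how the new grouping API might replace the NumPy code entirely: with the `Table.group_by()` added in the 7.0.0 branch, duplicates could be dropped by aggregating the min/max of a temporary row-index column. The `drop_duplicates_groupby` helper and the `__row_index__` column name below are my own invention, not an existing pyarrow API:

{code:python}
import numpy as np
import pyarrow as pa

def drop_duplicates_groupby(table, on, keep='first'):
    # Sketch only: attach a temporary row-index column, then keep the
    # min ('first') or max ('last') original row index within each group.
    index_name = '__row_index__'  # hypothetical temporary column name
    table = table.append_column(
        index_name, pa.array(np.arange(len(table)), type=pa.int64()))
    agg = 'min' if keep == 'first' else 'max'
    grouped = table.group_by(on).aggregate([(index_name, agg)])
    indices = grouped.column(index_name + '_' + agg).to_numpy()
    # Note: rows come back in group order, not in the original row order.
    return table.take(indices).drop([index_name])

t = pa.table({'id': [1, 2, 2],
              'updated_at': ['2022-01-01 04:23:57',
                             '2022-01-01 07:19:21',
                             '2022-01-10 22:14:01']})
latest = drop_duplicates_groupby(t, ['id'], keep='last')
{code}

For an append-only dataset, `keep='last'` keeps the last-appearing row per `id`, which should match `keep='last'` after a `sort_by` on `updated_at` when rows are appended in order.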
 

> [Python] Possibility of a table.drop_duplicates() function?
> -----------------------------------------------------------
>
>                 Key: ARROW-15474
>                 URL: https://issues.apache.org/jira/browse/ARROW-15474
>             Project: Apache Arrow
>          Issue Type: Wish
>    Affects Versions: 6.0.1
>            Reporter: Lance Dacey
>            Priority: Major
>             Fix For: 8.0.0
>
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this, which would return a table without the second row in 
> the example above, would be great. 
> I am usually reading an append-only dataset and then need to report on the 
> latest version of each row. To drop duplicates, I temporarily convert the 
> append-only table to a pandas DataFrame, then convert it back to a table 
> and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"], keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
