[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632256#comment-17632256 ]
Jacek Pliszka edited comment on ARROW-15474 at 11/11/22 11:04 AM:
------------------------------------------------------------------

[~westonpace] maybe an approach similar to what I proposed, but in a better version, would work?

We need a compute function that, for a given array of values, returns the index of the first/last appearance of each value.

Then all batches can be processed in parallel and merged at the end, exactly as you described.

Once we have the index of the first/last appearance, we can use compute.take to produce the output table.

Maybe an ordering function could even be specified, so there would be no need to sort the array a priori.

was (Author: jacek.pliszka):
[~westonpace] maybe an approach similar to what I proposed, but in a better version, would work?

We need a compute function that, for a given array of values, returns the index of the first/last appearance of each value.

Then all batches can be processed in parallel and merged at the end, exactly as you described.

Once we have the index of the first/last appearance, we can use compute.take to produce the output table.

> [Python] Possibility of a table.drop_duplicates() function?
> -----------------------------------------------------------
>
>                 Key: ARROW-15474
>                 URL: https://issues.apache.org/jira/browse/ARROW-15474
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 6.0.1
>            Reporter: Lance Dacey
>            Priority: Major
>
> I noticed that there are group_by() and sort_by() functions in the 7.0.0
> branch. Is it possible to include a drop_duplicates() function as well?
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this, which would return a table without the second row in the
> example above, would be great.
> I am usually reading an append-only dataset and then I need to report on the
> latest version of each row. To drop duplicates, I am temporarily converting
> the append-only table to a pandas DataFrame, and then I convert it back to a
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", "ascending")]).drop_duplicates(subset=["id"], keep="last")
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)