[ https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671401#comment-15671401 ]
Herman van Hovell commented on SPARK-17662:
-------------------------------------------

This is more of a question for the user list or Stack Overflow, so I am closing this.

BTW: I would use max, for example:
{noformat}
select user_id, action_type, max(struct(date, *)) last_record
from tbl
group by 1,2
{noformat}

> Dedup UDAF
> ----------
>
>                 Key: SPARK-17662
>                 URL: https://issues.apache.org/jira/browse/SPARK-17662
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Ohad Raviv
>
> We have a common use case of deduping a table in creation order.
> For example, we have an event log of user actions. A user marks his favorite
> category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id  action_type   value  date
> 123      fav category  1      2016-02-01
> 123      fav category  4      2016-02-02
> 123      fav category  8      2016-02-03
> 123      fav category  2      2016-02-04
> We would like to get only the last update by the date column.
> We could of course do it in SQL:
> select * from (
>   select *, row_number() over (partition by user_id,action_type order by date desc) as rnum
>   from tbl)
> where rnum=1;
> but then, I believe it can't be optimized on the mappers' side, and we'll get
> all the data shuffled to the reducers instead of partially aggregated on the
> map side.
> We have written a UDAF for this, but then we have other issues, like
> blocking predicate push-down for columns.
> Do you have any idea for a proper solution?
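
For reference, here is a minimal sketch of why the max(struct(date, *)) suggestion in the comment addresses the map-side concern. Spark compares structs field by field, so putting date first makes max pick the row with the latest date, and max is an aggregate that can be computed per partition and then merged. The plain-Python version below only mimics that behaviour with tuples. The names Row, partial_max, and merge are hypothetical and not part of Spark, and the split into two partitions is an assumption made for illustration.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical row type mirroring the columns in the issue description.
Row = namedtuple("Row", ["user_id", "action_type", "value", "date"])

def partial_max(rows):
    """Map-side step: keep only the latest row per (user_id, action_type)."""
    best = {}
    for r in rows:
        key = (r.user_id, r.action_type)
        # Mimics struct(date, *): date is the first field, so it is compared first.
        candidate = (r.date, r)
        if key not in best or candidate > best[key]:
            best[key] = candidate
    return best

def merge(a, b):
    """Reduce-side step: combine two partial results, keeping the larger struct."""
    out = dict(a)
    for key, candidate in b.items():
        if key not in out or candidate > out[key]:
            out[key] = candidate
    return out

# The sample data from the issue, split across two assumed partitions.
partition_1 = [
    Row(123, "fav category", 1, "2016-02-01"),
    Row(123, "fav category", 4, "2016-02-02"),
]
partition_2 = [
    Row(123, "fav category", 8, "2016-02-03"),
    Row(123, "fav category", 2, "2016-02-04"),
]

partials = [partial_max(p) for p in (partition_1, partition_2)]
merged = reduce(merge, partials)

# Drop the leading date used for ordering, keeping just the winning row per key.
last_record = {key: row for key, (_, row) in merged.items()}
```

Each partition forwards at most one candidate per key to the merge step, which is the partial aggregation the reporter was missing with the row_number() approach. ISO-formatted date strings are assumed here so that string comparison matches chronological order.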