Ohad Raviv created SPARK-17662: ---------------------------------- Summary: Dedup UDAF Key: SPARK-17662 URL: https://issues.apache.org/jira/browse/SPARK-17662 Project: Spark Issue Type: New Feature Reporter: Ohad Raviv
We have a common use case od deduping a table in a creation order. For example, we have an event log of user actions. A user marks his favorite category from time to time. In our analytics we would like to know only the user's last favorite category. The data: user_id action_type value date 123 fav category 1 2016-02-01 123 fav category 4 2016-02-02 123 fav category 8 2016-02-03 123 fav category 2 2016-02-04 we would like to get only the last update by the date column. we could of-course do it in sql: select * from ( select *, row_number() over (partition by user_id,action_type order by date desc) as rnum from tbl) where rnum=1; but then, I believe it can't be optimized on the mappers side and we'll get all the data shuffled to the reducers instead of partially aggregated in the map side. We have written a UDAF for this, but then we have other issues - like blocking push-down-predicate for columns. do you have any idea for a proper solution? -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org