[jira] [Commented] (SPARK-17662) Dedup UDAF

Herman van Hovell (JIRA) Wed, 16 Nov 2016 11:43:35 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671401#comment-15671401
 ]


Herman van Hovell commented on SPARK-17662:
-------------------------------------------

This is more of a question for the user list or Stack Overflow, so I am closing 
this.

BTW: I would use max, for example: 
{noformat}
select user_id,
       action_type,
       max(struct(date, *)) last_record
from   tbl
group by 1,2
{noformat}
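
The query above relies on structs being compared field by field, left to right, so placing date first makes max return the whole row with the latest date for each group. A minimal plain-Python sketch of that idea follows; it uses tuples as a stand-in for structs and the sample rows from the issue description. The function name last_records is hypothetical, and this models the semantics only, not Spark itself (null handling in particular is not covered).

```python
# Plain-Python model of: max(struct(date, *)) ... group by user_id, action_type.
# Rows are (user_id, action_type, value, date), copied from the issue's sample data.
rows = [
    (123, "fav category", 1, "2016-02-01"),
    (123, "fav category", 4, "2016-02-02"),
    (123, "fav category", 8, "2016-02-03"),
    (123, "fav category", 2, "2016-02-04"),
]


def last_records(rows):
    """Return the greatest (date, user_id, action_type, value, date) tuple per key."""
    best = {}
    for user_id, action_type, value, date in rows:
        key = (user_id, action_type)
        # The date comes first, so tuple comparison is decided by it,
        # mirroring struct(date, *). ISO dates sort correctly as strings.
        candidate = (date, user_id, action_type, value, date)
        if key not in best or candidate > best[key]:
            best[key] = candidate
    return best


result = last_records(rows)
```

Note that when two rows of a group share the same date, the remaining fields break the tie, so the choice of field order after the date matters.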

> Dedup UDAF
> ----------
>
>                 Key: SPARK-17662
>                 URL: https://issues.apache.org/jira/browse/SPARK-17662
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Ohad Raviv
>
> We have a common use case of deduping a table in creation order.
> For example, we have an event log of user actions. A user marks his favorite 
> category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id  action_type   value  date
> 123      fav category  1      2016-02-01
> 123      fav category  4      2016-02-02
> 123      fav category  8      2016-02-03
> 123      fav category  2      2016-02-04
> We would like to get only the last update by the date column.
> We could of course do it in SQL:
> select * from (
>   select *, row_number() over (partition by user_id, action_type order by date desc) as rnum
>   from tbl)
> where rnum = 1;
> But then, I believe it can't be optimized on the mapper side, and all the 
> data will be shuffled to the reducers instead of being partially aggregated 
> on the map side.
> We have written a UDAF for this, but then we have other issues, such as 
> blocking predicate push-down for columns.
> Do you have any idea for a proper solution?
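
The shuffle concern in the description can be illustrated with a small sketch. Because max is associative and commutative, each partition can be reduced to one candidate per key before any data moves, and the per-partition winners can then be merged. The plain-Python model below assumes two hypothetical partitions built from the sample rows; partial_max and merge are illustrative names, not Spark APIs.

```python
def partial_max(partition):
    """Map side: keep only the latest (date, value) per key within one partition."""
    best = {}
    for user_id, action_type, value, date in partition:
        key = (user_id, action_type)
        candidate = (date, value)
        if key not in best or candidate > best[key]:
            best[key] = candidate
    return best


def merge(partials):
    """Reduce side: combine the per-partition winners into the final result."""
    final = {}
    for partial in partials:
        for key, candidate in partial.items():
            if key not in final or candidate > final[key]:
                final[key] = candidate
    return final


# Hypothetical split of the sample rows across two partitions.
partitions = [
    [(123, "fav category", 1, "2016-02-01"), (123, "fav category", 4, "2016-02-02")],
    [(123, "fav category", 8, "2016-02-03"), (123, "fav category", 2, "2016-02-04")],
]

partials = [partial_max(p) for p in partitions]
shuffled = sum(len(p) for p in partials)  # records that would cross the shuffle
final = merge(partials)
```

In this model only one record per key and partition crosses the shuffle (two records here instead of four), which is the partial aggregation the description asks for.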



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

