[ https://issues.apache.org/jira/browse/SPARK-28975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ruiliang updated SPARK-28975:
-----------------------------
    Attachment: QQ图片20190905011256.png

> How do I overwrite a piece of data and recalculate it?
> ------------------------------------------------------
>
>                 Key: SPARK-28975
>                 URL: https://issues.apache.org/jira/browse/SPARK-28975
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: ruiliang
>            Priority: Blocker
>              Labels: spark
>         Attachments: QQ图片20190905011256.png
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I need to compute real-time statistics for the total amount and total
> number of today's orders. The pushed data can contain repeated order IDs,
> so I need to deduplicate by order ID and order time: keep the record with
> the latest order time, overwrite the earlier order data, and then
> recalculate.
>
> I see that the documentation has this feature:
>
>  // Without watermark using guid column
>  streamingDf.dropDuplicates("guid")
>
> However, this only prevents duplicate rows from being added; it does not
> overwrite the old data. I want the new data to overwrite the old data and
> then recalculate something like sum(money), but I have not found an API
> for this. Is there one?
>
> I also considered the StructuredSessionization example, which maintains
> the state of a single id. Can that example compute the total online time
> across all ids? I want a calculation like
> state: GroupState[SessionInfo]) ->sum(durationMs). Is there any other
> solution? Thank you.
> !image-2019-09-05-01-04-31-308.png!
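The requirement described above (the record with the latest order time wins for each order ID, then the totals are recomputed) can be modelled in plain Scala without any Spark dependency. The following is only a sketch under stated assumptions: `Order`, `Totals`, `applyBatch`, and `totals` are hypothetical names chosen for illustration, not Spark APIs, and the field names `orderId`, `orderTime`, and `money` are guesses at the reporter's schema. In Structured Streaming, similar per-key state could presumably be held in a `GroupState` through `mapGroupsWithState`, but that wiring is not shown here.

```scala
// Plain-Scala model of "latest record per order ID wins, then re-aggregate".
// Order, Totals, applyBatch and totals are hypothetical names, not Spark APIs.
final case class Order(orderId: String, orderTime: Long, money: BigDecimal)
final case class Totals(count: Int, money: BigDecimal)

object LatestOrderTotals {
  /** Merge a batch into the per-ID state, keeping the record with the newest orderTime. */
  def applyBatch(state: Map[String, Order], batch: Seq[Order]): Map[String, Order] =
    batch.foldLeft(state) { (acc, incoming) =>
      acc.get(incoming.orderId) match {
        // An equal or older order time means a duplicate or stale record: keep what we have.
        case Some(existing) if existing.orderTime >= incoming.orderTime => acc
        // Unknown ID, or a newer record for a known ID: overwrite.
        case _ => acc.updated(incoming.orderId, incoming)
      }
    }

  /** Recompute the totals from the deduplicated state. */
  def totals(state: Map[String, Order]): Totals =
    Totals(state.size, state.values.map(_.money).sum)

  def main(args: Array[String]): Unit = {
    val batch1 = Seq(Order("A", 1L, BigDecimal(10)), Order("B", 2L, BigDecimal(5)))
    // "A" arrives again with a newer time; "B" arrives again with an older time.
    val batch2 = Seq(Order("A", 3L, BigDecimal(12)), Order("B", 1L, BigDecimal(99)))

    val afterFirst = applyBatch(Map.empty, batch1)
    val afterSecond = applyBatch(afterFirst, batch2)

    println(totals(afterFirst))
    println(totals(afterSecond))
  }
}
```

Unlike `dropDuplicates`, which keeps the first row seen for a key, this keeps the newest one, so a repeated order ID replaces the earlier amount in the sum instead of being ignored. The state grows with the number of distinct order IDs, so a real job would also need some way to expire old entries, for example at the end of the day.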



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
