[ https://issues.apache.org/jira/browse/SPARK-28975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ruiliang updated SPARK-28975:
-----------------------------
    Attachment: QQ图片20190905011256.png

> How do I overwrite a piece of data and recalculate it?
> ------------------------------------------------------
>
>                 Key: SPARK-28975
>                 URL: https://issues.apache.org/jira/browse/SPARK-28975
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: ruiliang
>            Priority: Blocker
>              Labels: spark
>         Attachments: QQ图片20190905011256.png
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> I need to compute, in real time, the total amount and total quantity of
> today's orders. The pushed data can contain repeated order IDs, so I need to
> deduplicate by order ID and order time: keep the record with the latest order
> time, overwrite the earlier data for that order, and then recalculate the
> totals.
> I see that the documentation has this feature:
>
> // Without watermark using guid column
> streamingDf.dropDuplicates("guid")
>
> However, this only prevents duplicate rows from being added; it does not
> overwrite the old data. I want the new data to overwrite the old data and
> then recalculate something like sum(money), but I have not found an API for
> this. Is there one?
> I also considered the StructuredSessionization example, which maintains state
> for a single id. Can that example be used to compute the total online time
> across all ids, that is, something like
> state: GroupState[SessionInfo] -> sum(durationMs)? Is there any other
> solution? Thank you.
>
> !image-2019-09-05-01-04-31-308.png!
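
The requirement described in the issue (the latest record per order ID wins, and the totals are then recomputed) can be sketched in plain Scala, independent of Spark. This is only an illustration of the update rule, not an answer from the Spark project: the `Order` and `Totals` types and their field names (`orderId`, `orderTime`, `money`) are assumptions made up for this sketch, and ties on `orderTime` are arbitrarily resolved in favour of the record already held. In Structured Streaming, logic of this shape would presumably sit inside a stateful operation such as the `GroupState`-based one the reporter mentions; that wiring is not shown here.

```scala
// Hypothetical record type: the field names are assumptions, not taken from the issue.
final case class Order(orderId: String, orderTime: Long, money: BigDecimal)

// Latest known version of each order, plus the running sum over those versions.
final case class Totals(latest: Map[String, Order], sum: BigDecimal) {
  // Number of distinct orders seen so far.
  def count: Int = latest.size

  // Apply one incoming record.
  // - A newer version of a known order replaces the old one and the sum is
  //   adjusted by the difference.
  // - An older (or equally old) version is ignored.
  // - An unseen order is simply added.
  def upsert(o: Order): Totals = latest.get(o.orderId) match {
    case Some(prev) if prev.orderTime >= o.orderTime => this
    case Some(prev) => Totals(latest.updated(o.orderId, o), sum - prev.money + o.money)
    case None       => Totals(latest.updated(o.orderId, o), sum + o.money)
  }
}

object Totals {
  val empty: Totals = Totals(Map.empty, BigDecimal(0))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val pushed = Seq(
      Order("A", 1L, BigDecimal(10)),
      Order("B", 2L, BigDecimal(5)),
      Order("A", 3L, BigDecimal(12)), // newer version of A: 10 is replaced by 12
      Order("B", 1L, BigDecimal(99))  // older version of B: ignored
    )
    val totals = pushed.foldLeft(Totals.empty)(_ upsert _)
    println(s"count=${totals.count} sum=${totals.sum}") // prints "count=2 sum=17"
  }
}
```

Keeping the previous version of each order in the state is what allows the sum to be corrected incrementally instead of being recomputed from scratch; the cost is that the state grows with the number of distinct order IDs, so a real job would need some way to expire old entries (for example, at the end of the day).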