[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706817#comment-14706817 ]
Koert Kuipers commented on SPARK-3655: -------------------------------------- hey nick, i believe your problem sounds like a good fit for GroupSorted. you can process the sessions per machine sorted by timestamp as an iterator using mapStreamByKey, keeping state as you iterate to detect status-code changes and then assign some identifier to break up sessions accordingly. the only tricky part is keeping state as you iterate. i am not sure of my suggested way below is the best. pseudo scala code: rdd[(MachineId, Session)].groupSort(numPartitions, timestampOrdering).mapStreamByKey{ iterator => var uuid = UUID.randomUUID.toString iterator.map{ session => if (session.statusCode == 123) uuid = UUID.randomUUID.toString (uuid, session) } } the way to do this without a mutable state would be something like Iterator.scanLeft > Support sorting of values in addition to keys (i.e. secondary sort) > ------------------------------------------------------------------- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core > Affects Versions: 1.1.0, 1.2.0 > Reporter: koert kuipers > Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org