I believe I logged an issue first and I should get a response first. I was ignored.
Regards

On Thu, 26 Mar 2020 at 17:24, Enrico Minack <m...@enrico.minack.dev> wrote:

> Abhinav,
>
> you can repartition by your key, then sortWithinPartitions, and then
> groupByKey. Since the data are already hash-partitioned by key, Spark
> should not shuffle the data and hence should not change the sort within
> each partition:
>
> ds.repartition($"key").sortWithinPartitions($"code").groupBy($"key")
>
> Enrico
>
> Am 26.03.20 um 17:53 schrieb Ranjan, Abhinav:
>
> Hi,
>
> I have a dataframe which has data like:
>
> key | code | code_value
> 1   | c1   | 11
> 1   | c2   | 12
> 1   | c2   | 9
> 1   | c3   | 12
> 1   | c2   | 13
> 1   | c2   | 14
> 1   | c4   | 12
> 1   | c2   | 15
> 1   | c1   | 12
>
> I need to group the data by key and then apply some custom logic to every
> value I got by grouping. So I did this (suppose the data is in a
> dataframe df):
>
> case class key_class(key: String, code: String, code_value: String)
>
> df
>   .as[key_class]
>   .groupByKey(_.key)
>   .mapGroups { (key, groupedValues) =>
>     val status = groupedValues.map { row =>
>       // do some custom logic on row
>       "SUCCESS"
>     }.toList
>     status
>   }
>   .toDF("status")
>
> The issue with the above approach is that the values I get after
> groupByKey are not sorted/ordered. I want the values to be sorted by the
> column 'code'.
>
> Approaches I have considered:
>
> 1. Collect them into a list and then sort ==> this will result in OOM if
> the iterator is too big.
>
> 2. Somehow apply a secondary sort, but the problem with that approach is
> that I have to keep track of the key change.
>
> 3. sortWithinPartitions cannot be applied because groupBy will mess up
> the order.
>
> 4.
> Another approach is:
>
> df
>   .as[key_class]
>   .sort("key", "code")
>   .map { row =>
>     // do stuff here
>   }
>
> but here also I have to keep track of the key change within the map
> function, and sometimes this also overflows if the keys are skewed.
>
> So is there any way in which I can get the values sorted after grouping
> them by a key?
>
> Thanks,
>
> Abhinav
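For reference, Enrico's suggestion combined with Abhinav's `mapGroups` pattern could look like the sketch below. This is an untested sketch, not a verified solution: the `key_class` case class and column names come from the thread, the sample data and `SortedGroups` object are made up for illustration, and whether Spark's physical plan actually preserves the per-partition sort through `groupByKey` should be checked with `.explain()` on the real data.

```scala
import org.apache.spark.sql.SparkSession

// From the thread: one row of the input data.
case class key_class(key: String, code: String, code_value: String)

object SortedGroups {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sorted-groups").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data mirroring the table in the question.
    val df = Seq(
      key_class("1", "c1", "11"),
      key_class("1", "c2", "12"),
      key_class("1", "c2", "9")
    ).toDF()

    val result = df
      .repartition($"key")                   // hash-partition rows by key
      .sortWithinPartitions($"key", $"code") // sort each partition by key, then code
      .as[key_class]
      .groupByKey(_.key)                     // per Enrico: should not reshuffle,
                                             // data is already partitioned by key
      .mapGroups { (key, rows) =>
        // If the sort survives the grouping, rows arrive in 'code' order here.
        rows.map(_ => "SUCCESS").toList
      }
      .toDF("status")

    result.show()
    spark.stop()
  }
}
```

Since `rows` is an `Iterator`, the custom logic can stream over the sorted values without materializing the whole group, which avoids the OOM concern from approach 1.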