
Re: How to guarantee dataset is split over unique partitions (partitioned by a column value)

2022-06-20 Thread Sean Owen

repartition() puts all values with the same key in one partition, but multiple other keys can be in the same partition. It sounds like you want groupBy, not repartition, if you want to handle these separately.

On Mon, Jun 20, 2022 at 8:26 AM DESCOTTE Loic - externe wrote:
> Hi,
>
> I have
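
To illustrate the suggestion above, here is a minimal sketch in Scala. It is not code from the thread. It assumes that groupByKey followed by mapGroups, the typed Dataset counterpart of groupBy, is an acceptable way to "handle these separately". The `value` field, the object name, the sample rows, and the local master setting are all hypothetical, since the original case class elides its other fields.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-in for the thread's Data(col: String, ...);
// the `value` field is an assumption, the real fields are elided in the thread.
case class Data(col: String, value: Int)

object GroupByKeySketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("group-by-key-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Data] =
      Seq(Data("a", 1), Data("b", 2), Data("a", 3), Data("c", 4)).toDS()

    // groupByKey + mapGroups calls the function once per distinct key,
    // with an iterator over only that key's rows. This holds even when
    // several keys end up in the same physical partition.
    val sums: Map[String, Int] = ds
      .groupByKey(_.col)
      .mapGroups((key, rows) => (key, rows.map(_.value).sum))
      .collect()
      .toMap

    assert(sums == Map("a" -> 4, "b" -> 2, "c" -> 4))

    spark.stop()
  }
}
```

By contrast, ds.repartition(col("col")) only guarantees that rows with equal keys land in the same partition. A per-partition function such as mapPartitions may then see rows for several keys in one call.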

How to guarantee dataset is split over unique partitions (partitioned by a column value)

2022-06-20 Thread DESCOTTE Loic - externe

Hi,

I have a data type like this: case class Data(col: String, ...), and a Dataset[Data] named ds. Some rows have the column set to 'a', others to 'b', etc. I want to process all the 'a' rows separately from all the 'b' rows. But I also need to have all the 'a' in the