Re: How to guarantee dataset is split over unique partitions (partitioned by a column value)

2022-06-20 Thread Sean Owen
repartition() puts all rows with the same key in one partition, but several other keys can end up in that same partition. It sounds like you want groupBy, not repartition, if you want to handle each key separately.

On Mon, Jun 20, 2022 at 8:26 AM DESCOTTE Loic - externe wrote:
> Hi,
>
> I have
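
The distinction in the reply above can be illustrated with a small model in plain Scala collections. This is only a sketch and does not call Spark: the hash function (`hashCode` with `floorMod`), the `value` field, the `PartitionModel` object and the sample rows are all assumptions made for illustration, and Spark's real partitioner hashes differently. The point that carries over is that with more distinct keys than partitions, some partition must hold more than one key, whereas grouping by key yields exactly one group per key.

```scala
// Simplified model only: this is NOT Spark's partitioner, just an illustration.
// `value` is a hypothetical field standing in for the elided columns of Data.
final case class Data(col: String, value: Int)

object PartitionModel {
  // Map a key to a partition index in [0, numPartitions) (assumed hash scheme).
  def partitionOf(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Model of repartitioning by key: rows are bucketed by partition index,
  // so equal keys always share a bucket, but a bucket may hold several keys.
  def hashPartition(rows: Seq[Data], numPartitions: Int): Map[Int, Seq[Data]] =
    rows.groupBy(r => partitionOf(r.col, numPartitions))

  // Model of grouping: exactly one group per distinct key.
  def groupByKey(rows: Seq[Data]): Map[String, Seq[Data]] =
    rows.groupBy(_.col)

  // Hypothetical sample data with three distinct keys.
  val sample: Seq[Data] =
    Seq(Data("a", 1), Data("a", 2), Data("b", 3), Data("c", 4))

  def main(args: Array[String]): Unit = {
    // Three keys into two partitions: at least one partition mixes keys.
    hashPartition(sample, 2).foreach { case (pid, rs) =>
      println(s"partition $pid holds keys ${rs.map(_.col).distinct.mkString(", ")}")
    }
    // Grouping keeps each key on its own.
    groupByKey(sample).foreach { case (key, rs) =>
      println(s"key $key has ${rs.size} row(s)")
    }
  }
}
```

In Spark itself the corresponding operations would be `repartition` on the key column and `groupByKey` on the Dataset. Whether grouping also satisfies the co-location requirement in the original question cannot be determined from the truncated message.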

How to guarantee dataset is split over unique partitions (partitioned by a column value)

2022-06-20 Thread DESCOTTE Loic - externe
Hi,

I have a data type like this: case class Data(col: String, ...) and a Dataset[Data] ds. Some rows have the column filled with the value 'a', others with the value 'b', etc. I want to process all the data with an 'a' separately from all the data with a 'b'. But I also need to have all the 'a' in the

Re: How reading works?

2022-06-20 Thread Sid
Hi Team,

Can somebody help?

Thanks,
Sid

On Sun, Jun 19, 2022 at 3:51 PM Sid wrote:
> Hi,
>
> I already have a partitioned JSON dataset in s3 like the below:
>
> edl_timestamp=2022090800
>
> Now, the problem is, in the earlier 10 days of data collection there was a
> duplicate columns