How to guarantee dataset is split over unique partitions (partitioned by a column value)

DESCOTTE Loic - externe Mon, 20 Jun 2022 06:25:48 -0700

Hi,

I have a data type like this :
   case class Data(col: String, ...)


and a Dataset[Data] ds. Some rows have columns filled with value 'a', and other 
with value 'b', etc.

I want to process separately all data with a 'a', and all data with a 'b'. But 
I also need to have all the 'a' in the same partition.

If I do : ds.repartition(col("col")).mapPartition(data => ???)

Is it guaranteed by default that I will have all the 'a' in a single partition, 
and no 'b' mixed with it in this partition?

My understanding is that all the 'a' will be in the same partition, but 'a' and 
'b' may be mixed. So I should do :

val nbDistinct = ds.select("col").distinct.count
ds.repartition(col("col")).mapPartition{ data =>
   // split mixed values in a single partition with group by :
   data.groupBy(_.col).flatMap { case (col, rows) => ??? }
}

Is that correct?


I can also do this to force the number of partitions with the number of 
distinct values :
val nbDistinct = ds.select("col").distinct.count
ds.repartition(nbDistinct , col("col")).mapPartition(data => ???)

But is it useful?
But it adds an action that may be expensive in some cases, and sometimes it 
seems to use less partitions than it should.

Example (spark shell) :
scala> (Range(0,10).map(_ => Data("a")) union Range(0,10).map(_ => Data("b")) 
union Range(0,10).map(_ => 
Data("c"))).toList.toDS.repartition(col("col")).rdd.foreachPartition(part => 
println(part.mkString))

output :
Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)
Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)
Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)

I have 3 partitions + a lot of empty partitions (blank lines)
If I do :

scala> (Range(0,10).map(_ => Data("a")) union Range(0,10).map(_ => Data("b")) 
union Range(0,10).map(_ => 
Data("c"))).toList.toDS.repartition(3,col("col")).rdd.foreachPartition(part => 
println(part.mkString))
then output is :
Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(a)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(b)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)Data(c)

I would like to have 3 partitions with all 'a' (and only the 'a'), then all 'b' 
etc. but I have 2 empty partitions and a partition with all the 'a', 'b' and 
'c' of my dataset.

So, is there a good way to guarantee a dataset is split over unique partitions 
for a column value?

Thanks,
Loïc



Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à 
l'intention exclusive des destinataires et les informations qui y figurent sont 
strictement confidentielles. Toute utilisation de ce Message non conforme à sa 
destination, toute diffusion ou toute publication totale ou partielle, est 
interdite sauf autorisation expresse.

Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le 
copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si 
vous avez reçu ce Message par erreur, merci de le supprimer de votre système, 
ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support 
que ce soit. Nous vous remercions également d'en avertir immédiatement 
l'expéditeur par retour du message.

Il est impossible de garantir que les communications par messagerie 
électronique arrivent en temps utile, sont sécurisées ou dénuées de toute 
erreur ou virus.
____________________________________________________

This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.

How to guarantee dataset is split over unique partitions (partitioned by a column value)

Reply via email to