Hi All

I have an RDD, which I partition based on some key, and then call sc.runJob
for each partition. Inside this function, I assign each partition a unique
key using the following:

"%s_%s" % (id(part), int(round(time.time()))

This is to make sure that each partition produces separate bookkeeping
records, which can be aggregated by an external system. However, I sometimes
notice multiple partition results pointing to the same partition_id. Is this
some issue with the way the above code is serialized by PySpark? What's the
best way to define a unique id for each partition?

I understand that it is the same executor processing multiple partitions,
but I would still expect the above code to produce a unique id for each one.
Could it be that id() (which in CPython is just the object's memory address)
gets reused once an earlier partition's object is garbage collected, and
that rounding time.time() to whole seconds lets two partitions processed
within the same second collide?
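
For what it is worth, the workaround I am testing derives the id from the
partition index via mapPartitionsWithIndex instead of id() and the clock
(sketch below; the bookkeeping is again a placeholder):

    def tag_partition(index, part):
        # Spark assigns the partition index, so it is unique within the RDD;
        # a job-level tag can be appended if ids must be unique across jobs.
        part_id = "partition_%d" % index
        for record in part:
            # ... bookkeeping keyed by part_id ...
            yield (part_id, record)

    tagged = rdd.mapPartitionsWithIndex(tag_partition)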



Regards
Sumit Chawla
