Re: Dataframe random permutation?

Peter Rudenko Mon, 01 Jun 2015 13:55:06 -0700

Hi Cesar,
try this:

hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema)

It's a bit inefficient, but it should shuffle the whole DataFrame.
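
If an explicit random order is needed rather than a redistribution of rows across partitions, one option is to tag every row with a random key and sort by it. The following is only a minimal sketch, not a tested recipe: `permute` is a hypothetical helper name, `seed` is an assumed parameter, and `hc` is taken to be the HiveContext used above.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.hive.HiveContext
import scala.util.Random

// Hypothetical helper (not part of Spark): returns the rows of `df`
// ordered by a random key drawn per row.
def permute(hc: HiveContext, df: DataFrame, seed: Long): DataFrame = {
  val shuffled = df.rdd
    .mapPartitionsWithIndex { (idx, rows) =>
      // One generator per partition, seeded from the caller's seed
      // plus the partition index so partitions draw different keys.
      val rng = new Random(seed + idx)
      rows.map((row: Row) => (rng.nextDouble(), row))
    }
    .sortByKey() // global sort on the random key
    .values      // drop the key, keep the rows
  hc.createDataFrame(shuffled, df.schema)
}
```

Usage would look like `permute(hc, df, 42L).show(100)`. The sort is a full shuffle, so this is likely no cheaper than the coalesce approach.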


Thanks,
Peter Rudenko
On 2015-06-01 22:49, Cesar Flores wrote:

I would like to know the best approach to randomly permute a DataFrame. I have tried:
df.sample(false, 1.0, x).show(100)
where x is the seed. However, it gives the same result no matter the value of x (it only gives different results when the fraction is smaller than 1.0). I have also tried:
hc.createDataFrame(df.rdd.repartition(100), df.schema)
which appears to be a random permutation. Can someone confirm that the last line is in fact a random permutation, or point me to a better approach?
Thanks!!!!
--
Cesar Flores
