Dataframe random permutation?

2015-06-01 Thread Cesar Flores
I would like to know the best approach to randomly permute a
DataFrame. I have tried:

df.sample(false,1.0,x).show(100)

where x is the seed. However, it gives the same result no matter the value
of x (it only gives different results when the fraction is smaller than 1.0).
I have also tried:

hc.createDataFrame(df.rdd.repartition(100),df.schema)

which appears to be a random permutation. Can someone confirm that the
last line is in fact a random permutation, or point me to a better
approach?
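
The sample behaviour described above is what one would expect if sampling without replacement keeps each row independently with probability equal to the fraction: at 1.0 every row is kept, in its original order, whatever the seed. A minimal sketch in plain Scala of that idea follows. It is an illustration only, not Spark's actual implementation, and bernoulliSample is a hypothetical helper name.

```scala
import scala.util.Random

// Hypothetical helper (not a Spark API): keep each element independently
// with probability `fraction`, using a seeded generator.
def bernoulliSample[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
  val rng = new Random(seed)
  // nextDouble() lies in [0.0, 1.0), so fraction = 1.0 keeps every element.
  rows.filter(_ => rng.nextDouble() < fraction)
}

val rows = (1 to 10).toSeq
val a = bernoulliSample(rows, 1.0, seed = 1L)
val b = bernoulliSample(rows, 1.0, seed = 2L)
// With fraction 1.0 both results contain every row in the original order,
// so changing the seed cannot change the output.
```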


Thanks
-- 
Cesar Flores


Re: Dataframe random permutation?

2015-06-01 Thread Peter Rudenko

Hi Cesar,
try:

hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema)

It's a bit inefficient, but should shuffle the whole DataFrame.
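
Another way to think about permuting rows is to attach a random key to each row and sort by that key. The sketch below shows the idea on a local Scala collection. It is not taken from this thread, and shuffleByRandomKey is a hypothetical helper name.

```scala
import scala.util.Random

// Hypothetical helper: pair every row with a random key drawn from a
// seeded generator, sort by the key, then drop the key again.
def shuffleByRandomKey[A](rows: Seq[A], seed: Long): Seq[A] = {
  val rng = new Random(seed)
  rows
    .map(row => (rng.nextDouble(), row)) // attach a random sort key
    .sortBy(_._1)                        // order rows by that key
    .map(_._2)                           // keep only the original row
}

val original = (1 to 20).toSeq
val shuffled = shuffleByRandomKey(original, seed = 42L)
// `shuffled` holds the same elements as `original`, reordered by the keys.
```

On a DataFrame the analogous step would be sorting by a random column, for example df.orderBy(rand(seed)) with org.apache.spark.sql.functions.rand, assuming a Spark version that provides that function. A global sort is expensive on large data, so this is a sketch of the technique rather than a tuned solution.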

Thanks,
Peter Rudenko