Re: Stratified sampling with DataFrames

Xiangrui Meng Tue, 19 May 2015 13:15:01 -0700

You need to convert the DataFrame to an RDD, call sampleByKey, and then
apply the schema back to create a DataFrame.


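// Key each row by its stratum column (column 0, an Int here), sample per key,
// then drop the keys and reapply the original schema.
val df: DataFrame = ...
val schema = df.schema
val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values
val sampled = sqlContext.createDataFrame(sampledRDD, schema)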
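
For reference, here is a slightly fuller sketch of the same steps wrapped in a
function. The name stratifiedSample and its parameters are hypothetical (not
part of Spark), and it assumes the stratum key is an Int in column 0 and that
the fractions map has an entry for every key that occurs in the data.

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

// Hypothetical helper for illustration only; not a Spark API.
// Assumes column 0 holds an Int stratum key and that `fractions`
// maps every key present in the data to a sampling rate in [0, 1].
def stratifiedSample(
    sqlContext: SQLContext,
    df: DataFrame,
    fractions: Map[Int, Double],
    seed: Long): DataFrame = {
  // Remember the schema so it can be reapplied after sampling.
  val schema = df.schema
  val sampledRDD = df.rdd
    .keyBy((r: Row) => r.getAs[Int](0)) // (stratum key, row) pairs
    .sampleByKey(withReplacement = false, fractions = fractions, seed = seed)
    .values                             // drop the keys, keep the rows
  sqlContext.createDataFrame(sampledRDD, schema)
}
```

For example, stratifiedSample(sqlContext, df, Map(0 -> 0.1, 1 -> 0.5), 42L)
would keep roughly 10% of the rows with key 0 and 50% of the rows with key 1.
Replacing sampleByKey with sampleByKeyExact (same arguments) should give
per-key sample sizes that match the fractions more exactly, at the cost of
extra passes over the data.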

Hopefully this will be much easier in 1.5.

Best,
Xiangrui

On Mon, May 11, 2015 at 12:32 PM, Karthikeyan Muthukumar
<mkarthiksw...@gmail.com> wrote:
> Hi,
> I'm in Spark 1.3.0 and my data is in DataFrames.
> I need operations like sampleByKey(), sampleByKeyExact().
> I saw the JIRA "Add approximate stratified sampling to DataFrame"
> (https://issues.apache.org/jira/browse/SPARK-7157).
> That's targeted for Spark 1.5. Until that comes through, what's the easiest
> way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on
> DataFrames?
> Thanks & Regards
> MK
>
