In my opinion aggregate + flatMap would be faster, since it makes fewer
passes over the data (one to count, one to filter). It would work like this:

import random

# x is the running [false_count, true_count]; y is one (key, flag) record
def agg(x,y):
    x[0] += 1 if not y[1] else 0
    x[1] += 1 if y[1] else 0
    return x

# Source data
rdd  = sc.parallelize(range(100000), 5)
rdd2 = rdd.map(lambda x: (x, random.choice([True, False]))).cache()

# Calculate counts for True and False
counts = rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1]))
# If filtering is needed
if counts[0]*10 > counts[1]:
    # Calculate sampling ratio
    prob0 = float(counts[1])/10.0 / float(counts[0])
    # Filter falses
    # Keep every True row; keep each False row with probability prob0
    rdd2  = rdd2.flatMap(
        lambda x: [x] if (x[1] or random.random() < prob0) else [])

# Count True and False again for validation - falses should be 10% of trues
rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1]))
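
To make the arithmetic behind prob0 explicit, here is a small sketch in plain Python (no Spark needed). The function name and the ratio parameter are my own, just for illustration; they are not part of any Spark API:

```python
def false_keep_probability(n_false, n_true, ratio=0.1):
    """Probability of keeping each False row so that the expected number
    of kept False rows is ratio * n_true (hypothetical helper)."""
    # Nothing to drop if the False rows are already at or below the target
    if n_false == 0 or n_false <= ratio * n_true:
        return 1.0
    return (ratio * n_true) / float(n_false)

# Example: 50,000 False rows and 50,000 True rows
p = false_keep_probability(50000, 50000)
expected_false = p * 50000  # expected number of False rows that survive
```

Since the filter is random, the validation counts will only be close to this expected value, not exactly equal to it.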

On Fri, Aug 28, 2015 at 6:39 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Filter into true rdd and false rdd. Union true rdd and sample of false rdd.
> On Aug 28, 2015 2:57 AM, "Gavin Yue" <yue.yuany...@gmail.com> wrote:
>
>> Hey,
>>
>>
>> I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and
>> randomly keep some Boolean:false rows.  And hope in the final result, the
>> negative ones could be 10 times more than positive ones.
>>
>>
>> What would be most efficient way to do this?
>>
>> Thanks,
>>
>>
>>
>>


-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: programme...@gmail.com
web:   http://0x0fff.com
