[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963731#comment-13963731 ]
Xiangrui Meng commented on DATAFU-21: ------------------------------------- Jian, the main problem is not solving the equation but the size of the equation. The left-hand side of the equation is monotone, so the simplest way to solve it is bi-section. However, the number of weights is the same as the number of records, which may be very large. This is why I said you need to discretize the weights and compress the data in my previous comment. Assume that there are 10000 weights in the same partition between [0.6, 0.600001]. You can treat all of them as 0.6 and remember the count 10000. Then, you compress the data size from 10000 to 2 (weight and count). In this way, you can solve the equation on a single reducer. > Probability weighted sampling without reservoir > ----------------------------------------------- > > Key: DATAFU-21 > URL: https://issues.apache.org/jira/browse/DATAFU-21 > Project: DataFu > Issue Type: New Feature > Environment: Mac OS, Linux > Reporter: jian wang > Assignee: jian wang > > This issue is used to track investigation on finding a weighted sampler > without using internal reservoir. > At present, the SimpleRandomSample has implemented a good > acceptance-rejection sampling algo on probability random sampling. The > weighted sampler could utilize the simple random sample with slight > modification. > One slight modification is: the present simple random sample generates a > uniform random number lies between (0, 1) as the random variable to accept or > reject an item. The weighted sample may generate this random variable based > on the item's weight and this random number still lies between (0, 1) and > each item's random variable remain independent between each other. > Need further think and experiment the correctness of this solution and how to > implement it in an effective way. -- This message was sent by Atlassian JIRA (v6.2#6252)