[ 
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975820#comment-13975820
 ] 

Xiangrui Meng commented on DATAFU-21:
-------------------------------------

You need to have both bounds of the bin to compute q1 and q2.

1) You need two jobs. The first computes the thresholds and the second does the 
sampling. This is different from SRS. In the streaming case of SRS, when p is 
fixed, the interval [p1, p2] is always expanding, but I don't think this holds 
in the weighted case. As for scalability, if you make 1000 bins and there are 
1000 partitions, the reducer needs only a few MB. 
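
To make the scalability point concrete, here is a rough sketch of what the 
first job could aggregate. It only covers the per-bin bookkeeping; how the 
thresholds are then derived from the merged histogram is not specified in this 
comment and is left out. The names partition_histogram, merge_histograms, 
reducer_payload_bytes, and the bin_of callback are hypothetical, and 8 bytes 
per entry is an assumption.

```python
from collections import Counter


def partition_histogram(weights, bin_of):
    """Mapper side (hypothetical): per-bin item count and weight sum for one partition."""
    counts, totals = Counter(), Counter()
    for w in weights:
        b = bin_of(w)  # bin_of maps a weight to its bin id (assumed to be supplied)
        counts[b] += 1
        totals[b] += w
    return counts, totals


def merge_histograms(histograms):
    """Reducer side (hypothetical): add up the per-partition histograms bin by bin."""
    counts, totals = Counter(), Counter()
    for c, t in histograms:
        counts.update(c)  # Counter.update adds values instead of replacing them
        totals.update(t)
    return counts, totals


def reducer_payload_bytes(num_bins, num_partitions, bytes_per_entry=8):
    """Upper bound on reducer input size, assuming one fixed-size entry per bin per partition."""
    return num_bins * num_partitions * bytes_per_entry


if __name__ == "__main__":
    parts = [[0.5, 1.5, 3.0], [1.0, 2.5]]
    hists = [partition_histogram(p, bin_of=int) for p in parts]
    print(merge_histograms(hists))
    print(reducer_payload_bytes(1000, 1000) / 1e6, "MB")
```

With 1000 bins, 1000 partitions, and one 8-byte value per bin, the reducer 
sees at most 8 MB; carrying both a count and a weight sum doubles that.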

2) No. I think we can figure out some reasonable discretization as the default. 
Users should not need to be aware of it. For example, bins [2^i, 2^{i+1}] for 
i = -100, -99, ..., 100. You need to work out the math.
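
As an illustration of the power-of-two discretization, here is a minimal 
sketch of mapping a weight to its bin. The function names bin_index and 
bin_bounds are hypothetical, and two choices are my assumptions rather than 
part of the proposal: a weight equal to 2^i is placed in bin i, and weights 
outside the covered range are clamped to the first or last bin.

```python
import math

MIN_EXP, MAX_EXP = -100, 100  # exponent range taken from the example above


def bin_index(weight):
    """Return i such that 2^i <= weight < 2^(i+1), clamped to [MIN_EXP, MAX_EXP]."""
    if not weight > 0 or math.isinf(weight):
        raise ValueError("weight must be positive and finite")
    # frexp returns (m, e) with weight == m * 2**e and 0.5 <= m < 1,
    # so floor(log2(weight)) is exactly e - 1, with no rounding issues.
    _, e = math.frexp(weight)
    return max(MIN_EXP, min(MAX_EXP, e - 1))  # clamping is an assumption


def bin_bounds(i):
    """Lower and upper bound of bin i, i.e. (2^i, 2^(i+1))."""
    return math.ldexp(1.0, i), math.ldexp(1.0, i + 1)


if __name__ == "__main__":
    for w in (0.3, 1.0, 3.0, 1e-40):
        i = bin_index(w)
        print(w, i, bin_bounds(i))
```

Both bounds of the bin are available from the index alone, so nothing beyond 
the index needs to be stored per item.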

> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue tracks the investigation into a weighted sampler that does not 
> use an internal reservoir. 
> At present, SimpleRandomSample implements a good acceptance-rejection 
> sampling algorithm for probability random sampling. The weighted sampler 
> could reuse it with a slight modification.
> The modification is this: SimpleRandomSample currently generates a uniform 
> random number in (0, 1) as the random variable used to accept or reject an 
> item. The weighted sampler could instead generate this random variable based 
> on the item's weight, so that it still lies in (0, 1) and the random 
> variables of different items remain independent of each other.
> Further thought and experimentation are needed on the correctness of this 
> solution and on how to implement it efficiently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
