[ 
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108797#comment-14108797
 ] 

jian wang commented on DATAFU-21:
---------------------------------

I am not sure what the correlation between item weights and the percentage of 
time items with that weight are selected is meant to show. Is it for 
performance evaluation? Items with higher weights do not necessarily take 
more time to get sampled. 

Bucketizing the weights and seeing how they are sampled may be useful. I am 
using the UDF itself in Hadoop jobs to see how the weights in different 
buckets are sampled.  I could provide you with the results in the doc if 
needed. The initial observation is that if there is a much larger number of 
items with lower weights in the initial population, there are still a lot of 
items with lower weights in the final sample.
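
A minimal sketch of that bucketed comparison, in Python rather than as a 
Hadoop job. The bucket edges, the toy population, and toy_weighted_sample are 
all hypothetical stand-ins for illustration; toy_weighted_sample is not the 
UDF from the patch, it simply accepts each item independently with a 
probability proportional to its weight.

```python
import random
from collections import Counter


def bucket_of(weight, edges):
    """Return the index of the first bucket whose upper edge is >= weight."""
    for i, upper in enumerate(edges):
        if weight <= upper:
            return i
    return len(edges)  # overflow bucket for weights above the last edge


def selection_rates(population, sample, edges):
    """Map bucket index -> (population count, sample count, selection rate)."""
    pop_counts = Counter(bucket_of(w, edges) for w in population)
    smp_counts = Counter(bucket_of(w, edges) for w in sample)
    return {
        b: (pop_counts[b], smp_counts[b], smp_counts[b] / pop_counts[b])
        for b in sorted(pop_counts)
    }


def toy_weighted_sample(weights, p, rng):
    """Hypothetical stand-in sampler: accept each weight independently with
    probability min(1, p * w / mean_weight)."""
    mean_w = sum(weights) / len(weights)
    return [w for w in weights if rng.random() < min(1.0, p * w / mean_w)]


rng = random.Random(42)
# Assumed toy population: many low-weight items, few high-weight items.
population = [1.0] * 9000 + [10.0] * 1000
sample = toy_weighted_sample(population, 0.05, rng)
rates = selection_rates(population, sample, edges=[1.0, 10.0])

for bucket, (n_pop, n_smp, rate) in rates.items():
    print(bucket, n_pop, n_smp, round(rate, 4))
```

Reporting both the sample count and the selection rate per bucket separates 
the two effects: a low-weight bucket can still contribute many items to the 
sample simply because it is large, even when its per-item selection rate is 
much lower.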

> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>         Attachments: DATAFU-21.patch
>
>
> This issue tracks the investigation into a weighted sampler that does not 
> use an internal reservoir. 
> At present, SimpleRandomSample implements a good acceptance-rejection 
> sampling algorithm for probability random sampling. The weighted sampler 
> could reuse the simple random sample with a slight modification.
> One such modification: the present simple random sample generates a 
> uniform random number in (0, 1) as the random variable used to accept or 
> reject an item. The weighted sampler may instead generate this random 
> variable based on the item's weight, such that it still lies in (0, 1) and 
> each item's random variable remains independent of the others.
> The correctness of this solution, and how to implement it efficiently, 
> need further thought and experimentation.
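
One known way to build a weight-dependent random variable that stays in 
(0, 1) is the key transformation u ** (1 / w) from Efraimidis-Spirakis 
weighted sampling. The sketch below pairs it with a fixed acceptance 
threshold so that no reservoir is needed. This is only an illustration under 
assumptions: the issue does not say this is the transformation used in the 
patch, and weighted_key, threshold_sample, and the threshold value are 
hypothetical names and parameters.

```python
import random


def weighted_key(weight, rng):
    """Return u ** (1 / weight) for a uniform u; assumes weight > 0.
    Larger weights push the key toward 1."""
    u = rng.random()
    while u == 0.0:  # random() draws from [0, 1); skip the 0 endpoint
        u = rng.random()
    return u ** (1.0 / weight)


def threshold_sample(items, threshold, rng):
    """Single pass over (item, weight) pairs: keep a pair when its
    independently drawn key exceeds the threshold."""
    return [(item, w) for item, w in items if weighted_key(w, rng) > threshold]


rng = random.Random(7)
# Assumed toy input: equal numbers of light and heavy items.
items = [("light", 1.0)] * 5000 + [("heavy", 5.0)] * 5000
accepted = threshold_sample(items, threshold=0.9, rng=rng)
light = sum(1 for name, _ in accepted if name == "light")
heavy = len(accepted) - light
print(light, heavy)
```

With a threshold q, an item of weight w is accepted with probability 
1 - q ** w, so heavier items are kept more often while each decision stays 
independent. The sample size is random in this sketch; choosing the threshold 
to hit a target size is left open here.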



--
This message was sent by Atlassian JIRA
(v6.2#6252)
