[ 
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108052#comment-14108052
 ] 

Matthew Hayes commented on DATAFU-21:
-------------------------------------

I've looked at the results of your simulation experiments.  The algorithm 
appears to do a good job of guaranteeing that the number of items in the 
output matches p*n.  Also, for a low sampling probability like 0.17, the amount 
of data sent to the reducer is cut down substantially.  For a higher 
probability like 0.77, the algorithm doesn't save us much work.  I'm curious 
what the results would be for sampling probabilities like 0.01 and 0.5.  By 
the way, how should we interpret the plots that compare our algorithm against 
the baseline?  I don't understand what they represent.  I'll start taking a 
look at your simulation code.  

> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>         Attachments: DATAFU-21.patch
>
>
> This issue tracks the investigation into a weighted sampler that does not 
> use an internal reservoir. 
> At present, SimpleRandomSample implements a good acceptance-rejection 
> sampling algorithm for probability random sampling. The weighted sampler 
> could reuse the simple random sample with a slight modification.
> The modification is this: the present simple random sample generates a 
> uniform random number in (0, 1) as the random variable used to accept or 
> reject an item. The weighted sampler could instead generate this random 
> variable based on the item's weight, such that it still lies in (0, 1) and 
> the random variables of different items remain independent of each other.
> The correctness of this solution, and how to implement it efficiently, need 
> further thought and experimentation.
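
One possible reading of the modification described in the issue is sketched 
below in Python. It is an illustration under stated assumptions, not the code 
in DATAFU-21.patch: the transform u ** (1 / weight) is borrowed from the 
well-known weighted-key technique of Efraimidis and Spirakis, and the names 
weighted_key, weighted_bernoulli_sample, and threshold are hypothetical. Each 
item gets an independent key in (0, 1), and an item with weight w is kept with 
probability 1 - threshold ** w, so heavier items are accepted more often. 
Unlike the sampler discussed in the comment, this sketch does not guarantee an 
output size of exactly p*n.

```python
import random


def weighted_key(weight, rng):
    """Map a uniform draw u in (0, 1) to u ** (1 / weight).

    The result also lies in (0, 1), up to floating-point rounding.
    This transform is an assumption for illustration, not taken from the patch.
    """
    if weight <= 0:
        raise ValueError("weight must be positive")
    u = rng.random()
    while u == 0.0:  # random() returns values in [0, 1); redraw on exact zero
        u = rng.random()
    return u ** (1.0 / weight)


def weighted_bernoulli_sample(items, threshold, seed=None):
    """Keep each (item, weight) pair whose key exceeds the threshold.

    Each item is decided independently, so no reservoir is held in memory.
    An item with weight w is kept with probability 1 - threshold ** w.
    """
    rng = random.Random(seed)
    return [item for item, weight in items if weighted_key(weight, rng) > threshold]
```

With threshold = 0.5, an item of weight 1 is kept about half the time, while 
an item of weight 5 is kept with probability 1 - 0.5 ** 5, roughly 0.97.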



--
This message was sent by Atlassian JIRA
(v6.2#6252)
