[
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108052#comment-14108052
]
Matthew Hayes commented on DATAFU-21:
-------------------------------------
I've looked at your simulation results. The algorithm appears to do a good job
of guaranteeing that the number of items in the output matches p*n. For a low
sampling probability like 0.17, the amount of data sent to the reducer is also
cut down substantially, while for a higher probability like 0.77 the algorithm
doesn't save us much work. I'm curious what the results would be for sampling
probabilities like 0.01 and 0.5. Also, how should we interpret the plots that
compare our algorithm against the baseline? I don't understand what they
represent. I'll start taking a look at your simulation code.
> Probability weighted sampling without reservoir
> -----------------------------------------------
>
> Key: DATAFU-21
> URL: https://issues.apache.org/jira/browse/DATAFU-21
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac OS, Linux
> Reporter: jian wang
> Assignee: jian wang
> Attachments: DATAFU-21.patch
>
>
> This issue tracks the investigation into a weighted sampler that does not
> use an internal reservoir.
> At present, SimpleRandomSample implements a good acceptance-rejection
> sampling algorithm for probability-based random sampling. The weighted
> sampler could reuse the simple random sample with a slight modification.
> The modification is this: the present simple random sample generates a
> uniform random number in (0, 1) as the random variable used to accept or
> reject an item. The weighted sampler could instead generate this random
> variable based on the item's weight, so that it still lies in (0, 1) and the
> random variables of different items remain independent of each other.
> The correctness of this solution, and how to implement it effectively, need
> further thought and experimentation.
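
For illustration only, here is one known way to turn a uniform draw into a
weight-dependent random variable that still lies in (0, 1): the
Efraimidis-Spirakis key u^(1/w). The issue does not say that this is the
transformation the patch uses, so treat it as a swapped-in technique. The
function names weighted_key and weighted_sample are hypothetical and are not
part of DataFu, and the sketch keeps the k largest keys rather than applying
the acceptance-rejection thresholds of SimpleRandomSample.

```python
import random


def weighted_key(u, weight):
    """Map a uniform draw u in (0, 1) to a weight-dependent key in (0, 1).

    Assumption: Efraimidis-Spirakis key u ** (1 / weight), not necessarily
    the transformation used by the DATAFU-21 patch.
    """
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    if weight <= 0:
        raise ValueError("weight must be positive")
    return u ** (1.0 / weight)


def weighted_sample(items, k, rng):
    """Return the values of the k items with the largest keys.

    items is an iterable of (value, weight) pairs. Each item gets its own
    independent uniform draw, so the keys are independent across items.
    """
    keyed = []
    for value, weight in items:
        u = rng.random()
        while u == 0.0:  # random() may return 0.0; the key needs u > 0
            u = rng.random()
        keyed.append((weighted_key(u, weight), value))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [value for _, value in keyed[:k]]
```

A heavier item tends to receive a key closer to 1, so it is more likely to
survive any cutoff applied to the keys. For example,
weighted_sample([("a", 1.0), ("b", 10.0)], 1, random.Random(0)) returns "b"
far more often than "a" over repeated draws.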
--
This message was sent by Atlassian JIRA
(v6.2#6252)