[
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108412#comment-14108412
]
jian wang commented on DATAFU-21:
---------------------------------
Hi, I have updated the review request: https://reviews.apache.org/r/25009/.
Please check whether it is accessible.
I will provide the sampling statistics for probabilities 0.01 and 0.5 in the
experiment doc.
The purpose of the baseline algorithm is to compare, at the same sampling
probability, how often items of different weights are sampled by the baseline
and by this algorithm. The baseline algorithm uses the formula
score(j) = U ^ (1.0/w(j)), the same one used in weighted reservoir sampling,
where U is a uniform random number in (0, 1) and w(j) is the weight of item j.
The baseline algorithm does not use acceptance and rejection. The logic can be
found in the method "generateBaselineWeightDistribution()".
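To make the scoring formula concrete, here is a minimal Python sketch of a
score-based baseline. It is an illustration only, not the code in the patch:
the function name baseline_weighted_sample, the choice to keep the
round(p * n) highest-scoring items, and the rng parameter are all assumptions
made for this example.

```python
import random


def baseline_weighted_sample(items, weights, p, rng=None):
    """Keep the round(p * n) items with the largest score U ** (1 / w).

    Hypothetical helper: the selection rule (top-k by score) is an
    assumption for illustration, not taken from the patch.
    """
    if len(items) != len(weights):
        raise ValueError("items and weights must have the same length")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = rng or random.Random()
    k = round(p * len(items))  # target sample size for probability p

    scored = []
    for item, w in zip(items, weights):
        if w <= 0:
            raise ValueError("weights must be positive")
        u = rng.random()  # uniform random number, independent per item
        scored.append((u ** (1.0 / w), item))  # score(j) = U ^ (1.0 / w(j))

    # A larger weight pushes the score towards 1, so heavier items rank higher.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]
```

No item is accepted or rejected on its own here; the sample is decided only
after every item has been scored.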
In each plot, the x-axis is the item weight and the y-axis is the number of
items of that weight that were sampled over repeated experiments. We would like
to overlay the two distributions in one plot to check two things. First, for
items with the same weight, whether the probability of being sampled is almost
identical in the baseline algorithm and in our algorithm. Second, whether items
with higher weights have a higher probability of being sampled than items with
lower weights, and by how much. So if the green area and the black area have
almost the same shape and largely overlap, that suggests our algorithm may
perform at least as well as the baseline algorithm.
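The comparison described above can be sketched in Python as follows. This is a
hedged illustration under stated assumptions: the names
sampled_count_by_weight and overlap_ratio are hypothetical, the sampler is any
callable that returns the indices of the sampled items, and the overlap measure
(shared histogram area divided by the larger total area) is only one possible
way to quantify "a big overlap"; it is not taken from the experiment doc.

```python
from collections import Counter


def sampled_count_by_weight(weights, sampler, trials):
    """Count, per item weight, how many items were sampled over all trials.

    `sampler` is a hypothetical callable: it takes the list of weights and
    returns the indices of the items it sampled in one experiment.
    """
    counts = Counter()
    for _ in range(trials):
        for index in sampler(weights):
            counts[weights[index]] += 1  # x-axis: weight, y-axis: count
    return counts


def overlap_ratio(counts_a, counts_b):
    """Shared area of two histograms divided by the larger total area.

    Returns 1.0 for identical histograms and 0.0 for disjoint ones.
    """
    keys = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    total = max(sum(counts_a.values()), sum(counts_b.values()))
    return shared / total if total else 1.0
```

Calling sampled_count_by_weight once with the baseline sampler and once with
the new sampler gives the two histograms to plot, and overlap_ratio gives a
single number for how closely they agree.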
> Probability weighted sampling without reservoir
> -----------------------------------------------
>
> Key: DATAFU-21
> URL: https://issues.apache.org/jira/browse/DATAFU-21
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac OS, Linux
> Reporter: jian wang
> Assignee: jian wang
> Attachments: DATAFU-21.patch
>
>
> This issue tracks the investigation into a weighted sampler that does not
> use an internal reservoir.
> At present, SimpleRandomSample implements a good acceptance-rejection
> sampling algorithm for probability random sampling. The weighted sampler
> could reuse the simple random sample with a slight modification.
> The modification is this: the present simple random sample generates a
> uniform random number in (0, 1) as the random variable used to accept or
> reject an item. The weighted sampler could instead generate this random
> variable based on the item's weight, so that it still lies in (0, 1) and
> the random variables of different items remain independent of each other.
> Further thought and experiments are needed on the correctness of this
> solution and on how to implement it effectively.