[
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974807#comment-13974807
]
jian wang commented on DATAFU-21:
---------------------------------
Hi,sorry for the late response.
So you mean we put the input w1, w2, ....wN into different bins and compress
the original input into {<average weight in the bin, number of elements in the
bin>}, right? There are some questions I would like to clarify:
(1) Do we need a separate map-reduce job to calculate the q1 and q2 and
use the q1 and q2 as parameters to our UDF in subsequent jobs, or we follow the
SimpleRandomSample to try to calculate q1, q2 on the fly to see if we could
accept or reject items as they stream in? I do not quite catch what you mean by
solving equation in a single reducer or on a single node. This way could we
suffer from scalability problem?
(2) If we would like to discretize the weights, we may need to consider
the width of bin and the level of accuracy we would like to achieve. Do we need
to expose parameter for users to control these, though as a customer, I do not
quite like this, or we leverage some rules, such as defined in
http://en.wikipedia.org/wiki/Histogram?
> Probability weighted sampling without reservoir
> -----------------------------------------------
>
> Key: DATAFU-21
> URL: https://issues.apache.org/jira/browse/DATAFU-21
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac OS, Linux
> Reporter: jian wang
> Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler
> without using internal reservoir.
> At present, the SimpleRandomSample has implemented a good
> acceptance-rejection sampling algo on probability random sampling. The
> weighted sampler could utilize the simple random sample with slight
> modification.
> One slight modification is: the present simple random sample generates a
> uniform random number lies between (0, 1) as the random variable to accept or
> reject an item. The weighted sample may generate this random variable based
> on the item's weight and this random number still lies between (0, 1) and
> each item's random variable remain independent between each other.
> Need further think and experiment the correctness of this solution and how to
> implement it in an effective way.
--
This message was sent by Atlassian JIRA
(v6.2#6252)