[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974807#comment-13974807 ]

jian wang edited comment on DATAFU-21 at 4/21/14 1:32 PM:
----------------------------------------------------------

Hi, sorry for the late response. 

So you mean we put the input w1, w2, ..., wN into different bins and compress 
the original input into {<average weight in the bin, number of elements in the 
bin>}, right? There are some questions I would like to clarify:

      (1) Do we need a separate map-reduce job to calculate q1 and q2 and then 
pass them as parameters to our UDF in subsequent jobs, or do we follow 
SimpleRandomSample and try to calculate q1 and q2 on the fly, so that we can 
accept or reject items as they stream in? I do not quite understand what you 
mean by solving the equation in a single reducer or on a single node. Could 
that approach suffer from scalability problems?

      (2) If we discretize the weights, we may need to consider the bin width 
and the level of accuracy we would like to achieve. Do we need to expose 
parameters for users to control these (though, as a user, I do not much like 
that), or do we rely on some rule, such as those described at 
http://en.wikipedia.org/wiki/Histogram? A rough sketch of the binning as I 
understand it follows below.
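
Here is a rough sketch (Java, since the DataFu UDFs are Java) of how I 
understand the compression: group the weights into k bins and keep only the 
average weight and the element count of each bin. The class and method names, 
and the use of Sturges' rule for the number of bins, are only placeholders of 
mine, not a decided design.

{code:java}
import java.util.Arrays;

public class WeightBinning {

    /** One compressed bin: the average weight of its members and their count. */
    public static class Bin {
        public double averageWeight;
        public long count;
    }

    /** Compresses a non-empty array of positive weights into k bins. */
    public static Bin[] compress(double[] weights) {
        int n = weights.length;
        int k = (int) Math.ceil(Math.log(n) / Math.log(2)) + 1;   // Sturges' rule
        double min = Arrays.stream(weights).min().getAsDouble();
        double max = Arrays.stream(weights).max().getAsDouble();
        double width = (max - min) / k;

        double[] sums = new double[k];
        long[] counts = new long[k];
        for (double w : weights) {
            // Map the weight to its bin; the last bin also takes w == max.
            int idx = (width == 0) ? 0 : Math.min((int) ((w - min) / width), k - 1);
            sums[idx] += w;
            counts[idx]++;
        }

        Bin[] bins = new Bin[k];
        for (int i = 0; i < k; i++) {
            bins[i] = new Bin();
            bins[i].count = counts[i];
            bins[i].averageWeight = counts[i] == 0 ? 0.0 : sums[i] / counts[i];
        }
        return bins;
    }
}
{code}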

I have tried the Newton method offline, as you suggested. When the input data 
is very large, say 100K items or more, Newton's method is slow to converge. I 
have also experimented with applying the rejection approach to the streaming 
case, but found that q1 keeps growing as we process more weights, which seems 
to defeat the purpose of using it in the streaming case.
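
For concreteness, below is a minimal sketch of a Newton iteration run over the 
compressed bins rather than the raw weights. The equation it solves, 
f(q) = sum_i count_i * min(1, q * avgWeight_i) - targetSampleSize = 0, is only 
a placeholder I chose to illustrate the iteration; it is not necessarily the 
exact equation behind q1 and q2.

{code:java}
public class ThresholdSolver {

    /**
     * Solves f(q) = sum_i binCount[i] * min(1, q * binAvgWeight[i]) - targetSampleSize = 0
     * with Newton's method. f is increasing, concave, and piecewise linear, so the
     * iteration approaches the root from the left and settles in a finite number of steps.
     */
    public static double solve(double[] binAvgWeight, long[] binCount, double targetSampleSize) {
        // Start from the root of the fully un-saturated case, q = target / totalWeight.
        double totalWeight = 0.0;
        for (int i = 0; i < binAvgWeight.length; i++) {
            totalWeight += binAvgWeight[i] * binCount[i];
        }
        double q = targetSampleSize / totalWeight;

        for (int iter = 0; iter < 100; iter++) {
            double f = -targetSampleSize;
            double fPrime = 0.0;
            for (int i = 0; i < binAvgWeight.length; i++) {
                double qw = q * binAvgWeight[i];
                if (qw < 1.0) {
                    f += binCount[i] * qw;                      // un-saturated bins
                    fPrime += binCount[i] * binAvgWeight[i];
                } else {
                    f += binCount[i];                           // saturated bins add a constant
                }
            }
            if (Math.abs(f) < 1e-9 || fPrime == 0.0) {
                break;                                          // converged, or no un-saturated bins left
            }
            q -= f / fPrime;                                    // Newton update
        }
        return q;
    }
}
{code}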



> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track the investigation of a weighted sampler that does 
> not use an internal reservoir. 
> At present, SimpleRandomSample implements a good acceptance-rejection sampling 
> algorithm for probability random sampling. The weighted sampler could reuse it 
> with a slight modification.
> One slight modification is: the present SimpleRandomSample generates a uniform 
> random number in (0, 1) as the random variable used to accept or reject an 
> item. The weighted sampler may instead generate this random variable based on 
> the item's weight, such that it still lies in (0, 1) and the random variables 
> of different items remain independent of each other.
> We still need to think further about and experiment with the correctness of 
> this solution, and how to implement it efficiently.
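
One concrete way to realize the weight-based (0, 1) random variable described 
above is the Efraimidis-Spirakis style key u^(1/w): draw u uniformly in (0, 1) 
and raise it to the power 1/w. The key still lies in (0, 1), the keys of 
different items are independent, and heavier items tend to receive larger 
keys. This particular form is an assumption for illustration; the issue text 
does not fix one.

{code:java}
import java.util.Random;

public class WeightedKey {

    private final Random random = new Random();

    /** Returns a (0, 1) acceptance variable for an item with the given positive weight. */
    public double keyFor(double weight) {
        double u = random.nextDouble();       // uniform in [0, 1)
        return Math.pow(u, 1.0 / weight);     // still in [0, 1); larger weight => larger key on average
    }
}
{code}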


