[jira] [Comment Edited] (DATAFU-21) Probability weighted sampling without reservoir

jian wang (JIRA) Tue, 08 Jul 2014 23:07:25 -0700

    [ 
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944325#comment-13944325
 ]


jian wang edited comment on DATAFU-21 at 7/9/14 6:05 AM:
---------------------------------------------------------

Some investigation updates:

Based on the theories from paper: 
http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to 
associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random 
variable between (0,1). Then follow the thought of Random Sort, we sort the 
items in ascending order based on X(j) and select the smallest k = p * n items.

Also as simple random sampling algorithm, we could also consider the 
possibility of rejecting items applying Maurer's lemma and accepting items 
applying Bernstein's lemma.

Apply Maurer's lemma:

we would like to find 0<q1<1, so that we reject items whose  key is greater 
than q1.

let Y(j) = 1 if (X(j) < q1)
           =  0 otherwise

{Y(j), j = 1 to n} are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
            = Pr(1 - pow( U, 1/w(j) ) < q1)
            = Pr(1 - q1 < pow( U, 1/w(j) ))
            = Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )

E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )

set Y = sum(Y(j), j = 1 to n),  

      Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)

E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1

apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p) 
* n - Q1, since t > 0, Q1 < (1 - p) * n.  Solving the inequality, I get 
        abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * 
log(err) )    (1)

we could get q1 by solving (1)

Apply Berstein's lemma:

similar to applying Maurer's lemma, we could get a q2 so that we could accept 
item whose key is smaller than q2, 0 <= q2 <= 1.

let  Z(j) = 1 if X(j) < q2, 
            = 0 if X(j) >= q2

{Z(j), j = 1 to n} are independent random variables.

E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) *  0
           = Pr(1 - pow( U, 1/w(j) ) < q2)
           = 1 - pow(1 - q2, w(j) )

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M

theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )

set Z = sum(Z(j),  j = 1 to n)

      Q2 = sum( pow(1 - q2, w(j) ) )

      E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2

apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 -  
(1 - p) * n,  since t > 0, Q2 > (1 - p) * n, I get  
    abs(Q2  - n * (1 - p) - 2 / 3 * log(err) ) = 
    Q2 - n * (1 - p) - 2 / 3 * log(err) <=  2 / 3 * sqrt( log(err) * log(err) - 
9 * n * p / 2 * log(err) )     (2)

we could get q2 by solving (2)

Questions:

(1) Please help comment on the above approach. Do they overall make sense? 

(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would 
like to seek some advice on it. 

Use Newton method to solve (1):

 abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * 
log(err) )  = F(n, p, err)     (1)

 Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n),  0 < q1 < 1

 Q1' = sum( -1 * w(j) * (1 - q1) ^ (w(j) - 1) ), 0 < q1 < 1,  Q1' means the 1st 
derivative of Q1.

Remove the less-than inequality of (1), our target is to get an approximate q1 
that makes abs( Q1 - (1 - p) * n - log(err) )  close to F(n, p, err), we name 
it as q1_t. 

Set function f(q1) = abs( Q1 - (1 - p) * n - log(err) )  - F(n, p, err). Use 
Newton–Raphson method to get q1_t so that f(q1_t) is close to 0. 

Use Newton to solve (2):
        Q2 - n * (1 - p) - 2 / 3 * log(err) <=  2 / 3 * sqrt( log(err) * 
log(err) - 9 * n * p / 2 * log(err) )   (2)

Q2 = sum( pow(1 - q2, w(j) ) )
Q2' = sum( -1 * w(j) * (1 - q2) ^ (w(j) - 1) ), 0 < q2 < 1,  Q2' means the 1st 
derivative of Q2.

Remove the less-than inequality of (2), our target is to get an approximate q2 
that makes  f(q2) = Q2 - n * (1 - p) - 2 / 3 * log(err) - 2 / 3 * sqrt( 
log(err) * log(err) - 9 * n * p / 2 * log(err) )  is close to 0.





was (Author: king821221):
Some investigation updates:

Based on the theories from paper: 
http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to 
associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random 
variable between (0,1). Then follow the thought of Random Sort, we sort the 
items in ascending order based on X(j) and select the smallest k = p * n items.

Also as simple random sampling algorithm, we could also consider the 
possibility of rejecting items applying Maurer's lemma and accepting items 
applying Bernstein's lemma.

Apply Maurer's lemma:

we would like to find 0<q1<1, so that we reject items whose  key is greater 
than q1.

let Y(j) = 1 if (X(j) < q1)
           =  0 otherwise

{Y(j), j = 1 to n} are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
            = Pr(1 - pow( U, 1/w(j) ) < q1)
            = Pr(1 - q1 < pow( U, 1/w(j) ))
            = Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )

E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )

set Y = sum(Y(j), j = 1 to n),  

      Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)

E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1

apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p) 
* n - Q1, since t > 0, Q1 < (1 - p) * n.  Solving the inequality, I get 
        abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * 
log(err) )    (1)

we could get q1 by solving (1)

Apply Berstein's lemma:

similar to applying Maurer's lemma, we could get a q2 so that we could accept 
item whose key is smaller than q2, 0 <= q2 <= 1.

let  Z(j) = 1 if X(j) < q2, 
            = 0 if X(j) >= q2

{Z(j), j = 1 to n} are independent random variables.

E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) *  0
           = Pr(1 - pow( U, 1/w(j) ) < q2)
           = 1 - pow(1 - q2, w(j) )

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M

theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )

set Z = sum(Z(j),  j = 1 to n)

      Q2 = sum( pow(1 - q2, w(j) ) )

      E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2

apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 -  
(1 - p) * n, I get  
    abs(Q2  - n * (1 - p) - 2 / 3 * log(err) ) <=  2 / 3 * sqrt( log(err) * 
log(err) - 9 * n * p / 2 * log(err) )     (2)

we could get q2 by solving (2)

Questions:

(1) Please help comment on the above approach. Do they overall make sense? 

(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would 
like to seek some advice on it. 

Some thoughts on how to resolve this, eg: solve (1)

 abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * 
log(err) )  = F(n, p, err)     (1)

 Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n),  0 < q1 < 1

 Q1' = sum( -1 * w(j) * (1 - q1) ^ (w(j) - 1) ), 0 < q1 < 1,  Q1' means the 1st 
derivative of Q1.

Remove the less-than inequality of (1), our target is to get an approximate q1 
that makes abs( Q1 - (1 - p) * n - log(err) )  close to F(n, p, err), we name 
it as q1_t. 

Set function f(q1) = abs( Q1 - (1 - p) * n - log(err) )  - F(n, p, err). Use 
Newton–Raphson method to get q1_t so that f(q1_t) is close to 0. 




> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler 
> without using internal reservoir. 
> At present, the SimpleRandomSample has implemented a good 
> acceptance-rejection sampling algo on probability random sampling. The 
> weighted sampler could utilize the simple random sample with slight 
> modification.
> One slight modification is:  the present simple random sample generates a 
> uniform random number lies between (0, 1) as the random variable to accept or 
> reject an item. The weighted sample may generate this random variable based 
> on the item's weight and this random number still lies between (0, 1) and 
> each item's random variable remain independent between each other.
> Need further think and experiment the correctness of this solution and how to 
> implement it in an effective way.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (DATAFU-21) Probability weighted sampling without reservoir

Reply via email to