[ 
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075634#comment-14075634
 ] 

jian wang commented on DATAFU-21:
---------------------------------

Further investigation updates:

I am still considering rejecting items by applying Maurer's lemma and accepting
items by applying Bernstein's lemma, but switching to a different key
associated with each item, to simplify the computation and avoid
discretization.

The key I plan to use for each item is X(j) = U * f(w(j)), where U is a uniform
random variable on (0, 1) and f(w(j)) is a function of the weight w(j) whose
value lies in (0, 1), so X(j) also lies in (0, 1). The f I choose is
f(w(j)) = 1.0 / (1.0 + w(j) / upperBound), where upperBound is an upper limit
on w(j), approximated as max{w(j), j = 1 to n}; thus 1 / f(w(j)) = 1.0 +
w(j) / upperBound. Note: f(w(j)) is in [0.5, 1), 1 / f(w(j)) is in (1, 2], and
X(j) = U * f(w(j)) is in (0, 1).

The reason to use upperBound is to normalize the weights into (0, 1], so that
extremely large weights cannot make f(w(j)) extremely small and hence
1/f(w(j)) extremely large.
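
For concreteness, here is a minimal Java sketch of this key generation (the
class and method names are mine, not DataFu API; java.util.Random's
nextDouble() returns a value in [0, 1), which stands in for U here):

{code:java}
import java.util.Random;

// Minimal sketch of the proposed key generation; not DataFu code.
public class WeightedKey {
    private static final Random RNG = new Random();

    // f(w) = 1 / (1 + w / upperBound), which lies in [0.5, 1)
    // for weights w in (0, upperBound].
    static double f(double weight, double upperBound) {
        return 1.0 / (1.0 + weight / upperBound);
    }

    // X(j) = U * f(w(j)); nextDouble() gives U uniform on [0, 1).
    static double key(double weight, double upperBound) {
        return RNG.nextDouble() * f(weight, upperBound);
    }

    public static void main(String[] args) {
        double upperBound = 10.0;
        for (double w : new double[] {0.5, 2.0, 10.0}) {
            System.out.printf("w=%.1f  f(w)=%.3f  key=%.4f%n",
                    w, f(w, upperBound), key(w, upperBound));
        }
    }
}
{code}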

Apply Maurer's lemma:

find 0 < q1 < 1 such that we reject items whose key is greater than q1.

let Y(j) = 1 if X(j) < q1, and Y(j) = 0 otherwise (j = 1 to n)

Y(1), Y(2)... Y(n) are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
= Pr(U * f(w(j)) < q1)
= Pr(U < q1 / f(w(j)))
= q1 / f(w(j))

(The last step needs q1 / f(w(j)) <= 1; since f(w(j)) >= 0.5, this holds
whenever q1 <= 0.5.)

E(Y(j) ^ 2) = E(Y(j)) = q1 / f(w(j))

set Y = sum(Y(j), j = 1 to n), which is the number of items whose associated 
random key is less than q1,

we have E(Y) = sum(E(Y(j))) = q1 * sum(1/f(w(j)), j = 1 to n)

we want Pr{Y <= p * n} <= err, which can be rewritten as
Pr{E(Y) - Y >= E(Y) - p * n} <= err,

applying Maurer's lemma with t = E(Y) - p * n = q1 * sum(1/f(w(j)), j = 1 to n)
- p * n, we get

      log(Pr{E(Y) - Y >= t}) <= - t * t / (2 * sum(E(Y(j) ^ 2), j = 1 to n)) <= log(err)

Since sum(E(Y(j) ^ 2), j = 1 to n) = q1 * sum(1/f(w(j)), j = 1 to n), this is a
quadratic inequality in q1; taking the larger root (so that t > 0), we finally
get

q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { sum(1 / f(w(j)), j = 1 to n) }.

Using the f(w(j)) defined at the beginning of this comment,

sum(1 / f(w(j)), j = 1 to n) = sum(1.0 + w(j) / upperBound, j = 1 to n)
                             = n + sum(w(j), j = 1 to n) / upperBound

q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + sum(w(j), j = 1 to n) / upperBound }
   = p * ( n / { n + sum(w(j), j = 1 to n) / upperBound } )
     + { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + sum(w(j), j = 1 to n) / upperBound }

Since the factor n / { n + sum(w(j), j = 1 to n) / upperBound } is at most 1,
replacing the first term by p gives a slightly larger threshold that still
satisfies the bound. The final q1 we use is:

q1 = p + { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + sum(w(j), j = 1 to n) / upperBound }
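
As a sanity check, a hypothetical helper computing this threshold might look
like the following (the class, method, and parameter names are mine, not
DataFu API; note log(err) is negative for 0 < err < 1):

{code:java}
// Hypothetical helper, not DataFu API: computes the rejection threshold q1.
public class Q1Threshold {
    // p = target sampling probability, n = items seen so far,
    // err = allowed failure probability, sumWeights = sum of w(j),
    // upperBound = (approximate) maximum weight.
    static double computeQ1(double p, long n, double err,
                            double sumWeights, double upperBound) {
        double logErr = Math.log(err);              // negative for 0 < err < 1
        double denom = n + sumWeights / upperBound; // = sum(1 / f(w(j)))
        return p + (-logErr
                + Math.sqrt(logErr * logErr - 2 * p * n * logErr)) / denom;
    }

    public static void main(String[] args) {
        // e.g. p = 0.01, err = 0.01, 1e6 items of average weight 5, max 10
        System.out.println(computeQ1(0.01, 1_000_000L, 0.01, 5_000_000.0, 10.0));
    }
}
{code}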

Apply Bernstein's lemma:

find 0 < q2 < 1 such that we accept items whose key is less than q2.

let Z(j) = 1 if X(j) < q2, and Z(j) = 0 otherwise (j = 1 to n)

E(Z(j)) = Pr{X(j) < q2} = Pr{U * f(w(j)) < q2} = Pr { U < q2 / f(w(j)) } = q2 / 
f(w(j))

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) = Z(j) - q2 / f(w(j)) <= M = 1

theta(j)^2 = E(Z(j) ^ 2) - E(Z(j)) * E(Z(j)) <= E(Z(j) ^ 2) = q2 / f(w(j))

set Z = sum(Z(j), j = 1 to n), which is the number of items whose associated 
random key is less than q2,

we have E(Z) = sum(E(Z(j)), j = 1 to n) = sum(q2 / f(w(j)), j = 1 to n) = q2 * 
sum(1/f(w(j)), j = 1 to n)

we want Pr{Z >= p * n} <= err, which can be rewritten as
Pr{Z - E(Z) >= p * n - E(Z)} <= err,

let t = p * n - E(Z) = p * n - q2 * sum(1/f(w(j)), j = 1 to n) > 0.

applying Bernstein's lemma, we get

       log(Pr{Z - E(Z) >= t}) <= - t * t / (2 * sum(theta(j) ^ 2, j = 1 to n) + 2 * M * t / 3) <= log(err)

With sum(theta(j) ^ 2, j = 1 to n) <= q2 * sum(1/f(w(j)), j = 1 to n) and
M = 1, this is again a quadratic inequality; taking the smaller root, we
finally get

q2 <= { p * n - 2 * log(err) / 3 - 2 * sqrt( log(err) * log(err) - 9 * p * n * log(err) / 2 ) / 3 } / sum(1 / f(w(j)), j = 1 to n).
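
The corresponding hypothetical helper for the acceptance threshold (same
assumed parameters as computeQ1 above; for small n the numerator can be
negative, in which case no item is accepted outright yet):

{code:java}
// Hypothetical helper, not DataFu API: computes the acceptance threshold q2.
public class Q2Threshold {
    static double computeQ2(double p, long n, double err,
                            double sumWeights, double upperBound) {
        double logErr = Math.log(err);              // negative for 0 < err < 1
        double denom = n + sumWeights / upperBound; // = sum(1 / f(w(j)))
        return (p * n
                - 2.0 * logErr / 3.0
                - 2.0 * Math.sqrt(logErr * logErr - 4.5 * p * n * logErr) / 3.0)
               / denom;
    }

    public static void main(String[] args) {
        // same illustrative numbers as the q1 example above
        System.out.println(computeQ2(0.01, 1_000_000L, 0.01, 5_000_000.0, 10.0));
    }
}
{code}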

Observation:

(1) q1 decreases as more items are processed;

(2) q2 increases as more items are processed;

(3) f(w(j)) = 1.0 / (1.0 + w(j) / upperBound); if we use a global static
upperBound, f(w(j)) stays fixed. However, this requires us to pre-calculate
the upperBound or ask the user to supply an empirical value. Alternatively, we
could use an upperBound local to each mapper, combiner, and reducer; since
that local upperBound is non-decreasing, f(w(j)) is non-decreasing as more
items are processed from mapper to reducer.
                     
Here is a report of simulation experiment: 
https://docs.google.com/document/d/1oCNr7yoIFjwnXx8-24mmV4jY_9X4fdIWJzNW1TTMnqI/edit#

Conclusion:

(1) We can reject items whose keys are greater than q1 and accept items whose
keys are smaller than q2 if a global upper bound is given (see the sketch
after this list);

(2) We can reject items using q1 when a global upperBound is not given;

(3) The item sampling probability when using both q1 and q2 is close to that
when using only q1;

(4) We suggest that the user provide a global upper bound on the item weights.
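
To illustrate conclusion (1), here is a minimal end-to-end sketch under a
given global upper bound (all names are mine and purely illustrative, not
DataFu code): items with key above q1 are rejected, items with key below q2
are accepted, and everything in between is kept pending for a final pass.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: one-pass filtering with a given global
// upperBound, using the q1/q2 thresholds derived above.
public class WeightedSampleFilter {
    private final double p;          // target sampling probability
    private final double err;        // allowed failure probability
    private final double upperBound; // global upper bound on weights
    private final Random rng = new Random();

    private long n = 0;              // items seen so far
    private double sumWeights = 0.0; // running sum of weights

    final List<Double> accepted = new ArrayList<>(); // keys < q2
    final List<Double> pending = new ArrayList<>();  // keys in [q2, q1]

    WeightedSampleFilter(double p, double err, double upperBound) {
        this.p = p;
        this.err = err;
        this.upperBound = upperBound;
    }

    void offer(double weight) {
        n++;
        sumWeights += weight;
        double key = rng.nextDouble() / (1.0 + weight / upperBound); // U * f(w)
        double logErr = Math.log(err);
        double denom = n + sumWeights / upperBound;
        double q1 = p + (-logErr
                + Math.sqrt(logErr * logErr - 2 * p * n * logErr)) / denom;
        double q2 = (p * n - 2.0 * logErr / 3.0
                - 2.0 * Math.sqrt(logErr * logErr - 4.5 * p * n * logErr) / 3.0)
                / denom;
        if (key > q1) {
            return;                // reject outright
        } else if (key < q2) {
            accepted.add(weight);  // accept outright (q2 may be <= 0 early on)
        } else {
            pending.add(weight);   // undecided; resolve in a final pass
        }
    }

    public static void main(String[] args) {
        WeightedSampleFilter filter = new WeightedSampleFilter(0.01, 0.01, 10.0);
        Random rng = new Random();
        for (int i = 0; i < 100_000; i++) {
            filter.offer(rng.nextDouble() * 10.0); // weights in (0, 10)
        }
        System.out.printf("accepted=%d pending=%d%n",
                filter.accepted.size(), filter.pending.size());
    }
}
{code}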


> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler 
> without using an internal reservoir. 
> At present, SimpleRandomSample has implemented a good acceptance-rejection 
> sampling algorithm for probability random sampling. The weighted sampler 
> could reuse the simple random sample with slight modification.
> One slight modification is: the present simple random sample generates a 
> uniform random number in (0, 1) as the random variable used to accept or 
> reject an item. The weighted sampler may generate this random variable based 
> on the item's weight, such that the random number still lies in (0, 1) and 
> the items' random variables remain independent of each other.
> We need to think further about and experimentally verify the correctness of 
> this solution and how to implement it in an effective way.


