[ 
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075634#comment-14075634
 ] 

jian wang edited comment on DATAFU-21 at 8/23/14 3:08 PM:
----------------------------------------------------------

Further investigation updates:

We still consider rejecting items by applying Maurer's lemma and accepting 
items by applying Bernstein's lemma, but we switch to a different key 
associated with each item to simplify the computation and avoid discretization.

The different key I plan to use for each item is X(j) = U * f(w(j)), where U is 
a uniform random variable on (0, 1) and f(w(j)) is a function of the weight 
w(j) whose value lies in (0, 1), so the key X(j) also lies in (0, 1). The 
f(w(j)) I choose is f(w(j)) = 1.0 / (1.0 + w(j) / upperBound), where upperBound 
is the upper limit of w(j), approximated as max{w(j), j = 1 to n}; hence 
1 / f(w(j)) = 1.0 + w(j) / upperBound. Note: f(w(j)) lies in [0.5, 1), 
1 / f(w(j)) lies in (1, 2], and X(j) = U * f(w(j)) lies in (0, 1).

The reason to use upperBound is to normalize the weights into the (0, 1] range, 
so that an extremely large absolute weight cannot make f(w(j)) extremely small, 
and thus 1 / f(w(j)) extremely large, as could happen without upperBound.
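
As a concrete illustration of the key generation above, here is a minimal Java 
sketch; the class and method names are hypothetical, not existing DataFu code:

import java.util.Random;

// A minimal sketch of the key generation described above. The class and
// method names are illustrative, not part of DataFu.
public class WeightedKeyGenerator {
  private final Random random = new Random();
  private final double upperBound; // approximate upper limit of the weights w(j)

  public WeightedKeyGenerator(double upperBound) {
    this.upperBound = upperBound;
  }

  // f(w) = 1.0 / (1.0 + w / upperBound); lies in [0.5, 1) for 0 < w <= upperBound.
  public double f(double weight) {
    return 1.0 / (1.0 + weight / upperBound);
  }

  // X = U * f(w); nextDouble() is uniform on [0.0, 1.0), so X lies in [0, 1).
  public double key(double weight) {
    return random.nextDouble() * f(weight);
  }
}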

Apply Maurer's lemma:
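
For reference, the form of Maurer's lemma I am relying on below (for 
independent non-negative random variables Y(j) with finite second moments and 
S = sum(Y(j), j = 1 to n)):

\[
\Pr\{\,E[S] - S \ge t\,\} \;\le\; \exp\!\left( \frac{-t^2}{2 \sum_{j=1}^{n} E[Y_j^2]} \right), \qquad t > 0.
\]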

Find 0 < q1 < 1 such that we can reject items whose key is greater than q1.

Let Y(j) = 1 if X(j) < q1 and Y(j) = 0 otherwise, for j = 1 to n.

The Y(j) are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0 
           = Pr(U * f(w(j)) < q1) = Pr(U < q1 / f(w(j)))
           = q1 / f(w(j)), assuming q1 <= f(w(j)) (otherwise the probability is 
capped at 1)

E(Y(j) ^ 2) = E(Y(j)) = q1 / f(w(j))

Set Y = sum(Y(j), j = 1 to n), which is the number of items whose associated 
random key is less than q1.

Then E(Y) = sum(E(Y(j)), j = 1 to n) = q1 * sum(1/f(w(j)), j = 1 to n).

We want Pr{Y <= p * n} <= err, where p is the desired sampling probability and 
n is the number of items processed so far; this can be rewritten as 
Pr{E(Y) - Y >= E(Y) - p * n} <= err.

Applying Maurer's lemma, we get

      log(Pr{E(Y) - Y >= E(Y) - p * n}) <= - t * t / (2 * sum(E(Y(j) ^ 2), j = 
1 to n)) <= log(err),

where t = E(Y) - p * n = q1 * sum(1/f(w(j)), j = 1 to n) - p * n.

Since sum(E(Y(j) ^ 2), j = 1 to n) = E(Y), the outer inequality requires 
t * t >= -2 * log(err) * E(Y). Writing F = sum(1/f(w(j)), j = 1 to n), so that 
E(Y) = q1 * F and t = q1 * F - p * n, this becomes a quadratic inequality in 
q1 * F:

      (q1 * F) ^ 2 - 2 * (p * n - log(err)) * (q1 * F) + p * p * n * n >= 0.

Taking the larger root (we need t > 0, i.e. q1 * F > p * n), we finally get 
q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n 
* log(err)) } / { sum(1 / f(w(j)), j = 1 to n) }.

Using the f(w(j)) defined at the beginning of this comment, sum(1 / f(w(j)), j 
= 1 to n) = sum(1.0 + w(j) / upperBound, j = 1 to n) = n + sum(w(j), j = 1 to 
n) / upperBound. Hence

q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / 
{n + (sum(w(j), j = 1 to n) / upperBound) }
       = p * (n / {n + (sum(w(j), j = 1 to n) / upperBound) })
          +  { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } 
/ {n + (sum(w(j), j = 1 to n) / upperBound) }

Since n / {n + (sum(w(j), j = 1 to n) / upperBound) } <= 1, the first term is 
at most p, and replacing it by p only increases q1, which makes rejection more 
conservative. The final q1 we use is: p + ({ - log(err) + sqrt(log(err) * 
log(err) - 2 * p * n * log(err)) } / {n + (sum(w(j), j = 1 to n) / upperBound) })
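
A minimal Java sketch of this final q1 formula; the class name, method name, 
and parameters are illustrative, and it assumes 0 < err < 1 so that 
Math.log(err) < 0:

// Illustrative sketch of the final q1 (rejection threshold) formula above.
public final class MaurerThreshold {
  // p          : desired sampling probability, 0 < p < 1
  // n          : number of items processed so far
  // err        : allowed failure probability, 0 < err < 1 (so Math.log(err) < 0)
  // sumWeights : sum(w(j), j = 1 to n)
  // upperBound : (approximate) global upper bound of the weights
  public static double rejectThreshold(double p, long n, double err,
                                       double sumWeights, double upperBound) {
    double logErr = Math.log(err);               // negative
    double denom = n + sumWeights / upperBound;  // equals sum(1 / f(w(j)))
    double num = -logErr + Math.sqrt(logErr * logErr - 2.0 * p * n * logErr);
    return p + num / denom;                      // items with key > q1 are rejected
  }
}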

Apply Bernstein's lemma:
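
For reference, the one-sided Bernstein inequality I am relying on below (for 
independent Z(j) with Z(j) - E(Z(j)) <= M, S = sum(Z(j), j = 1 to n), and 
sigma^2 = sum(theta(j)^2, j = 1 to n) the sum of the variances):

\[
\Pr\{\,S - E[S] \ge t\,\} \;\le\; \exp\!\left( \frac{-t^2}{2\sigma^2 + \tfrac{2}{3} M t} \right), \qquad t > 0.
\]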

Find 0 < q2 < 1 such that we can accept items whose key is less than q2.

Let Z(j) = 1 if X(j) < q2 and Z(j) = 0 otherwise, for j = 1 to n.

The Z(j) are independent random variables.

E(Z(j)) = Pr{X(j) < q2} = Pr{U * f(w(j)) < q2} = Pr{U < q2 / f(w(j)) } = q2 / 
f(w(j))

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) = Z(j) - q2 / f(w(j)) <= M = 1

theta(j)^2 = E(Z(j) ^ 2) - E(Z(j)) * E(Z(j)) <= E(Z(j) ^ 2) = q2 / f(w(j))

Set Z = sum(Z(j), j = 1 to n), which is the number of items whose associated 
random key is less than q2.

Then E(Z) = sum(E(Z(j)), j = 1 to n) = sum(q2 / f(w(j)), j = 1 to n) = q2 * 
sum(1/f(w(j)), j = 1 to n).

We want Pr{Z >= p * n} <= err, which can be rewritten as 
Pr{Z - E(Z) >= p * n - E(Z)} <= err.

Let t = p * n - E(Z) = p * n - q2 * sum(1/f(w(j)), j = 1 to n) > 0.

Applying Bernstein's lemma, we get

       log(Pr{Z - E(Z) >= p * n - E(Z)}) <= - t * t / (2 * sum(theta(j) ^ 2, j 
= 1 to n) + 2 * M * t / 3) <= log(err).

Since sum(theta(j) ^ 2, j = 1 to n) <= q2 * F with F = sum(1/f(w(j)), j = 1 to 
n), and M = 1, it suffices that t * t >= -log(err) * (2 * q2 * F + 2 * t / 3). 
Substituting t = p * n - q2 * F and rearranging gives a quadratic inequality in 
q2 * F:

       (q2 * F) ^ 2 - (2 * p * n - 4 * log(err) / 3) * (q2 * F) + p * p * n * n 
+ 2 * log(err) * p * n / 3 >= 0.

Taking the smaller root (we need t > 0, i.e. q2 * F < p * n), we finally get 
q2 <= { p * n - 2 * log(err) / 3 - 2 * sqrt(log(err) * log(err) 
- 9 * p * n * log(err) / 2) / 3 } / sum(1 / f(w(j)), j = 1 to n).
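
A corresponding Java sketch for q2, with the same illustrative names and 
assumptions as the q1 sketch above:

// Illustrative sketch of the q2 (acceptance threshold) formula above.
public final class BernsteinThreshold {
  public static double acceptThreshold(double p, long n, double err,
                                       double sumWeights, double upperBound) {
    double logErr = Math.log(err);               // negative
    double denom = n + sumWeights / upperBound;  // equals sum(1 / f(w(j)))
    double num = p * n - 2.0 * logErr / 3.0
        - 2.0 * Math.sqrt(logErr * logErr - 4.5 * p * n * logErr) / 3.0;
    return num / denom;                          // items with key < q2 are accepted
  }
}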

Observations:
                     (1) q1 decreases as more items are processed;
                     (2) q2 increases as more items are processed;
                     (3) compared with a chosen baseline weighted sampling 
algorithm, the distribution of sampling probability relative to item weight is 
similar, but items with higher weights are sampled relatively more frequently 
in the baseline algorithm.

Here is a report of the simulation experiment: 
https://docs.google.com/document/d/1oCNr7yoIFjwnXx8-24mmV4jY_9X4fdIWJzNW1TTMnqI/edit#.
Please provide feedback on the experiment results.

Conclusion: 
                   (1) We can reject items whose keys are greater than q1 and 
accept items whose keys are smaller than q2, given the global upper bound of 
item weights (see the sketch after this list);

                   (2) At present we require the customer to supply the global 
upper bound of item weights as input. We may come back to see whether this is 
truly required or could be handled in another way;
 
                   (3) The sampling probability is relatively higher for items 
with higher weights than for items with lower weights, but the probability 
difference between the two is less pronounced than in our chosen baseline 
weighted sampling algorithm.
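
A sketch of the per-item decision in conclusion (1); the class, enum, and 
method names are hypothetical:

// Hypothetical per-item decision combining the two thresholds: reject when
// the item's key exceeds q1, accept when it is below q2, and otherwise keep
// the item as an undecided candidate to be resolved downstream.
public class ThresholdDecision {
  enum Decision { ACCEPT, REJECT, UNDECIDED }

  static Decision decide(double key, double q1, double q2) {
    if (key > q1) {
      return Decision.REJECT;    // safe to drop by the Maurer-based bound
    }
    if (key < q2) {
      return Decision.ACCEPT;    // safe to keep by the Bernstein-based bound
    }
    return Decision.UNDECIDED;   // resolved in a later pass (e.g. reducer side)
  }
}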





> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track the investigation into finding a weighted 
> sampler that does not use an internal reservoir. 
> At present, SimpleRandomSample implements a good acceptance-rejection 
> sampling algorithm for probability random sampling. The weighted sampler 
> could reuse the simple random sample with a slight modification.
> One such modification: the present simple random sample generates a uniform 
> random number in (0, 1) as the random variable used to accept or reject an 
> item. The weighted sampler may instead generate this random variable based 
> on the item's weight, such that it still lies in (0, 1) and each item's 
> random variable remains independent of the others.
> We need further thought and experiments on the correctness of this solution 
> and on how to implement it effectively.


