[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075634#comment-14075634 ]

jian wang edited comment on DATAFU-21 at 7/28/14 4:19 PM:
----------------------------------------------------------

Further investigation updates:

We still consider the possibility of rejecting items by applying Maurer's lemma 
and accepting items by applying Bernstein's lemma, but we switch to a different 
key associated with each item, to simplify the computation and avoid 
discretization.

The new key I plan to use for each item is X(j) = U * f(w(j)), where U is a 
uniform random variable on (0, 1) and f(w(j)) is a function of w(j) whose value 
lies in (0, 1), so X(j) is a variable that also lies in (0, 1). The f(w(j)) I 
choose is f(w(j)) = 1.0 / (1.0 + w(j) / upperBound), where upperBound is the 
upper limit of w(j), approximated as max{w(j), j = 1 to n}; hence 1/f(w(j)) = 
1.0 + w(j) / upperBound. Note: f(w(j)) is in [0.5, 1), 1/f(w(j)) is in (1, 2], 
and X(j) = U * f(w(j)) is in [0, 1).

The reason to use upperBound is to normalize the weights to the (0, 1] range, 
so that extremely large absolute weights cannot make f(w(j)) extremely small, 
and thus 1/f(w(j)) extremely large, as could happen without upperBound.
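
As an illustration only, here is a minimal sketch of generating such a key per 
item, assuming upperBound is approximated by the running maximum as described 
above (the class and method names are hypothetical, not from any patch):

{code:java}
import java.util.Random;

// Hypothetical sketch of the per-item key X(j) = U * f(w(j)).
public class WeightedKeyGenerator {
    private final Random random = new Random();
    // Running approximation of upperBound = max{w(j), j = 1 to n}.
    private double upperBound = Double.MIN_VALUE;

    // f(w) = 1 / (1 + w / upperBound), which lies in [0.5, 1) for 0 < w <= upperBound.
    private double f(double weight) {
        return 1.0 / (1.0 + weight / upperBound);
    }

    // Key X = U * f(w), with U uniform on [0, 1), so X lies in [0, 1).
    public double nextKey(double weight) {
        upperBound = Math.max(upperBound, weight);
        return random.nextDouble() * f(weight);
    }
}
{code}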

Apply Maurer's lemma:

find 0<q1<1, so that we reject items whose key is greater than q1.

let Y(j) = 1 if (X(j) < q1),  Y(j) = 0 otherwise {j = 1 to n}

Y(j) are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0 
           = Pr (U * f(w(j)) < q1) = Pr(U < q1 / f(w(j)))
           = q1 / f(w(j))

E(Y(j) ^ 2) = E(Y(j)) = q1 / f(w(j))

set Y = sum(Y(j), j = 1 to n), which is the number of items whose associated 
random key is less than q1,

we have E(Y) = sum(E(Y(j))) = q1 * sum(1/f(w(j)), j = 1 to n)

We want Pr{Y <= p * n} <= err, which can be rewritten as 
Pr{E(Y) - Y >= E(Y) - p * n} <= err.

Applying Maurer's lemma, we get

      log(Pr{E(Y) - Y >= E(Y) - p * n}) <= - t * t / (2 * sum(E(Y(j) ^ 2), j = 1 to n)) <= log(err),

where t = E(Y) - p * n = q1 * sum(1/f(w(j)), j = 1 to n) - p * n.
                                              
Since sum(E(Y(j) ^ 2), j = 1 to n) = E(Y) = q1 * sum(1/f(w(j)), j = 1 to n), 
this is a quadratic inequality in q1; taking the larger root, we finally get 
q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / 
{ sum(1 / f(w(j)), j = 1 to n) }.
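
For reference, the quadratic step can be spelled out as follows (a sketch; the 
shorthands S, L, and x are mine, not from the lemma):

{code}
\[
S = \sum_{j=1}^{n} \frac{1}{f(w_j)}, \qquad L = -\log(\mathrm{err}) > 0, \qquad x = q_1 S.
\]
Maurer's bound with $t = x - pn$ and $\sum_j E(Y_j^2) = x$ requires
\[
\frac{t^2}{2x} \ge L \iff (x - pn)^2 \ge 2Lx \iff x^2 - 2(pn + L)x + p^2 n^2 \ge 0,
\]
and taking the larger root (we need $x > pn$ so that $t > 0$) gives
\[
x \ge pn + L + \sqrt{L^2 + 2pnL} = pn - \log(\mathrm{err}) + \sqrt{\log^2(\mathrm{err}) - 2pn\log(\mathrm{err})}.
\]
{code}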

Using the f(w(j)) defined at the beginning of this comment, sum(1 / f(w(j)), 
j = 1 to n) = sum(1.0 + w(j) / upperBound, j = 1 to n) = n + sum(w(j), j = 1 to 
n) / upperBound, hence

q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / 
{n + (sum(w(j), j = 1 to n) / upperBound) }
       = p * (n / {n + (sum(w(j), j = 1 to n) / upperBound) })
          +  { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } 
/ {n + (sum(w(j), j = 1 to n) / upperBound) }

The final q1 we use is: p + ({ - log(err) + sqrt(log(err) * log(err) - 2 * p * 
n * log(err)) } / {n + (sum(w(j), j = 1 to n) / upperBound) }). Since n / {n + 
(sum(w(j), j = 1 to n) / upperBound) } <= 1, this value is at least the 
right-hand side above, so it still satisfies the inequality.
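
A hedged sketch of computing this threshold, where sumInvF stands for 
sum(1/f(w(j)), j = 1 to n) = n + sum(w(j))/upperBound, and the class and method 
names are hypothetical:

{code:java}
public class Thresholds {
    // q1 = p + (-log(err) + sqrt(log(err)^2 - 2*p*n*log(err))) / sumInvF,
    // where sumInvF = sum(1/f(w(j)), j = 1 to n) = n + sum(w(j)) / upperBound.
    public static double rejectThresholdQ1(double p, long n, double err, double sumInvF) {
        double logErr = Math.log(err); // err < 1, so logErr < 0 and the sqrt argument is positive
        return p + (-logErr + Math.sqrt(logErr * logErr - 2.0 * p * n * logErr)) / sumInvF;
    }
}
{code}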

Apply Bernstein's lemma:

find 0<q2<1, so that we accept items whose key is less than q2.

let Z(j) = 1 if (X(j) < q2),  Z(j) = 0 otherwise {j = 1 to n}

Z(j) are independent random variables. 

E(Z(j)) = Pr{X(j) < q2} = Pr{U * f(w(j)) < q2} = Pr{U < q2 / f(w(j)) } = q2 / 
f(w(j))

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) = Z(j) - q2 / f(w(j)) <= M = 1

theta(j)^2 = E(Z(j) ^ 2) - E(Z(j)) * E(Z(j)) <= E(Z(j) ^ 2) = q2 / f(w(j))

set Z = sum(Z(j), j = 1 to n), which is the number of items whose associated 
random key is less than q2,

we have E(Z) = sum(E(Z(j)), j = 1 to n) = sum(q2 / f(w(j)), j = 1 to n) = q2 * 
sum(1/f(w(j)), j = 1 to n)

We want Pr{Z >= p * n} <= err, which can be rewritten as 
Pr{Z - E(Z) >= p * n - E(Z)} <= err.

let t = p * n - E(Z) = p * n - q2 * sum(1/f(w(j)), j = 1 to n) > 0.

Applying Bernstein's lemma, we get

       log(Pr{Z - E(Z) >= p * n - E(Z)}) <= - t * t / (2 * sum(theta(j) ^ 2, j 
= 1 to n) + 2 * M * t / 3) <= log(err)

Using sum(theta(j) ^ 2, j = 1 to n) <= q2 * sum(1/f(w(j)), j = 1 to n) and M = 
1, this again reduces to a quadratic inequality in q2; taking the smaller root, 
we finally get q2 <= { p * n - 2 * log(err) / 3 - 2 * sqrt( log(err) * log(err) 
- 9 * p * n * log(err) / 2) / 3 } / sum(1 / f(w(j)), j = 1 to n).
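
Analogously, a sketch of the acceptance threshold under the same assumptions 
(again, the method name is mine):

{code:java}
// Companion to rejectThresholdQ1 above, in the same hypothetical Thresholds class.
// q2 = (p*n - 2*log(err)/3 - (2/3) * sqrt(log(err)^2 - (9/2)*p*n*log(err))) / sumInvF
public static double acceptThresholdQ2(double p, long n, double err, double sumInvF) {
    double logErr = Math.log(err); // err < 1, so logErr < 0
    double numerator = p * n - 2.0 * logErr / 3.0
            - 2.0 / 3.0 * Math.sqrt(logErr * logErr - 4.5 * p * n * logErr);
    return numerator / sumInvF;
}
{code}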

Observations:

(1) q1 decreases as more items are processed;
(2) q2 increases as more items are processed;
(3) f(w(j)) = 1.0 / (1.0 + w(j) / upperBound); if we use a global static 
upperBound, f(w(j)) stays constant. However, this requires us to pre-calculate 
upperBound or ask the customer to input an empirical number. Alternatively, we 
could use an upperBound local to each mapper, combiner, and reducer; since that 
local upperBound is non-decreasing, f(w(j)) is non-decreasing as more items are 
processed from mapper to reducer;
(4) Compared with a chosen baseline weighted sampling algorithm, the 
distribution of sampling probability relative to item weight is similar, but 
items with higher weights have relatively higher sampled frequencies in the 
baseline weighted sampling algorithm.

Here is a report of the simulation experiment: 
https://docs.google.com/document/d/1oCNr7yoIFjwnXx8-24mmV4jY_9X4fdIWJzNW1TTMnqI/edit#
Please provide feedback on the experiment results.

Conclusions:

(1) We can reject items whose keys are greater than q1 and accept items whose 
keys are smaller than q2 if the global upper bound is given (see the decision 
sketch after this list);

(2) We can reject items using q1 alone when the global upperBound is not given;

(3) The item sampling probability when using both q1 and q2 is close to that 
when using only q1;

(4) The sampling probability is relatively higher for items with higher 
weights, but the probability difference between higher-weight and lower-weight 
items is less pronounced than in our chosen baseline weighted sampling 
algorithm;

(5) We suggest that the customer provide a global upper bound of the item 
weights.
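
A minimal sketch of the per-item decision implied by conclusions (1) and (2), 
assuming the key and threshold helpers sketched earlier in this comment (the 
enum and method names are hypothetical):

{code:java}
public class SampleDecision {
    public enum Decision { ACCEPT, REJECT, UNDECIDED }

    // Reject items whose key exceeds q1; accept items whose key falls below q2;
    // items in between remain undecided until q1 and q2 tighten (e.g. in the reducer).
    public static Decision decide(double key, double q1, double q2) {
        if (key > q1) {
            return Decision.REJECT;
        }
        if (key < q2) {
            return Decision.ACCEPT;
        }
        return Decision.UNDECIDED;
    }
}
{code}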



> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track the investigation of finding a weighted sampler 
> without using an internal reservoir. 
> At present, SimpleRandomSample has implemented a good acceptance-rejection 
> sampling algorithm for probability random sampling. The weighted sampler 
> could utilize the simple random sample with a slight modification.
> One slight modification is: the present simple random sample generates a 
> uniform random number lying between (0, 1) as the random variable used to 
> accept or reject an item. The weighted sampler may generate this random 
> variable based on the item's weight; the random number still lies between 
> (0, 1), and each item's random variable remains independent of the others.
> We need to think further about, and experiment with, the correctness of this 
> solution and how to implement it effectively.


