[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075634#comment-14075634 ]
jian wang edited comment on DATAFU-21 at 7/27/14 3:04 PM: ----------------------------------------------------------

Further investigation updates: still considering rejecting items via Maurer's lemma and accepting items via Bernstein's lemma, but switching to a different key to associate with each item, to simplify the computation and avoid discretization.

The key I plan to use for each item is:

X(j) = U * f(w(j))

where U is a uniform random variable on (0, 1) and f(w(j)) is a function of w(j) whose value lies in (0, 1), so X(j) also lies in (0, 1). The f I choose is:

f(w(j)) = 1.0 / (1.0 + w(j) / upperBound)

where upperBound is an upper limit of w(j), approximated as max{w(j), j = 1 to n}, so 1/f(w(j)) = 1.0 + w(j) / upperBound.

Note: f(w(j)) is in [0.5, 1), 1/f(w(j)) is in (1, 2], and X(j) = U * f(w(j)) is in (0, 1).

The reason for upperBound is to normalize the weights into the (0, 1] range; without it, extremely large weights would make f(w(j)) extremely small and thus 1/f(w(j)) extremely large.

Apply Maurer's lemma: find 0 < q1 < 1 such that we reject items whose key is greater than q1.

Let Y(j) = 1 if X(j) < q1, and 0 otherwise, for j = 1 to n. Y(1), Y(2), ..., Y(n) are independent random variables.
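As a small sketch of the key construction above (Python, illustrative names; not part of any patch), each item's key is a uniform draw scaled by f(w):

```python
import random

def item_key(weight, upper_bound):
    # f(w) = 1 / (1 + w / upperBound); for 0 < w <= upperBound, f(w) is in [0.5, 1)
    f = 1.0 / (1.0 + weight / upper_bound)
    # X = U * f(w), with U uniform on [0, 1), so X also lies in [0, 1)
    return random.random() * f

# Heavier items get systematically smaller keys, so any fixed threshold
# on the key favors keeping heavy items.
keys = [item_key(w, upper_bound=10.0) for w in (1.0, 5.0, 10.0)]
```

Note that a heavier item shrinks f(w) toward 0.5, so its key is stochastically smaller and it is more likely to fall under an acceptance threshold.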
E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0 = Pr(U * f(w(j)) < q1) = Pr(U < q1 / f(w(j))) = q1 / f(w(j))
E(Y(j)^2) = E(Y(j)) = q1 / f(w(j))

Set Y = sum(Y(j), j = 1 to n), the number of items whose associated random key is less than q1. Then

E(Y) = sum(E(Y(j)), j = 1 to n) = q1 * sum(1/f(w(j)), j = 1 to n)

We want Pr{Y <= p * n} <= err, which can be rewritten as Pr{E(Y) - Y >= E(Y) - p * n} <= err. Applying Maurer's lemma with t = E(Y) - p * n = q1 * sum(1/f(w(j)), j = 1 to n) - p * n > 0:

log(Pr{E(Y) - Y >= t}) <= - t * t / (2 * sum(E(Y(j)^2), j = 1 to n)) <= log(err)

Solving for q1, we finally get

q1 >= { p * n - log(err) + sqrt(log(err)^2 - 2 * p * n * log(err)) } / sum(1/f(w(j)), j = 1 to n).

Using the f(w(j)) defined at the beginning of this comment,

sum(1/f(w(j)), j = 1 to n) = sum(1.0 + w(j)/upperBound, j = 1 to n) = n + sum(w(j), j = 1 to n) / upperBound

so

q1 >= p * ( n / (n + sum(w(j))/upperBound) ) + { -log(err) + sqrt(log(err)^2 - 2 * p * n * log(err)) } / (n + sum(w(j))/upperBound)

Since n / (n + sum(w(j))/upperBound) <= 1, it is safe (conservative) to take the final q1 as:

q1 = p + { -log(err) + sqrt(log(err)^2 - 2 * p * n * log(err)) } / (n + sum(w(j))/upperBound)

Apply Bernstein's lemma: find 0 < q2 < 1 such that we accept items whose key is less than q2.
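The final q1 formula can be sketched in Python as follows (function and parameter names are illustrative, not from the patch):

```python
import math

def rejection_threshold_q1(p, err, n, weight_sum, upper_bound):
    """q1 = p + (-log(err) + sqrt(log(err)^2 - 2*p*n*log(err)))
               / (n + sum(w_j)/upperBound)"""
    log_err = math.log(err)  # negative, since 0 < err < 1
    denom = n + weight_sum / upper_bound  # equals sum(1/f(w_j))
    return p + (-log_err + math.sqrt(log_err ** 2 - 2.0 * p * n * log_err)) / denom

# q1 stays above p and shrinks toward p as more items are processed,
# since the sqrt term grows like sqrt(n) while the denominator grows like n.
q1_small = rejection_threshold_q1(p=0.01, err=0.01, n=10_000,
                                  weight_sum=10_000.0, upper_bound=1.0)
q1_large = rejection_threshold_q1(p=0.01, err=0.01, n=1_000_000,
                                  weight_sum=1_000_000.0, upper_bound=1.0)
```

Rejecting any item whose key exceeds q1 is then safe with failure probability at most err.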
Let Z(j) = 1 if X(j) < q2, and 0 otherwise, for j = 1 to n.

E(Z(j)) = Pr{X(j) < q2} = Pr{U * f(w(j)) < q2} = Pr{U < q2 / f(w(j))} = q2 / f(w(j))
E(Z(j)^2) = E(Z(j))
Z(j) - E(Z(j)) = Z(j) - q2 / f(w(j)) <= M = 1
theta(j)^2 = E(Z(j)^2) - E(Z(j))^2 <= E(Z(j)^2) = q2 / f(w(j))

Set Z = sum(Z(j), j = 1 to n), the number of items whose associated random key is less than q2. Then

E(Z) = sum(E(Z(j)), j = 1 to n) = sum(q2 / f(w(j)), j = 1 to n) = q2 * sum(1/f(w(j)), j = 1 to n)

We want Pr{Z >= p * n} <= err, which can be rewritten as Pr{Z - E(Z) >= p * n - E(Z)} <= err. Let t = p * n - E(Z) = p * n - q2 * sum(1/f(w(j)), j = 1 to n) > 0. Applying Bernstein's lemma:

log(Pr{Z - E(Z) >= t}) <= - t * t / (2 * sum(theta(j)^2, j = 1 to n) + 2 * M * t / 3) <= log(err)

Solving for q2, we finally get

q2 <= { p * n - 2 * log(err) / 3 - 2 * sqrt(log(err)^2 - 9 * p * n * log(err) / 2) / 3 } / sum(1/f(w(j)), j = 1 to n).

Observations:
(1) q1 decreases as more items are processed;
(2) q2 increases as more items are processed;
(3) f(w(j)) = 1.0 / (1.0 + w(j) / upperBound): if we use a global static upperBound, f(w(j)) stays static. However, this requires us to pre-compute the upperBound or ask the customer to supply an empirical value. We could instead use an upperBound local to each mapper, combiner, and reducer. Since that local upperBound is non-decreasing, f(w(j)) will be non-decreasing as more items are processed from mapper to reducer.
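The q2 bound admits the same kind of sketch (again Python with illustrative names; the 4.5 factor below is 9/2 from the formula above):

```python
import math

def acceptance_threshold_q2(p, err, n, weight_sum, upper_bound):
    """q2 <= ( p*n - 2*log(err)/3
               - (2/3) * sqrt(log(err)^2 - (9/2)*p*n*log(err)) )
             / (n + sum(w_j)/upperBound)"""
    log_err = math.log(err)  # negative, since 0 < err < 1
    s = n + weight_sum / upper_bound  # equals sum(1/f(w_j))
    numerator = (p * n - 2.0 * log_err / 3.0
                 - 2.0 * math.sqrt(log_err ** 2 - 4.5 * p * n * log_err) / 3.0)
    return numerator / s

# q2 sits below p; items with key < q2 can be accepted outright,
# items with key > q1 rejected, and only the band in between kept for later.
q2 = acceptance_threshold_q2(p=0.01, err=0.01, n=1_000_000,
                             weight_sum=1_000_000.0, upper_bound=1.0)
```

Since q2 < p < q1, the two thresholds bracket a shrinking band of undecided items as n grows.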
Here is a report of the simulation experiment: https://docs.google.com/document/d/1oCNr7yoIFjwnXx8-24mmV4jY_9X4fdIWJzNW1TTMnqI/edit#

Conclusions:
(1) We can reject items whose keys are greater than q1 and accept items whose keys are smaller than q2 if the global upper bound is given;
(2) We can reject items using q1 even when the global upperBound is not given;
(3) The item sampling probability when using both q1 and q2 is close to that of using only q1;
(4) Suggest that the customer provide a global upper bound on item weights.

> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                  Key: DATAFU-21
>                  URL: https://issues.apache.org/jira/browse/DATAFU-21
>              Project: DataFu
>           Issue Type: New Feature
>          Environment: Mac OS, Linux
>             Reporter: jian wang
>             Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler without using an internal reservoir.
> At present, SimpleRandomSample has implemented a good acceptance-rejection sampling algorithm for probability random sampling. The weighted sampler could utilize the simple random sample with slight modification.
> One slight modification is: the present simple random sample generates a uniform random number in (0, 1) as the random variable used to accept or reject an item. The weighted sample may generate this random variable based on the item's weight; this random number still lies in (0, 1), and each item's random variable remains independent of the others.
> Need further thought and experiments on the correctness of this solution and how to implement it in an effective way.

-- This message was sent by Atlassian JIRA (v6.2#6252)