[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075634#comment-14075634 ]
jian wang edited comment on DATAFU-21 at 7/28/14 4:19 PM:
----------------------------------------------------------

Further investigation updates:

I am still considering the possibility of rejecting items by applying Maurer's lemma and accepting items by applying Bernstein's lemma, but I switch to a different key to associate with each item, which simplifies the computation and avoids discretization.

The key I plan to use for each item is X(j) = U * f(w(j)), where U is a uniform random variable on (0, 1) and f(w(j)) is a function of w(j) taking values in (0, 1), so X(j) is a variable that also lies in (0, 1).

The f(w(j)) I choose is:

f(w(j)) = 1.0 / (1.0 + w(j) / upperBound)

where upperBound is the upper limit of w(j), approximated as max{w(j), j = 1 to n}, so 1/f(w(j)) = 1.0 + w(j) / upperBound.

Note: f(w(j)) is in [0.5, 1), 1/f(w(j)) is in (1, 2], and X(j) = U * f(w(j)) is in (0, 1). The reason to use upperBound is to normalize the weights into (0, 1], so that extremely large absolute weights cannot make f(w(j)) extremely small, and thus 1/f(w(j)) extremely large, as would happen without upperBound.

Apply Maurer's lemma: find 0 < q1 < 1 such that we reject items whose key is greater than q1.

Let Y(j) = 1 if X(j) < q1 and Y(j) = 0 otherwise, for j = 1 to n. The Y(j) are independent random variables, and

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0 = Pr(U * f(w(j)) < q1) = Pr(U < q1 / f(w(j))) = q1 / f(w(j))

(this holds provided q1 / f(w(j)) <= 1), and

E(Y(j)^2) = E(Y(j)) = q1 / f(w(j))

Set Y = sum(Y(j), j = 1 to n), the number of items whose associated random key is less than q1. Then

E(Y) = sum(E(Y(j)), j = 1 to n) = q1 * sum(1/f(w(j)), j = 1 to n)

We want Pr{Y <= p * n} <= err, which can be translated to Pr{E(Y) - Y >= E(Y) - p * n} <= err. Applying Maurer's lemma with t = E(Y) - p * n = q1 * sum(1/f(w(j)), j = 1 to n) - p * n, we get

log(Pr{E(Y) - Y >= E(Y) - p * n}) <= - t * t / (2 * sum(E(Y(j)^2), j = 1 to n)) <= log(err)

Solving for q1, we finally get

q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / sum(1/f(w(j)), j = 1 to n)

Using the f(w(j)) defined at the beginning of this comment,

sum(1/f(w(j)), j = 1 to n) = sum(1.0 + w(j) / upperBound, j = 1 to n) = n + sum(w(j), j = 1 to n) / upperBound

so

q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + sum(w(j), j = 1 to n) / upperBound }
    = p * ( n / { n + sum(w(j), j = 1 to n) / upperBound } ) + { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + sum(w(j), j = 1 to n) / upperBound }

Since n / { n + sum(w(j), j = 1 to n) / upperBound } <= 1, replacing the first term with p gives a slightly larger, and therefore still safe, threshold. The final q1 we use is:

q1 = p + { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) } / { n + sum(w(j), j = 1 to n) / upperBound }
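For illustration, here is a minimal Java sketch of the per-item key X(j) and the q1 rejection threshold computed from the formulas above. The class and method names are hypothetical, not from the DataFu codebase:

{code:java}
import java.util.Random;

/** Sketch (hypothetical) of the key generation and q1 rejection threshold above. */
public class WeightedSamplingBounds {
  private static final Random RAND = new Random();

  // X(j) = U * f(w(j)), with f(w) = 1 / (1 + w / upperBound)
  public static double key(double weight, double upperBound) {
    double f = 1.0 / (1.0 + weight / upperBound);
    return RAND.nextDouble() * f;
  }

  // q1 = p + (-log(err) + sqrt(log(err)^2 - 2*p*n*log(err)))
  //          / (n + weightSum / upperBound)
  // where n + weightSum / upperBound = sum(1/f(w(j)), j = 1 to n)
  public static double rejectThreshold(double p, double err, long n,
                                       double weightSum, double upperBound) {
    double logErr = Math.log(err); // err is in (0, 1), so logErr < 0
    double numerator = -logErr + Math.sqrt(logErr * logErr - 2.0 * p * n * logErr);
    return p + numerator / (n + weightSum / upperBound);
  }
}
{code}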
Apply Bernstein's lemma: find 0 < q2 < 1 such that we accept items whose key is less than q2.

Let Z(j) = 1 if X(j) < q2 and Z(j) = 0 otherwise, for j = 1 to n. The Z(j) are independent random variables, and

E(Z(j)) = Pr{X(j) < q2} = Pr{U * f(w(j)) < q2} = Pr{U < q2 / f(w(j))} = q2 / f(w(j))
E(Z(j)^2) = E(Z(j))
Z(j) - E(Z(j)) = Z(j) - q2 / f(w(j)) <= M = 1
theta(j)^2 = E(Z(j)^2) - E(Z(j)) * E(Z(j)) <= E(Z(j)^2) = q2 / f(w(j))

Set Z = sum(Z(j), j = 1 to n), the number of items whose associated random key is less than q2. Then

E(Z) = sum(E(Z(j)), j = 1 to n) = sum(q2 / f(w(j)), j = 1 to n) = q2 * sum(1/f(w(j)), j = 1 to n)

We want Pr{Z >= p * n} <= err, which can be translated to Pr{Z - E(Z) >= p * n - E(Z)} <= err. Let t = p * n - E(Z) = p * n - q2 * sum(1/f(w(j)), j = 1 to n) > 0. Applying Bernstein's lemma, we get

log(Pr{Z - E(Z) >= p * n - E(Z)}) <= - t * t / (2 * sum(theta(j)^2, j = 1 to n) + 2 * M * t / 3) <= log(err)

Solving for q2, we finally get

q2 <= { p * n - 2 * log(err) / 3 - 2 * sqrt(log(err) * log(err) - 9 * p * n * log(err) / 2) / 3 } / sum(1/f(w(j)), j = 1 to n)

Observations:
(1) q1 decreases as more items are processed;
(2) q2 increases as more items are processed;
(3) f(w(j)) = 1.0 / (1.0 + w(j) / upperBound): if we use a global static upperBound, f(w(j)) stays fixed. However, this requires us to pre-calculate the upperBound or to ask the customer for an empirical number. We could instead use an upperBound local to each mapper, combiner, and reducer; since the local upperBound is non-decreasing, f(w(j)) will be non-decreasing as more items are processed from mapper to reducer (see the accumulator sketch below);
(4) Compared with a chosen baseline weighted sampling algorithm, the distribution of sampling probability relative to item weight is similar, but items with higher weights are sampled relatively more frequently under the baseline algorithm.

Here is a report of the simulation experiment: https://docs.google.com/document/d/1oCNr7yoIFjwnXx8-24mmV4jY_9X4fdIWJzNW1TTMnqI/edit#. Please help provide feedback on the experiment results.

Conclusions:
(1) We can reject items whose keys are greater than q1 and accept items whose keys are smaller than q2 if the global upper bound is given (see the decision sketch below);
(2) We can reject items using q1 alone when the global upperBound is not given;
(3) The item sampling probability when using both q1 and q2 is close to that when using only q1;
(4) The sampling probability is relatively higher for items with higher weights, but the probability difference between higher-weight and lower-weight items is less significant than in our chosen baseline weighted sampling algorithm;
(5) We suggest the customer provide a global upper bound on the item weights.
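As an illustration of the q2 formula and of conclusion (1), here is a minimal Java sketch of the acceptance threshold and the combined three-way decision. It reuses the hypothetical WeightedSamplingBounds class above and is only a sketch of the technique, not the DataFu implementation:

{code:java}
/** Sketch (hypothetical) of the q2 acceptance threshold and the combined decision. */
public class AcceptRejectDecision {

  public enum Decision { ACCEPT, REJECT, UNDECIDED }

  // q2 = (p*n - (2/3)*log(err) - (2/3)*sqrt(log(err)^2 - (9/2)*p*n*log(err)))
  //        / (n + weightSum / upperBound)
  // using the same substitution sum(1/f(w(j))) = n + weightSum / upperBound as for q1
  public static double acceptThreshold(double p, double err, long n,
                                       double weightSum, double upperBound) {
    double logErr = Math.log(err); // err is in (0, 1), so logErr < 0
    double numerator = p * n - 2.0 * logErr / 3.0
        - 2.0 * Math.sqrt(logErr * logErr - 4.5 * p * n * logErr) / 3.0;
    return numerator / (n + weightSum / upperBound);
  }

  // Reject keys above q1, accept keys below q2; everything in between stays
  // undecided and must be resolved in a later stage (e.g. the reducer).
  public static Decision decide(double key, double q1, double q2) {
    if (key > q1) {
      return Decision.REJECT;
    }
    if (key < q2) {
      return Decision.ACCEPT;
    }
    return Decision.UNDECIDED;
  }
}
{code}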
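Observation (3) suggests maintaining the upperBound locally when no global bound is available. Below is a minimal sketch of such a per-mapper accumulator, again hypothetical, which keeps running statistics and recomputes both thresholds as items arrive; it glosses over the fact that keys generated under an earlier, smaller upperBound would need re-examination downstream:

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch (hypothetical) of a per-mapper accumulator illustrating observation (3):
 * upperBound is tracked locally and is non-decreasing as items are processed.
 */
public class LocalBoundSampler {
  private final double p;   // target sampling probability
  private final double err; // allowed failure probability
  private long n = 0;
  private double weightSum = 0.0;
  private double upperBound = 0.0;
  private final List<Double> undecided = new ArrayList<>();

  public LocalBoundSampler(double p, double err) {
    this.p = p;
    this.err = err;
  }

  public AcceptRejectDecision.Decision offer(double weight) {
    n++;
    weightSum += weight;
    upperBound = Math.max(upperBound, weight); // local, non-decreasing

    double key = WeightedSamplingBounds.key(weight, upperBound);
    double q1 = WeightedSamplingBounds.rejectThreshold(p, err, n, weightSum, upperBound);
    double q2 = AcceptRejectDecision.acceptThreshold(p, err, n, weightSum, upperBound);

    AcceptRejectDecision.Decision d = AcceptRejectDecision.decide(key, q1, q2);
    if (d == AcceptRejectDecision.Decision.UNDECIDED) {
      undecided.add(key); // forwarded to the combiner/reducer for a final decision
    }
    return d;
  }
}
{code}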
> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler that does not use an internal reservoir.
> At present, SimpleRandomSample implements a good acceptance-rejection algorithm for probability random sampling. The weighted sampler could reuse the simple random sample with a slight modification.
> The slight modification is: the present simple random sample generates a uniform random number in (0, 1) as the random variable used to accept or reject an item. The weighted sampler may generate this random variable based on the item's weight, such that it still lies in (0, 1) and the items' random variables remain independent of one another.
> We need to think further about, and experiment with, the correctness of this solution and how to implement it in an effective way.

-- This message was sent by Atlassian JIRA (v6.2#6252)