[
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075634#comment-14075634
]
jian wang commented on DATAFU-21:
---------------------------------
Further investigation updates:
We still consider rejecting items by applying Maurer's lemma and accepting
items by applying Bernstein's lemma, but we switch to a different key to
associate with each item, which simplifies the computation and avoids
discretization.
The new key I plan to use for each item is X(j) = U * f(w(j)), where U is a
uniform random variable on (0, 1) and f(w(j)) is a function of w(j) whose value
lies in (0, 1), so X(j) also lies in (0, 1). The f(w(j)) I choose is
f(w(j)) = 1.0 / (1.0 + w(j) / upperBound), where upperBound is the upper limit
of w(j), approximated as max{w(j), j = 1 to n}; thus
1/f(w(j)) = 1.0 + w(j) / upperBound. Note: f(w(j)) is in [0.5, 1), 1/f(w(j)) is
in (1, 2], and X(j) = U * f(w(j)) is in (0, 1).
The reason to use upperBound is to normalize the weights to the (0, 1] range:
without it, extremely large weights would make f(w(j)) extremely small and thus
1/f(w(j)) extremely large.
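As an illustration, here is a minimal Java sketch of the key generation (the
class and method names are hypothetical, not an existing DataFu API; upperBound
is assumed to be supplied by the caller):

    import java.util.Random;

    public class WeightedKeyGenerator {
        private final Random random = new Random();
        private final double upperBound; // upper limit of the weights, supplied by the caller

        public WeightedKeyGenerator(double upperBound) {
            this.upperBound = upperBound;
        }

        // f(w) = 1.0 / (1.0 + w / upperBound), which lies in [0.5, 1) for w in (0, upperBound]
        private double f(double weight) {
            return 1.0 / (1.0 + weight / upperBound);
        }

        // X = U * f(w), where U is uniform on [0, 1); the key X lies in [0, 1)
        public double keyFor(double weight) {
            return random.nextDouble() * f(weight);
        }
    }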
Apply Maurer's lemma:
find 0 < q1 < 1 so that we reject items whose key is greater than q1.
Let Y(j) = 1 if X(j) < q1, and Y(j) = 0 otherwise, for j = 1 to n.
Y(1), Y(2), ..., Y(n) are independent random variables.
E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
= Pr(U * f(w(j)) < q1)
= Pr(U < q1 / f(w(j)))
= q1 / f(w(j))
E(Y(j) ^ 2) = E(Y(j)) = q1 / f(w(j))
Set Y = sum(Y(j), j = 1 to n), which is the number of items whose associated
random key is less than q1. Then
E(Y) = sum(E(Y(j)), j = 1 to n) = q1 * sum(1/f(w(j)), j = 1 to n).
We want Pr{Y <= p * n} <= err, which can be rewritten as
Pr{E(Y) - Y >= E(Y) - p * n} <= err.
Let t = E(Y) - p * n = q1 * sum(1/f(w(j)), j = 1 to n) - p * n > 0.
Applying Maurer's lemma, it suffices to have
log(Pr{E(Y) - Y >= t}) <= - t * t / (2 * sum(E(Y(j) ^ 2), j = 1 to n)) <= log(err).
Since sum(E(Y(j) ^ 2), j = 1 to n) = q1 * sum(1/f(w(j)), j = 1 to n), this is a
quadratic inequality in q1 * sum(1/f(w(j)), j = 1 to n); solving it and taking
the larger root, we get
q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) }
/ { sum(1 / f(w(j)), j = 1 to n) }.
Using the f(w(j)) defined at the beginning of this comment,
sum(1 / f(w(j)), j = 1 to n) = sum(1.0 + w(j) / upperBound, j = 1 to n)
= n + sum(w(j), j = 1 to n) / upperBound
so
q1 >= { p * n - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) }
/ { n + sum(w(j), j = 1 to n) / upperBound }
= p * ( n / { n + sum(w(j), j = 1 to n) / upperBound } )
+ { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) }
/ { n + sum(w(j), j = 1 to n) / upperBound }
Since n / { n + sum(w(j), j = 1 to n) / upperBound } <= 1 and we only need q1
to be at least the right-hand side, we can conservatively replace the first
term with p. The final q1 we use is:
q1 = p + { - log(err) + sqrt(log(err) * log(err) - 2 * p * n * log(err)) }
/ { n + sum(w(j), j = 1 to n) / upperBound }
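A minimal Java sketch of this threshold (a hypothetical helper, not an existing
DataFu API; it assumes 0 < err < 1 so that log(err) is negative):

    // q1: items whose key is greater than this threshold can be rejected.
    // p: target sampling probability, n: number of items seen so far,
    // err: allowed failure probability, sumWeights: running sum of w(j),
    // upperBound: running (or global) upper limit of w(j).
    public static double rejectThreshold(double p, long n, double err,
                                         double sumWeights, double upperBound) {
        double logErr = Math.log(err);              // negative for 0 < err < 1
        double denom = n + sumWeights / upperBound; // = sum(1/f(w(j)), j = 1 to n)
        return p + (-logErr + Math.sqrt(logErr * logErr - 2.0 * p * n * logErr)) / denom;
    }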
Apply Bernstein's lemma:
find 0 < q2 < 1 so that we accept items whose key is less than q2.
Let Z(j) = 1 if X(j) < q2, and Z(j) = 0 otherwise, for j = 1 to n.
E(Z(j)) = Pr{X(j) < q2} = Pr{U * f(w(j)) < q2} = Pr { U < q2 / f(w(j)) } = q2 /
f(w(j))
E(Z(j) ^ 2) = E(Z(j))
Z(j) - E(Z(j)) = Z(j) - q2 / f(w(j)) <= M = 1
theta(j)^2 = E(Z(j) ^ 2) - E(Z(j)) * E(Z(j)) <= E(Z(j) ^ 2) = q2 / f(w(j))
Set Z = sum(Z(j), j = 1 to n), which is the number of items whose associated
random key is less than q2. Then
E(Z) = sum(E(Z(j)), j = 1 to n) = sum(q2 / f(w(j)), j = 1 to n)
= q2 * sum(1/f(w(j)), j = 1 to n).
We want Pr{Z >= p * n} <= err, which can be rewritten as
Pr{Z - E(Z) >= p * n - E(Z)} <= err.
Let t = p * n - E(Z) = p * n - q2 * sum(1/f(w(j)), j = 1 to n) > 0.
Applying Bernstein's lemma, it suffices to have
log(Pr{Z - E(Z) >= t}) <= - t * t / (2 * sum(theta(j) ^ 2, j = 1 to n)
+ 2 * M * t / 3) <= log(err).
Bounding sum(theta(j) ^ 2, j = 1 to n) by q2 * sum(1/f(w(j)), j = 1 to n) and
solving the resulting quadratic inequality, taking the smaller root, we get
q2 <= { p * n - 2 * log(err) / 3
- 2 * sqrt(log(err) * log(err) - 9 * p * n * log(err) / 2) / 3 }
/ sum(1 / f(w(j)), j = 1 to n).
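The corresponding Java sketch (again a hypothetical helper, under the same
assumptions as rejectThreshold above):

    // q2: items whose key is less than this threshold can be accepted.
    public static double acceptThreshold(double p, long n, double err,
                                         double sumWeights, double upperBound) {
        double logErr = Math.log(err);              // negative for 0 < err < 1
        double denom = n + sumWeights / upperBound; // = sum(1/f(w(j)), j = 1 to n)
        double root = Math.sqrt(logErr * logErr - 4.5 * p * n * logErr); // 9/2 = 4.5
        return (p * n - 2.0 * logErr / 3.0 - 2.0 * root / 3.0) / denom;
    }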
Observation:
(1) q1 decreases as more items are processed;
(2) q2 increases as more items are processed;
(3) f(w(j)) = 1.0 / (1.0 + w(j) / upperBound): if we use a global static
upperBound, f(w(j)) stays fixed. However, this requires us to pre-calculate
upperBound or ask the customer to supply an empirical number. Alternatively,
we could use an upperBound local to each mapper, combiner, and reducer; since
that local upperBound is non-decreasing, f(w(j)) will be non-decreasing as
more items are processed from mapper to reducer. A sketch of the resulting
accept/reject loop follows this list.
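Here is a minimal, self-contained Java sketch of the streaming accept/reject
loop under these observations (all names are hypothetical; items whose keys
fall between q2 and q1 are kept as pending and decided in a later pass, e.g.
in the combiner or reducer):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class StreamingWeightedSampler {
        private final double p, err;
        private final Random random = new Random();
        private double upperBound = 0.0; // local, non-decreasing; assumes positive weights
        private double sumWeights = 0.0;
        private long n = 0;
        private final List<double[]> accepted = new ArrayList<>(); // {weight, key}
        private final List<double[]> pending = new ArrayList<>();  // undecided items

        public StreamingWeightedSampler(double p, double err) {
            this.p = p;
            this.err = err;
        }

        public void offer(double weight) {
            n++;
            sumWeights += weight;
            upperBound = Math.max(upperBound, weight);
            double f = 1.0 / (1.0 + weight / upperBound);
            double key = random.nextDouble() * f;       // X(j) = U * f(w(j))
            double logErr = Math.log(err);              // negative for 0 < err < 1
            double denom = n + sumWeights / upperBound; // = sum(1/f(w(j)))
            double q1 = p + (-logErr
                    + Math.sqrt(logErr * logErr - 2.0 * p * n * logErr)) / denom;
            double q2 = (p * n - 2.0 * logErr / 3.0 - 2.0
                    * Math.sqrt(logErr * logErr - 4.5 * p * n * logErr) / 3.0) / denom;
            if (key >= q1) {
                return; // rejected; q1 only decreases, so the rejection stays valid
            } else if (key < q2) {
                accepted.add(new double[] {weight, key}); // accepted for good
            } else {
                pending.add(new double[] {weight, key});  // decide in a later pass
            }
        }
    }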
Here is a report of the simulation experiments:
https://docs.google.com/document/d/1oCNr7yoIFjwnXx8-24mmV4jY_9X4fdIWJzNW1TTMnqI/edit#
Conclusion:
(1) We can reject items whose keys are greater than q1 and accept items whose
keys are smaller than q2 when the global upper bound is given;
(2) We can reject items using q1 when the global upperBound is not given;
(3) The item sampling probability when using both q1 and q2 is close to that
when using only q1;
(4) We suggest the customer provide a global upper bound on the item weights.
> Probability weighted sampling without reservoir
> -----------------------------------------------
>
> Key: DATAFU-21
> URL: https://issues.apache.org/jira/browse/DATAFU-21
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac OS, Linux
> Reporter: jian wang
> Assignee: jian wang
>
> This issue is used to track investigation into finding a weighted sampler
> that does not use an internal reservoir.
> At present, SimpleRandomSample implements a good acceptance-rejection
> sampling algorithm for probability random sampling. The weighted sampler
> could reuse the simple random sample with a slight modification.
> One slight modification is: the present simple random sample generates a
> uniform random number lying in (0, 1) as the random variable used to accept
> or reject an item. The weighted sampler may generate this random variable
> based on the item's weight such that it still lies in (0, 1) and each item's
> random variable remains independent of the others.
> Further thought and experimentation are needed to verify the correctness of
> this solution and to implement it efficiently.
--
This message was sent by Atlassian JIRA
(v6.2#6252)