[
https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944325#comment-13944325
]
jian wang edited comment on DATAFU-21 at 4/1/14 1:26 PM:
---------------------------------------------------------
Some investigation updates:
Based on the theories from paper:
http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to
associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random
variable between (0,1). Then follow the thought of Random Sort, we sort the
items in ascending order based on X(j) and select the smallest k = p * n items.
Also as simple random sampling algorithm, we could also consider the
possibility of rejecting items applying Maurer's lemma and accepting items
applying Bernstein's lemma.
Apply Maurer's lemma:
we would like to find 0<q1<1, so that we reject items whose key is greater
than q1.
let Y(j) = 1 if (X(j) < q1)
= 0 otherwise
{Y(j), j = 1 to n} are independent random variables.
E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
= Pr(1 - pow( U, 1/w(j) ) < q1)
= Pr(1 - q1 < pow( U, 1/w(j) ))
= Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )
E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )
set Y = sum(Y(j), j = 1 to n),
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)
E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1
apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p)
* n - Q1, since t > 0, Q1 < (1 - p) * n. Solving the inequality, I get
abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n *
log(err) )
Further assume Q1 < (1 - p) * n + log(err), which also satisfies t > 0, get
Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n *
log(err) ) (1)
we could get q1 by solving (1)
Apply Berstein's lemma:
similar to applying Maurer's lemma, we could get a q2 so that we could accept
item whose key is smaller than q2, 0 <= q2 <= 1.
let Z(j) = 1 if X(j) < q2,
= 0 if X(j) >= q2
{Z(j), j = 1 to n} are independent random variables.
E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) * 0
= Pr(1 - pow( U, 1/w(j) ) < q2)
= 1 - pow(1 - q2, w(j) )
E(Z(j) ^ 2) = E(Z(j))
Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M
theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )
set Z = sum(Z(j), j = 1 to n)
Q2 = sum( pow(1 - q2, w(j) ) )
E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2
apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 -
(1 - p) * n, I get
Q2 >= n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (log(err)
- 9 * n * p / 2 ) ) (2)
we could get q2 by solving (2)
Questions:
(1) Please help comment on the above approach. Do they overall make sense?
(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would
like to seek some advice on it.
Some thoughts on how to resolve this, eg: solve (1)
Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) =
F(n, p, err) (1)
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n), 0 < q1 < 1
Remove the less-than inequality of (1), our target is to get an approximate q1
that makes Q1 close to F(n, p, err), we name it as q1_t.
we could observe that:
(1) the value of Q1 decreases with the increase of q1.
(2) F(n, p, err) >= Q1 >= sum( pow( 1-q1, wmax ) , j = 1 to n) = n * pow(
1 - q1, wmax), wmax is MAX( w(j), j = 1 to n ), we could get a lower bound of
q1, q1 >= 1 - pow( F(n, p, err) / n, 1/wmax), this lower bound decreases when
wmax and n increases. Then we start from the lower bound and try Newton–Raphson
method to approach a better q1 that makes the value of Q1 close to F(n, p,
err). After a certain number of iterations, we assign the final value of the
predicted q1 to q1_t.
The newton code would be like:
/*
* 1 iteration of Newton–Raphson
* The real-valued function is: f(q) = (1 - q) ^ w(0) + (1 - q) ^ w(1) + ... +
(1 - q) ^ w(n - 1) - F(n, p, err)
* the function's derivative is: f'(q) = -1 * [w(0) * (1 - q) ^ (w(0) - 1) +
(1 - q) ^ (w(1) - 1) + ... (1 - q) ^ (w(n - 1) - 1)]
* given an initial value of q, calculate a better value of q' = q - f(q) /
f'(q)
* param q: initial value of q
* param F: F(n, p, err)
* param weights: {w(j), j = 0 to n -1}
*/
static double newton(double q, double F, List<Double> weights)
{
double fq = 0;
double fdq = 0;
for (Double weight : weights) {
fq += Math.pow(1.0 - q, weight);
fdq += -1 * Math.pow(1.0 - q, weight - 1.0) * weight;
}
fq -= F;
return q - fq / fdq;
}
was (Author: king821221):
Some investigation updates:
Based on the theories from paper:
http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to
associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random
variable between (0,1). Then follow the thought of Random Sort, we sort the
items in ascending order based on X(j) and select the smallest k = p * n items.
Also as simple random sampling algorithm, we could also consider the
possibility of rejecting items applying Maurer's lemma and accepting items
applying Bernstein's lemma.
Apply Maurer's lemma:
we would like to find 0<q1<1, so that we reject items whose key is greater
than q1.
let Y(j) = 1 if (X(j) < q1)
= 0 otherwise
{Y(j), j = 1 to n} are independent random variables.
E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
= Pr(1 - pow( U, 1/w(j) ) < q1)
= Pr(1 - q1 < pow( U, 1/w(j) ))
= Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )
E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )
set Y = sum(Y(j), j = 1 to n),
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)
E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1
apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p)
* n - Q1, since t > 0, Q1 < (1 - p) * n. Solving the inequality, I get
abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n *
log(err) )
Further assume Q1 < (1 - p) * n + log(err), which also satisfies t > 0, get
Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n *
log(err) ) (1)
we could get q1 by solving (1)
Apply Berstein's lemma:
similar to applying Maurer's lemma, we could get a q2 so that we could accept
item whose key is smaller than q2, 0 <= q2 <= 1.
let Z(j) = 1 if X(j) < q2,
= 0 if X(j) >= q2
{Z(j), j = 1 to n} are independent random variables.
E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) * 0
= Pr(1 - pow( U, 1/w(j) ) < q2)
= 1 - pow(1 - q2, w(j) )
E(Z(j) ^ 2) = E(Z(j))
Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M
theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )
set Z = sum(Z(j), j = 1 to n)
Q2 = sum( pow(1 - q2, w(j) ) )
E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2
apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 -
(1 - p) * n, I get
Q2 >= n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (log(err)
- 9 * n * p / 2 ) ) (2)
we could get q2 by solving (2)
Questions:
(1) Please help comment on the above approach. Do they overall make sense?
(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would
like to seek some advice on it.
Some thoughts on how to resolve this, eg: solve (1)
Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) =
F(n, p, err) (1)
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n), 0 < q1 < 1
Remove the less-than inequality of (1), our target is to get an approximate q1
that makes Q1 close to F(n, p, err), we name it as q1_t.
we could observe that:
(1) the value of Q1 decreases with the increase of q1.
(2) F(n, p, err) >= Q1 >= sum( pow( 1-q1, wmax ) , j = 1 to n) = n * pow(
1 - q1, wmax), wmax is MAX( w(j), j = 1 to n ), we could get a lower bound of
q1, q1 >= 1 - pow( F(n, p, err) / n, 1/wmax), this lower bound decreases when
wmax and n increases. Then we start from the lower bound and try Newton–Raphson
method to approach a better q1 that makes the value of Q1 close to F(n, p,
err). After a certain number of iterations, we assign the final value of the
predicted q1 to q1_t.
The newton code would be like:
/*
* 1 iteration of Newton–Raphson
* The real-valued function is: f(q) = (1 - q) ^ w(0) + (1 - q) ^ w(1) + ... +
(1 - q) ^ w(n - 1) - F(n, p, err)
* the function's derivative is: f'(q) = -1 * [w(0) * (1 - q) ^ (w(0) - 1) +
(1 - q) ^ (w(1) - 1) + ... (1 - q) ^ (w(n - 1) - 1)]
* given an initial value of q, calculate a better value of q' = q - f(q) /
f'(q)
* param q: initial value of q
* param F: F(n, p, err)
* param weights: {w(j), j = 0 to n -1}
*/
static double newton(double q, double F, List<Double> weights)
{
double fq = 0;
double fdq = 0;
for(Double weight : weights) {
fq += Math.pow(1.0 - q, weight);
fdq += -1 * Math.pow(1.0 - q, weight - 1.0) * weight;
}
fq -= F;
return q - fq / fdq;
}
> Probability weighted sampling without reservoir
> -----------------------------------------------
>
> Key: DATAFU-21
> URL: https://issues.apache.org/jira/browse/DATAFU-21
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac OS, Linux
> Reporter: jian wang
> Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler
> without using internal reservoir.
> At present, the SimpleRandomSample has implemented a good
> acceptance-rejection sampling algo on probability random sampling. The
> weighted sampler could utilize the simple random sample with slight
> modification.
> One slight modification is: the present simple random sample generates a
> uniform random number lies between (0, 1) as the random variable to accept or
> reject an item. The weighted sample may generate this random variable based
> on the item's weight and this random number still lies between (0, 1) and
> each item's random variable remain independent between each other.
> Need further think and experiment the correctness of this solution and how to
> implement it in an effective way.
--
This message was sent by Atlassian JIRA
(v6.2#6252)