[
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881922#comment-13881922
]
jian wang commented on DATAFU-16:
---------------------------------
This experiment tests how accurately each algorithm reproduces the expected
per-item sampling probability, using the methodology described in the
original issue: https://github.com/linkedin/datafu/issues/80.
Taking (weight / sum(weight)) as the ground truth for each item's sampling
probability, the average squared error of each algorithm's per-item
sampling probability is calculated.
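For illustration only, the error metric could be sketched as below. This is
not the attached test code: avg_squared_error, weighted_pick, the example
weights and the trial count are all hypothetical, and the sketch assumes a
sample size of one, where (weight / sum(weight)) is exactly the probability
of an item being selected.

```python
import random


def avg_squared_error(weights, sample_one, trials, seed=0):
    """Average squared error between observed and expected pick frequency.

    sample_one(weights, rng) is a hypothetical sampler that returns the
    index of the single item it selected.
    """
    rng = random.Random(seed)
    total = sum(weights)
    counts = [0] * len(weights)
    for _ in range(trials):
        counts[sample_one(weights, rng)] += 1
    # Ground truth for item i is weights[i] / total.
    errors = [(counts[i] / trials - w / total) ** 2
              for i, w in enumerate(weights)]
    return sum(errors) / len(weights)


def weighted_pick(weights, rng):
    """Reference sampler: one weighted draw from the standard library."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Illustrative weights only; not the data used in the experiment.
err = avg_squared_error([1.0, 2.0, 3.0, 4.0], weighted_pick, trials=20000)
```

Any of the samplers under test can be plugged in as sample_one, so the same
loop yields one error figure per algorithm.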
Exponential jumps in weighted reservoir sampling seem OK in accumulate mode,
but it is unclear whether they are OK in algebraic mode, where the error is
about ten times higher than for the other algorithms. [I am checking the
test code to see whether the problem is in the test itself.]
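For readers unfamiliar with the technique, here is a minimal sketch of
weighted reservoir sampling with exponential jumps in a single streaming
pass (the accumulate case), following the published A-ExpJ scheme of
Efraimidis and Spirakis. It is not the UDF's implementation: the function
names are hypothetical, weights are assumed to be positive and moderate in
size, and floating-point edge cases such as a key rounding to exactly 0 or 1
are not handled.

```python
import heapq
import math
import random


def _open_unit(rng):
    """Draw a uniform value strictly inside (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def weighted_reservoir_expj(items, m, rng):
    """Sample up to m items from an iterable of (item, weight) pairs."""
    if m < 1:
        raise ValueError("m must be at least 1")
    it = iter(items)
    heap = []  # min-heap of (key, seq, item); seq breaks ties between keys
    seq = 0
    # Fill the reservoir: each item gets key u ** (1 / weight).
    for item, w in it:
        heapq.heappush(heap, (_open_unit(rng) ** (1.0 / w), seq, item))
        seq += 1
        if len(heap) == m:
            break
    if len(heap) < m:
        return [item for _, _, item in heap]
    # Total weight to skip before the next replacement (the "jump").
    jump = math.log(_open_unit(rng)) / math.log(heap[0][0])
    for item, w in it:
        jump -= w
        if jump <= 0:
            # This item replaces the current minimum; its key is drawn
            # from (threshold ** w, 1) so it is above the old threshold.
            low = heap[0][0] ** w
            key = rng.uniform(low, 1.0) ** (1.0 / w)
            heapq.heapreplace(heap, (key, seq, item))
            seq += 1
            jump = math.log(_open_unit(rng)) / math.log(heap[0][0])
    return [item for _, _, item in heap]
```

The jump lets the sampler skip items without drawing a random number for
each one. That saving relies on seeing a long run of items in one pass.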
testAccumulateExpJ() simulates Accumulate() on a data stream.
testAlgebraicExpJ() simulates Initial/Intermediate/Final using 100
combiners, with each Initial processing only one sample, which is the case
in the majority of real-world runs.
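As a rough picture of what such a simulation looks like, the sketch below
runs an Initial/Intermediate/Final pipeline in which every Initial sees a
single item. It is an assumption-laden illustration, not the attached
testAlgebraicExpJ(): the function names and the round-robin assignment to
combiners are hypothetical, and it uses plain keyed merging (each item gets
key u ** (1 / weight) and the largest keys are kept) rather than exponential
jumps.

```python
import heapq
import random


def initial(item, weight, rng):
    """Initial stage: one item in, one (key, item) pair out."""
    u = 1.0 - rng.random()  # uniform in (0, 1]
    return [(u ** (1.0 / weight), item)]


def intermediate(partials, m):
    """Intermediate stage: merge partial results, keep the m largest keys."""
    merged = [pair for part in partials for pair in part]
    return heapq.nlargest(m, merged, key=lambda pair: pair[0])


def final(partials, m):
    """Final stage: merge once more and drop the keys."""
    return [item for _, item in intermediate(partials, m)]


def simulate_algebraic(items, m, rng, combiners=100):
    """Run the three stages over (item, weight) pairs."""
    buckets = [[] for _ in range(combiners)]
    for i, (item, w) in enumerate(items):
        # Hypothetical round-robin assignment of Initial outputs.
        buckets[i % combiners].append(initial(item, w, rng))
    partials = [intermediate(b, m) for b in buckets if b]
    return final(partials, m)
```

With one item per Initial, any skip state built up inside a single call has
no following items to apply to, which may be worth checking when comparing
the algebraic and accumulate results.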
Experiment results:
err_ws: 1.174525314652248E-5
err_acc: 1.1883407123610779E-5
err_alg: 1.2130630748818072E-5
err_skip_acc: 1.2081897301243E-5
err_skip_alg: 1.3854125917604345E-4
err_ws is for weighted sampling UDF
err_acc is for weighted reservoir sampling accumulate
err_alg is for weighted reservoir sampling algebraic
err_skip_acc is for weighted reservoir sampling exponential jump accumulate
err_skip_alg is for weighted reservoir sampling exponential jump algebraic
Please see the attached test code.
> weighted reservoir sampling with exponential jumps UDF
> ------------------------------------------------------
>
> Key: DATAFU-16
> URL: https://issues.apache.org/jira/browse/DATAFU-16
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac, Linux
> pig-0.11
> Reporter: jian wang
> Priority: Minor
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted
> reservoir sampling algorithm with exponential jumps. Investigation is tracked
> in https://github.com/linkedin/datafu/issues/80. This task is part of
> experiment of different weighted sampling algorithms.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)