[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587387#comment-15587387 ]

Matthew Hayes commented on DATAFU-16:
-------------------------------------

I don't think the exponential jump version got added.

> weighted reservoir sampling with exponential jumps UDF
> ------------------------------------------------------
>
>                 Key: DATAFU-16
>                 URL: https://issues.apache.org/jira/browse/DATAFU-16
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac, Linux
> pig-0.11
>            Reporter: jian wang
>            Assignee: jian wang
>            Priority: Minor
>         Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted reservoir sampling algorithm with exponential jumps. Investigation is tracked in https://github.com/linkedin/datafu/issues/80. This task is part of an experiment with different weighted sampling algorithms.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586085#comment-15586085 ]

Eyal Allweil commented on DATAFU-16:
------------------------------------

It looks like this got added - can this issue be closed?
[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897828#comment-13897828 ]

jian wang commented on DATAFU-16:
---------------------------------

I have updated WeightedSamplingCorrectnessTests.java; it now contains a simulated perf test. The output of the test follows.

[testng] *** Running reservoirExpJPerfTest ***
[testng] Output:
[testng] accumulateDuration accumulateExpJDuration
[testng] 8563 1563

accumulateDuration: test duration for weighted sampling without exp jump in accumulate mode
accumulateExpJDuration: test duration for weighted sampling with exp jump
The unit is milliseconds.
[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895791#comment-13895791 ]

jian wang commented on DATAFU-16:
---------------------------------

Matt,

Do you think we should go ahead and implement the exponential jump only for the accumulate-based model? For algebraic, we would still use the weighted reservoir sampling without exponential jump.

The good part of introducing the exp jump: it could improve job performance, especially when there is a lot of data to process, without sacrificing much sampling precision (per-item sampling probability is close to w/sum(w)).

The not-so-good part: the accumulate-based model may not be used as often as the algebraic one, so is it worthwhile to introduce this enhancement?
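
For readers following the thread, the performance argument above can be illustrated with a minimal Java sketch of weighted reservoir sampling with exponential jumps, in the style of the A-ExpJ scheme by Efraimidis and Spirakis. This is not the attached ScoredExpJmpReservoir.java: the class name, method names, and the use of String items are hypothetical, and the sketch only covers a single sequential stream (the accumulate-style case). The point it shows is that, once the reservoir is full, most items are skipped by subtracting their weight from a precomputed jump, without drawing a random number per item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical illustration only; not the code attached to DATAFU-16.
class ExpJumpReservoirSketch {
    /** An item paired with its random key u^(1/w). */
    static final class Scored {
        final String item;
        final double key;
        Scored(String item, double key) { this.item = item; this.key = key; }
    }

    private final int capacity;
    private final Random rng;
    private final PriorityQueue<Scored> heap; // min-heap: smallest key on top
    private double weightToSkip;              // weight left before the next replacement

    ExpJumpReservoirSketch(int capacity, Random rng) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
        this.rng = rng;
        this.heap = new PriorityQueue<Scored>(capacity, (a, b) -> Double.compare(a.key, b.key));
    }

    /** Feeds one weighted item from the stream. */
    void offer(String item, double weight) {
        if (!(weight > 0)) throw new IllegalArgumentException("weight must be positive");
        if (heap.size() < capacity) {
            // Fill phase: every item gets key u^(1/w), u uniform in (0, 1).
            heap.add(new Scored(item, Math.pow(openUniform(), 1.0 / weight)));
            if (heap.size() == capacity) drawJump();
            return;
        }
        // Jump phase: skip items until their cumulative weight reaches the jump.
        weightToSkip -= weight;
        if (weightToSkip > 0) return; // skipped without any random draw
        // This item replaces the current minimum; its key is drawn so that it
        // exceeds the old threshold: u uniform in [T^w, 1), key = u^(1/w).
        double lower = Math.pow(heap.peek().key, weight);
        double u = lower + (1.0 - lower) * rng.nextDouble();
        heap.poll();
        heap.add(new Scored(item, Math.pow(u, 1.0 / weight)));
        drawJump();
    }

    /** Returns the items currently held in the reservoir (in no particular order). */
    List<String> sample() {
        List<String> out = new ArrayList<>();
        for (Scored s : heap) out.add(s.item);
        return out;
    }

    // Jump length X = log(r) / log(T), where T is the smallest key in the reservoir.
    private void drawJump() {
        weightToSkip = Math.log(openUniform()) / Math.log(heap.peek().key);
    }

    // Uniform draw in the open interval (0, 1).
    private double openUniform() {
        double u;
        do { u = rng.nextDouble(); } while (u == 0.0);
        return u;
    }

    public static void main(String[] args) {
        ExpJumpReservoirSketch reservoir = new ExpJumpReservoirSketch(3, new Random(42L));
        for (int i = 0; i < 1000; i++) {
            reservoir.offer("item" + i, 1.0 + (i % 10)); // weights 1..10
        }
        System.out.println(reservoir.sample());
    }
}
```

Because the jump state depends on everything seen so far in one stream, this sketch says nothing about how partial reservoirs from separate combiners should be merged, which is the open question for the algebraic mode raised in the comment.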
[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881922#comment-13881922 ]

jian wang commented on DATAFU-16:
---------------------------------

Experiment to test the correctness of the algorithm's per-item sampling probability, using the same methodology described in the original issue: https://github.com/linkedin/datafu/issues/80.

Using (weight / sum(weight)) as the ground truth of each item's sampling probability, calculate the average squared error of the algorithm's per-item sampling probability.

Using exponential jump in weighted reservoir sampling in accumulate mode seems OK, but it is not clear whether it is OK for algebraic mode, since it has a higher error than the other algorithms. [I am verifying the test code to see if something is wrong with it.]

testAccumulateExpJ() simulates Accumulate() for a data stream.
testAlgebraicExpJ() simulates Initial/Interm/Final using 100 combiners, where each Initial processes only one sample, which is the majority of real-world cases.

Experiment result:

err_ws:       1.174525314652248E-5
err_acc:      1.1883407123610779E-5
err_alg:      1.2130630748818072E-5
err_skip_acc: 1.2081897301243E-5
err_skip_alg: 1.3854125917604345E-4

err_ws is for the weighted sampling UDF
err_acc is for weighted reservoir sampling, accumulate
err_alg is for weighted reservoir sampling, algebraic
err_skip_acc is for weighted reservoir sampling with exponential jump, accumulate
err_skip_alg is for weighted reservoir sampling with exponential jump, algebraic

Please see the attached test code.
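
The error metric described in the comment above (average squared error against weight / sum(weight)) can be sketched in Java as follows. This is an illustration under stated assumptions, not the attached WeightedSamplingCorrectnessTests.java: the class and method names are hypothetical, and the sampler used here is a plain single-item weighted draw (keep the item with the largest key u^(1/w)), chosen because for a sample of size one the selection probability is exactly w/sum(w).

```java
import java.util.Random;

// Hypothetical illustration only; not the test code attached to DATAFU-16.
class SamplingErrorSketch {
    /** Mean over items of (hits_i / trials - w_i / sum(w))^2. */
    static double meanSquaredError(double[] weights, long[] hits, long trials) {
        double total = 0.0;
        for (double w : weights) total += w;
        double sumSquares = 0.0;
        for (int i = 0; i < weights.length; i++) {
            double truth = weights[i] / total;            // ground truth: w / sum(w)
            double estimate = (double) hits[i] / trials;  // observed selection frequency
            double diff = estimate - truth;
            sumSquares += diff * diff;
        }
        return sumSquares / weights.length;
    }

    /** Picks one index by keeping the largest key u^(1/w_i), u uniform in [0, 1). */
    static int sampleOne(double[] weights, Random rng) {
        int best = -1;
        double bestKey = -1.0;
        for (int i = 0; i < weights.length; i++) {
            double key = Math.pow(rng.nextDouble(), 1.0 / weights[i]);
            if (key > bestKey) { bestKey = key; best = i; }
        }
        return best;
    }

    /** Repeats the single-item draw and counts how often each index is chosen. */
    static long[] countHits(double[] weights, int trials, Random rng) {
        long[] hits = new long[weights.length];
        for (int t = 0; t < trials; t++) hits[sampleOne(weights, rng)]++;
        return hits;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 2.0, 3.0, 4.0};
        int trials = 200_000;
        long[] hits = countHits(weights, trials, new Random(7L));
        System.out.println("average squared error: " + meanSquaredError(weights, hits, trials));
    }
}
```

To compare algorithms the way the comment does, one would swap sampleOne for each sampler under test and compare the resulting errors; the number of trials and the weights above are arbitrary choices for the sketch.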