[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895791#comment-13895791 ]

jian wang commented on DATAFU-16:
---------------------------------

Matt,

Do you think we should go ahead and implement the exponential jump only for the accumulate-based model? For the algebraic model, we would keep using weighted reservoir sampling without the exponential jump.

The upside of introducing the exponential jump: it could improve job performance, especially when there is a lot of data to process, without sacrificing much sampling precision (the per-item sampling probability stays close to w/sum(w)).

The downside: the accumulate-based model is likely to be used less often than the algebraic one, so is it worthwhile to introduce this enhancement?

weighted reservoir sampling with exponential jumps UDF
------------------------------------------------------

Key: DATAFU-16
URL: https://issues.apache.org/jira/browse/DATAFU-16
Project: DataFu
Issue Type: New Feature
Environment: Mac, Linux pig-0.11
Reporter: jian wang
Priority: Minor
Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, WeightedSamplingCorrectnessTests.java

Create a weightedReservoirSampleWithExpJump UDF to implement the weighted reservoir sampling algorithm with exponential jumps.
Investigation is tracked in https://github.com/linkedin/datafu/issues/80.
This task is part of an experiment with different weighted sampling algorithms.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
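For readers unfamiliar with the technique discussed in the comment above, here is a minimal single-pass sketch of weighted reservoir sampling with exponential jumps, following the commonly described A-ExpJ scheme of Efraimidis and Spirakis. It is not the attached ScoredExpJmpReservoir.java and not a Pig UDF; the class name WeightedReservoirExpJ, the Scored helper, and the sample signature are hypothetical names chosen for illustration, and skipping non-positive weights is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical illustration class; not the UDF attached to DATAFU-16. */
class WeightedReservoirExpJ {

    /** An item paired with its random key u^(1/w). */
    static final class Scored<T> {
        final T item;
        final double key;
        Scored(T item, double key) { this.item = item; this.key = key; }
    }

    /** Draws up to k items without replacement, favoring larger weights. */
    static <T> List<T> sample(List<T> items, double[] weights, int k, Random rng) {
        List<T> result = new ArrayList<>();
        if (k <= 0) return result;

        // Min-heap on key: the head is the current threshold of the reservoir.
        PriorityQueue<Scored<T>> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Scored<T> s) -> s.key));
        double skip = 0.0; // total weight still to be skipped before the next replacement

        for (int i = 0; i < items.size(); i++) {
            double w = weights[i];
            if (w <= 0) continue; // assumption: non-positive weights are never sampled

            if (heap.size() < k) {
                // Fill phase: every item gets key u^(1/w).
                heap.add(new Scored<>(items.get(i), Math.pow(rng.nextDouble(), 1.0 / w)));
                if (heap.size() == k) {
                    skip = Math.log(rng.nextDouble()) / Math.log(heap.peek().key);
                }
                continue;
            }

            // Jump phase: no random draw per item, only a subtraction.
            skip -= w;
            if (skip <= 0) {
                // The item that crosses the jump replaces the minimum; its key is
                // drawn from (threshold^w, 1) so that it exceeds the old threshold.
                double tw = Math.pow(heap.peek().key, w);
                double r = tw + rng.nextDouble() * (1.0 - tw);
                heap.poll();
                heap.add(new Scored<>(items.get(i), Math.pow(r, 1.0 / w)));
                skip = Math.log(rng.nextDouble()) / Math.log(heap.peek().key);
            }
        }

        for (Scored<T> s : heap) result.add(s.item);
        return result;
    }
}
```

The point of the jump is that, once the reservoir is full, random numbers are drawn only when an item is actually inserted rather than once per input item, which is where the performance gain on large inputs would come from. Because the skip counter is sequential state, a sketch like this fits a one-pass, accumulate-style evaluation more naturally than one that merges partial results.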
[jira] [Commented] (DATAFU-28) Tests are too slow
[ https://issues.apache.org/jira/browse/DATAFU-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895817#comment-13895817 ]

jian wang commented on DATAFU-28:
---------------------------------

Matt, do you have stats of individual test cases for datafu.test.pig.stats.entropy and datafu.test.pig.sampling? Which ant option or tool do you use to measure the running duration of the test cases?

Tests are too slow
------------------

Key: DATAFU-28
URL: https://issues.apache.org/jira/browse/DATAFU-28
Project: DataFu
Issue Type: Bug
Reporter: Matthew Hayes

I ran the tests on my laptop and it took nearly 2 hours. The worst offenders are {{datafu.test.pig.sampling}}, {{datafu.test.pig.stats}}, and {{datafu.test.pig.stats.entropy}}.

||Package||Tests||Failures||Duration||Success rate||
|datafu.test.pig.bags|27|0|1m10.72s|100%|
|datafu.test.pig.geo|1|0|9.757s|100%|
|datafu.test.pig.hash|4|0|41.039s|100%|
|datafu.test.pig.linkanalysis|5|0|32.677s|100%|
|datafu.test.pig.random|1|0|11.789s|100%|
|datafu.test.pig.sampling|25|0|38m25.81s|100%|
|datafu.test.pig.sessions|7|0|2m50.67s|100%|
|datafu.test.pig.sets|9|0|5m46.70s|100%|
|datafu.test.pig.stats|52|0|26m11.98s|100%|
|datafu.test.pig.stats.entropy|40|0|31m30.97s|100%|
|datafu.test.pig.urls|1|0|1m35.24s|100%|
|datafu.test.pig.util|21|0|4m51.64s|100%|