[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895791#comment-13895791 ]
jian wang commented on DATAFU-16:
---------------------------------

Matt,

Do you think we should go ahead and implement the exponential jump only for the accumulate-based model? For the algebraic model, we would still use weighted reservoir sampling without the exponential jump.

The upside of introducing the exponential jump: it could improve job performance, especially when there is a lot of data to process, without sacrificing much sampling precision (the per-item sampling probability stays close to w/sum(w)).

The downside: the accumulate-based model is likely to be used less often than the algebraic one, so is it worthwhile to introduce this enhancement?

> weighted reservoir sampling with exponential jumps UDF
> ------------------------------------------------------
>
>                 Key: DATAFU-16
>                 URL: https://issues.apache.org/jira/browse/DATAFU-16
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac, Linux
> pig-0.11
>            Reporter: jian wang
>            Priority: Minor
>         Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java,
> WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted
> reservoir sampling algorithm with exponential jumps. Investigation is tracked
> in https://github.com/linkedin/datafu/issues/80. This task is part of an
> experiment with different weighted sampling algorithms.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
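
For readers unfamiliar with the technique under discussion, here is a minimal sketch of weighted reservoir sampling with exponential jumps, following the A-ExpJ scheme as commonly described by Efraimidis and Spirakis. It is not the attached ScoredExpJmpReservoir.java; the class name ExpJumpReservoirSketch, the method sample, and the choice to return input indices are all assumptions made for illustration. The point it tries to show is that the jump phase draws one random number per reservoir replacement rather than one per item, and that it relies on walking the items in sequence, which is presumably why it fits the accumulate-based model more naturally than the algebraic one.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical illustration class; not part of DataFu.
final class ExpJumpReservoirSketch {

    /** A sampled item: its index in the input and its random key u^(1/w). */
    private static final class Entry {
        final int index;
        final double key;
        Entry(int index, double key) { this.index = index; this.key = key; }
    }

    /** Uniform draw from the open interval (0, 1), so log() never sees 0. */
    private static double openUniform(Random rng) {
        double u;
        do { u = rng.nextDouble(); } while (u == 0.0);
        return u;
    }

    private static double positive(double w) {
        if (!(w > 0.0)) throw new IllegalArgumentException("weights must be positive");
        return w;
    }

    /** Returns the indices of up to k items sampled by weight (order is arbitrary). */
    static int[] sample(double[] weights, int k, Random rng) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        // Min-heap on key: the head is the item that gets evicted next.
        PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.key));
        int i = 0;
        // Fill phase: the first k items enter with key u^(1/w).
        for (; i < weights.length && heap.size() < k; i++) {
            heap.add(new Entry(i, Math.pow(openUniform(rng), 1.0 / positive(weights[i]))));
        }
        // Jump phase: draw how much total weight to skip before the next replacement.
        while (i < weights.length) {
            double minKey = heap.peek().key;
            if (minKey >= 1.0) break; // keys cannot be beaten (rounding edge case)
            double jump = Math.log(openUniform(rng)) / Math.log(minKey);
            double skipped = 0.0;
            while (i < weights.length && skipped + positive(weights[i]) < jump) {
                skipped += weights[i];
                i++;
            }
            if (i == weights.length) break; // stream ended inside the jump
            // Item i crosses the threshold: it replaces the minimum-key item,
            // with a key drawn so that it exceeds the old minimum.
            double w = weights[i];
            double lower = Math.pow(minKey, w);
            double r = lower + (1.0 - lower) * rng.nextDouble();
            heap.poll();
            heap.add(new Entry(i, Math.pow(r, 1.0 / w)));
            i++;
        }
        int[] result = new int[heap.size()];
        int n = 0;
        for (Entry e : heap) result[n++] = e.index;
        return result;
    }

    public static void main(String[] args) {
        int[] picked = sample(new double[] {1, 2, 3, 4, 5, 6, 7, 8}, 3, new Random());
        System.out.println(java.util.Arrays.toString(picked));
    }
}
```

The skipped items still have their weights read, but no random number or heap operation is spent on them, which is where the expected saving on large inputs would come from. Merging partial reservoirs, as an algebraic implementation would need to do, is not covered by this sketch.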