[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-02-08 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895791#comment-13895791
 ] 

jian wang commented on DATAFU-16:
-

Matt, do you think we should go ahead and implement the exponential jump only 
for the accumulate-based model? For the algebraic model, we would keep using 
weighted reservoir sampling without the exponential jump. 

The upside of introducing the exponential jump: it could improve job 
performance, especially when there is a lot of data to process, without 
sacrificing much sampling precision (the per-item sampling probability stays 
close to w/sum(w)). 

The downside: the accumulate-based model is likely to be used less often than 
the algebraic one, so is it worthwhile to introduce this enhancement?
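
For reference, here is a minimal sketch of what the exponential jump changes in 
the accumulate-style loop. It follows the commonly described A-ExpJ scheme of 
Efraimidis and Spirakis and is not the attached ScoredExpJmpReservoir.java; the 
class and method names (ExpJumpReservoirSketch, sample, nextJump) are 
hypothetical, items are represented by their array index, and weights are 
assumed to be strictly positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical sketch of weighted reservoir sampling with exponential jumps. */
final class ExpJumpReservoirSketch {

    /** An item index together with the random key that ranks it. */
    static final class Scored implements Comparable<Scored> {
        final int index;
        final double key;
        Scored(int index, double key) { this.index = index; this.key = key; }
        @Override public int compareTo(Scored o) { return Double.compare(key, o.key); }
    }

    /** Returns up to k distinct item indices; weights are assumed positive. */
    static List<Integer> sample(double[] weights, int k, Random rnd) {
        List<Integer> picked = new ArrayList<>();
        if (k <= 0) return picked;
        PriorityQueue<Scored> reservoir = new PriorityQueue<>(); // min-heap on key
        double toSkip = 0.0; // total weight still to pass over before the next swap
        for (int i = 0; i < weights.length; i++) {
            double w = weights[i];
            if (reservoir.size() < k) {
                // Fill phase: key = u^(1/w), same as the version without jumps.
                reservoir.add(new Scored(i, Math.pow(openUniform(rnd), 1.0 / w)));
                if (reservoir.size() == k) toSkip = nextJump(reservoir.peek().key, rnd);
                continue;
            }
            toSkip -= w;
            if (toSkip > 0) continue; // skipped item: no random number is drawn
            // This item replaces the current minimum; its key is drawn so that
            // it is at least as large as the old threshold key.
            double t = Math.pow(reservoir.peek().key, w);
            double u = t + (1.0 - t) * rnd.nextDouble(); // uniform in [t, 1)
            reservoir.poll();
            reservoir.add(new Scored(i, Math.pow(u, 1.0 / w)));
            toSkip = nextJump(reservoir.peek().key, rnd);
        }
        for (Scored s : reservoir) picked.add(s.index);
        return picked;
    }

    /** Uniform draw from the open interval (0, 1). */
    static double openUniform(Random rnd) {
        double u;
        do { u = rnd.nextDouble(); } while (u == 0.0);
        return u;
    }

    /** Weight to skip: log(r) / log(threshold), with r uniform in (0, 1). */
    static double nextJump(double threshold, Random rnd) {
        if (threshold >= 1.0) return Double.POSITIVE_INFINITY; // nothing can beat it
        return Math.log(openUniform(rnd)) / Math.log(threshold);
    }
}
```

The saving comes from the skipped items: they cost one subtraction each instead 
of a random draw and a power computation. The jump counter is state carried 
from one item to the next within a single pass, which is why the sketch fits 
the accumulate-style loop.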

> weighted reservoir sampling with exponential jumps UDF
> --
>
> Key: DATAFU-16
> URL: https://issues.apache.org/jira/browse/DATAFU-16
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac, Linux
> pig-0.11
> Reporter: jian wang
> Priority: Minor
> Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, 
> WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted 
> reservoir sampling algorithm with exponential jumps. Investigation is tracked 
> in https://github.com/linkedin/datafu/issues/80. This task is part of an 
> experiment with different weighted sampling algorithms.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (DATAFU-28) Tests are too slow

2014-02-08 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895817#comment-13895817
 ] 

jian wang commented on DATAFU-28:
-

Matt, do you have timing stats for the individual test cases in 
datafu.test.pig.stats.entropy and datafu.test.pig.sampling?

Which ant option or tool do you use to measure how long each test case takes 
to run?

> Tests are too slow
> --
>
> Key: DATAFU-28
> URL: https://issues.apache.org/jira/browse/DATAFU-28
> Project: DataFu
> Issue Type: Bug
> Reporter: Matthew Hayes
>
> I ran the tests on my laptop and it took nearly 2 hours.
> The worst offenders are {{datafu.test.pig.sampling}}, 
> {{datafu.test.pig.stats}}, and {{datafu.test.pig.stats.entropy}}.
> ||Package||Tests||Failures||Duration||Success rate||
> |datafu.test.pig.bags|27|0|1m10.72s|100%|
> |datafu.test.pig.geo|1|0|9.757s|100%|
> |datafu.test.pig.hash|4|0|41.039s|100%|
> |datafu.test.pig.linkanalysis|5|0|32.677s|100%|
> |datafu.test.pig.random|1|0|11.789s|100%|
> |datafu.test.pig.sampling|25|0|38m25.81s|100%|
> |datafu.test.pig.sessions|7|0|2m50.67s|100%|
> |datafu.test.pig.sets|9|0|5m46.70s|100%|
> |datafu.test.pig.stats|52|0|26m11.98s|100%|
> |datafu.test.pig.stats.entropy|40|0|31m30.97s|100%|
> |datafu.test.pig.urls|1|0|1m35.24s|100%|
> |datafu.test.pig.util|21|0|4m51.64s|100%|
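
As a quick cross-check of the figures quoted above, the per-package durations 
can be summed with a few lines of Java. This is only an illustrative sketch: 
the class name TestDurationTotals and its helpers are hypothetical, and the 
duration strings are copied from the table. The total comes to roughly 1 hour 
54 minutes, consistent with "nearly 2 hours".

```java
/** Hypothetical helper that sums the durations listed in the table. */
final class TestDurationTotals {

    /** Durations copied from the table, in row order. */
    static final String[] DURATIONS = {
        "1m10.72s", "9.757s", "41.039s", "32.677s", "11.789s", "38m25.81s",
        "2m50.67s", "5m46.70s", "26m11.98s", "31m30.97s", "1m35.24s", "4m51.64s"
    };

    /** Parses strings such as "38m25.81s" or "9.757s" into seconds. */
    static double toSeconds(String duration) {
        String s = duration.trim();
        if (!s.endsWith("s")) {
            throw new IllegalArgumentException("unexpected duration: " + duration);
        }
        s = s.substring(0, s.length() - 1);   // drop the trailing 's'
        int m = s.indexOf('m');               // -1 when there is no minutes part
        double minutes = m < 0 ? 0.0 : Double.parseDouble(s.substring(0, m));
        double seconds = Double.parseDouble(s.substring(m + 1));
        return minutes * 60.0 + seconds;
    }

    /** Sums all durations, in seconds. */
    static double totalSeconds(String[] durations) {
        double total = 0.0;
        for (String d : durations) total += toSeconds(d);
        return total;
    }
}
```

Calling TestDurationTotals.totalSeconds(TestDurationTotals.DURATIONS) gives the 
overall run time in seconds; the three packages named as worst offenders 
account for most of it.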


