[ https://issues.apache.org/jira/browse/DATAFU-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875504#comment-13875504 ]
jian wang commented on DATAFU-2:
--------------------------------

OK, I will create a separate patch without the exp-jump algorithm and move the streaming entropy from datafu.pig.stats.entropy.stream to datafu.pig.stats.entropy.

> UDFs for entropy and weighted sampling algorithms
> -------------------------------------------------
>
>                 Key: DATAFU-2
>                 URL: https://issues.apache.org/jira/browse/DATAFU-2
>             Project: DataFu
>          Issue Type: Task
>            Reporter: Matthew Hayes
>         Attachments: 0001-create-initial-version-of-entroy-UDFs.patch,
> 0002-update-a-few-comments-and-error-messages.patch,
> 0003-fix-a-bug-in-Entropy.accumulate-to-use-getFreq-metho.patch,
> 0004-update-entropy-implementation-following-code-review-.patch,
> 0005-update-javadocs.patch, 0006-update-javadocs.patch,
> 0007-update-the-javadocs-of-streaming-empirical-entropy-a.patch,
> 0008-update-entropy-udfs-based-on-code-review.patch,
> 0009-Implement-and-experiment-with-different-weighted-sam.patch,
> 0010-update-weighted-reservoir-sampler-constructor-unit-t.patch,
> 0011-implement-weighted-reservoir-sampling-algorithm-with.patch,
> 0012-update-licence-headers-and-remove-un-used-import-cla.patch,
> 0013-update-licence-headers-of-entropy-unit-tests.patch
>
>
> Jian Wang has suggested that we add UDFs for entropy and weighted random
> sampling and has implementations for each of these ready.
> In Jian's words:
> "In the real world, there are occasions when we need to calculate the entropy
> of discrete random variables, for instance, to calculate the mutual
> information between variables X and Y using its entropy-based formula (the
> mutual information calculation can be found at
> http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities).
> I would suggest implementing a UDF to calculate the entropy of given input
> samples, following the definition at
> http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
> This is the reference paper I used to learn about the weighted sampling
> algorithms: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
> The present WeightedSample.java implements Algorithm D.
> We may try Algorithms A, A-Res and A-ExpJ since they can be used in a data
> stream and distributed environment. These algorithms could be implemented
> based on ReservoirSample.java (inherit from this class?) since they also need
> a reservoir to store the selected items."
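
For readers following the thread, here is a minimal sketch of the two ideas in the quoted description: empirical entropy of discrete samples, and weighted reservoir sampling in the style of A-Res. This is not the code from the attached patches. The class name EntropyAndSamplingSketch, the method names, and the choice of natural log are all assumptions made for illustration. It also assumes A-Res assigns each item the key u^(1/w), with u drawn uniformly from [0, 1), and keeps the k items with the largest keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical illustration class; not part of DataFu.
public class EntropyAndSamplingSketch {

  /** Empirical entropy in nats: H = -sum_x p(x) * ln p(x), with p(x) = count(x) / n. */
  public static <T> double entropy(Iterable<T> samples) {
    Map<T, Long> counts = new HashMap<>();
    long n = 0;
    for (T s : samples) {
      counts.merge(s, 1L, Long::sum);
      n++;
    }
    if (n == 0) {
      return 0.0; // assumption: empty input is treated as zero entropy
    }
    double h = 0.0;
    for (long c : counts.values()) {
      double p = (double) c / n;
      h -= p * Math.log(p);
    }
    return h;
  }

  /**
   * Weighted sampling without replacement in a single pass (A-Res style).
   * Each item with weight w > 0 gets key u^(1/w); the k largest keys win.
   * Items with non-positive weight are skipped (an assumption of this sketch).
   */
  public static <T> List<T> weightedSample(List<T> items, double[] weights, int k, Random rng) {
    if (items.size() != weights.length) {
      throw new IllegalArgumentException("items and weights must have the same length");
    }
    List<T> out = new ArrayList<>();
    if (k <= 0) {
      return out;
    }
    final double[] keys = new double[weights.length];
    // Min-heap of item indices ordered by key, so the smallest key is evicted first.
    PriorityQueue<Integer> reservoir =
        new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> keys[i]));
    for (int i = 0; i < weights.length; i++) {
      if (weights[i] <= 0) {
        continue;
      }
      keys[i] = Math.pow(rng.nextDouble(), 1.0 / weights[i]);
      if (reservoir.size() < k) {
        reservoir.add(i);
      } else if (keys[i] > keys[reservoir.peek()]) {
        reservoir.poll();
        reservoir.add(i);
      }
    }
    for (int i : reservoir) {
      out.add(items.get(i));
    }
    return out;
  }
}
```

With an entropy function like this, the mutual information mentioned in the description could be computed as H(X) + H(Y) - H(X, Y), where H(X, Y) is the entropy of the paired samples. The reservoir only ever holds k entries, which is why this family of algorithms suits streaming input.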