[ 
https://issues.apache.org/jira/browse/DATAFU-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Hayes closed DATAFU-2.
------------------------------


> UDFs for entropy and weighted sampling algorithms
> -------------------------------------------------
>
>                 Key: DATAFU-2
>                 URL: https://issues.apache.org/jira/browse/DATAFU-2
>             Project: DataFu
>          Issue Type: Task
>            Reporter: Matthew Hayes
>            Assignee: Matthew Hayes
>             Fix For: 1.3.0
>
>         Attachments: 0001-create-initial-version-of-entroy-UDFs.patch, 
> 0002-update-a-few-comments-and-error-messages.patch, 
> 0003-fix-a-bug-in-Entropy.accumulate-to-use-getFreq-metho.patch, 
> 0004-update-entropy-implementation-following-code-review-.patch, 
> 0005-update-javadocs.patch, 0006-update-javadocs.patch, 
> 0007-update-the-javadocs-of-streaming-empirical-entropy-a.patch, 
> 0008-update-entropy-udfs-based-on-code-review.patch, 
> 0009-Implement-and-experiment-with-different-weighted-sam.patch, 
> 0010-update-weighted-reservoir-sampler-constructor-unit-t.patch, 
> 0011-update-licence-headers-and-move-streaming-entropy-to.patch, 
> 0012-add-missing-licence-header.patch
>
>
> Jian Wang has suggested that we add UDFs for entropy and weighted random 
> sampling and has implementations for each of these ready.
> In Jian's words:
> "In the real world, there are occasions we need to calculate the entropy of 
> discrete random variables, for instance, to calculate the mutual information 
> between variables X and Y using its entropy-based formula (mutual information 
> calculation could be found at 
> http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities).
>  Would suggest to implement a UDF to calculate the entropy of given input 
> samples, following the definition at 
> http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
> This is the reference paper I use to learn about the weighted sampling 
> algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
> The present WeightedSample.java implements the Algorithm D.
> We may try Algorithm A, A-res and A-expJ since they could be used in a data 
> stream and distributed environment. These algorithms could be implemented 
> based on ReservoirSample.java (inherit from this class?) since they also need 
> a reservoir to store the selected items."
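
For readers unfamiliar with the quantities mentioned in the description, the following is a minimal sketch of empirical (plug-in) entropy and of the entropy-based mutual information formula I(X;Y) = H(X) + H(Y) - H(X,Y). It is written in Python for brevity and is not the DataFu implementation; the function names and the `base` parameter are assumptions made for illustration.

```python
import math
from collections import Counter


def empirical_entropy(samples, base=math.e):
    """Plug-in entropy estimate of a discrete sample.

    Hypothetical helper, not a DataFu API. Computes
    H = -sum(p_i * log(p_i)) with p_i = count_i / total.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention chosen here for an empty sample
    h = 0.0
    for count in counts.values():
        p = count / total
        h -= p * math.log(p)
    # Convert from nats to the requested logarithm base.
    return h / math.log(base)


def mutual_information(pairs, base=math.e):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for an iterable of (x, y) tuples."""
    pairs = list(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (empirical_entropy(xs, base)
            + empirical_entropy(ys, base)
            - empirical_entropy(pairs, base))
```

With `base=2` the results are in bits. When Y is a copy of X the mutual information equals H(X), and when every (x, y) combination is equally frequent it is zero.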

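The A-Res algorithm from the referenced paper can be sketched as follows: each item with weight w draws a uniform random u in [0, 1) and receives the key u^(1/w), and the reservoir keeps the k items with the largest keys. This is an illustrative Python sketch under those assumptions, not the code in the attached patches; the function name and the `rng` parameter are hypothetical.

```python
import heapq
import random


def weighted_reservoir_sample(items, k, rng=None):
    """A-Res style weighted sampling without replacement (illustrative).

    items: iterable of (value, weight) pairs with weight > 0.
    k:     reservoir size.
    rng:   optional random.Random instance (assumed parameter, useful
           for reproducible runs).
    """
    if k <= 0:
        return []
    rng = rng or random.Random()
    heap = []  # min-heap of (key, index, value); heap[0] has the smallest key
    for index, (value, weight) in enumerate(items):
        if weight <= 0:
            raise ValueError("weights must be positive")
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, value)  # index breaks ties between equal keys
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # New key beats the smallest key in the reservoir: swap it in.
            heapq.heapreplace(heap, entry)
    return [value for _, _, value in heap]
```

Because the result depends only on which keys are largest, the input can be consumed as a stream in a single pass. A distributed variant would presumably keep the keys alongside the values in each partition and merge partial reservoirs by taking the k largest keys overall; that step is not shown here.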


--
This message was sent by Atlassian JIRA
(v6.2#6252)
