Matthew Hayes created DATAFU-26:
-----------------------------------

             Summary: Resolve entropy UDF naming conventions
                 Key: DATAFU-26
                 URL: https://issues.apache.org/jira/browse/DATAFU-26
             Project: DataFu
          Issue Type: Task
            Reporter: Matthew Hayes
             Fix For: 1.3.0


There are a couple of issues with the naming of entropy UDFs that we should 
work out before the next release.

StreamingEntropy supports multiple estimation methods.  Entropy, however, only 
supports the empirical method.  The supported constructors also differ as a 
result.  Although Entropy's documentation states that it computes the empirical 
entropy, I think the name itself may lead to confusion.  

StreamingEntropy takes the data in sorted order.  Using this sorted data 
it computes counts, which are then used to compute entropy.  Entropy, on the 
other hand, takes counts directly and computes entropy.  These counts need to be 
computed before calling it.  Our convention in DataFu has been that "Streaming" 
implies that the data does not need to be sorted.  So StreamingEntropy is in 
conflict with this.
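
To make the difference between the two input styles concrete, here is a rough 
sketch in Python rather than Pig or Java.  The function names are hypothetical 
and are not the DataFu API.  It covers only the empirical (plug-in) estimate, 
and the choice of natural log is an assumption for illustration.

```python
import math
from itertools import groupby


def entropy_from_counts(counts):
    """Empirical entropy (in nats) from precomputed per-item counts.

    Mirrors the input style described for Entropy: the caller must
    aggregate the counts before calling this.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def entropy_from_sorted(items):
    """Empirical entropy (in nats) from raw items in sorted order.

    Mirrors the input style described for StreamingEntropy: counts are
    derived from runs of equal adjacent items, so unsorted input would
    split one item into several runs and give a wrong answer.
    """
    counts = [sum(1 for _ in run) for _, run in groupby(items)]
    return entropy_from_counts(counts)
```

Both calls below should agree, since the second derives the counts [2, 2] 
from the sorted items before applying the same formula:

    entropy_from_counts([2, 2])
    entropy_from_sorted(sorted(["b", "a", "b", "a"]))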

My proposal is:

1) Rename Entropy to EmpiricalEntropy
2) Rename StreamingEntropy to Entropy
3) Clearly document why you would use EmpiricalEntropy over Entropy.  
EmpiricalEntropy will be more efficient in some scenarios and we should 
explain this.

One open question I have is whether we should distinguish in the name somehow 
that EmpiricalEntropy accepts counts, not the actual items themselves.  
EmpiricalCountBasedEntropy seems verbose.




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
