[ https://issues.apache.org/jira/browse/DATAFU-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918377#comment-13918377 ]
Matthew Hayes commented on DATAFU-26:
-------------------------------------

Committed: 4aa2ef2a425dab3e3d5f5bdacf095af0b18fb993

> Resolve entropy UDF naming conventions
> --------------------------------------
>
>                 Key: DATAFU-26
>                 URL: https://issues.apache.org/jira/browse/DATAFU-26
>             Project: DataFu
>          Issue Type: Task
>            Reporter: Matthew Hayes
>            Assignee: jian wang
>             Fix For: 1.3.0
>
>         Attachments: 0001-update-entropy-naming-conventions.patch
>
>
> There are a couple of issues with the naming of the entropy UDFs that we
> should work out before the next release.
>
> StreamingEntropy supports multiple estimation methods. Entropy, however,
> supports only the empirical method. The supported constructors also differ
> as a result. Although Entropy's documentation states that it computes the
> empirical entropy, I think the name itself may lead to confusion.
>
> StreamingEntropy takes the data in sorted order. Using this sorted data it
> computes counts, which are then used to compute entropy. Entropy, on the
> other hand, takes counts directly and computes entropy from them. These
> counts need to be computed before calling it. Our convention in DataFu has
> been that "Streaming" implies that the data does not need to be sorted, so
> StreamingEntropy is in conflict with this.
>
> My proposal is:
> 1) Rename Entropy to EmpiricalEntropy
> 2) Rename StreamingEntropy to Entropy
> 3) Clearly document why you would use EmpiricalEntropy over Entropy. It will
> be more efficient in some scenarios and we should explain this.
>
> One open question I have is whether the name should somehow indicate that
> EmpiricalEntropy accepts counts, not the actual items themselves.
> EmpiricalCountBasedEntropy seems verbose.
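
To illustrate the distinction the issue draws between the two input contracts, here is a minimal Python sketch. It is not DataFu's implementation: the function names `entropy_from_counts` and `entropy_from_sorted` are hypothetical, only the empirical estimator is shown, and the natural logarithm is an assumed choice of base.

```python
import math
from itertools import groupby


def entropy_from_counts(counts):
    """Empirical entropy (in nats) from precomputed per-item counts.

    Mirrors the count-based contract described in the issue: the caller
    has already aggregated the data into occurrence counts.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # H = log(N) - (1/N) * sum(c_i * log(c_i)), equivalent to -sum(p_i * log(p_i))
    return math.log(total) - sum(c * math.log(c) for c in counts) / total


def entropy_from_sorted(items):
    """Empirical entropy from raw items that are already sorted.

    Because equal items are adjacent, counts can be derived from run
    lengths in a single pass, then fed to the count-based computation.
    """
    run_lengths = (sum(1 for _ in run) for _, run in groupby(items))
    return entropy_from_counts(run_lengths)
```

Given the same underlying data, both paths yield the same value; the count-based form simply skips the pass over raw items, which is one reason it could be cheaper when counts are already available.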