[
https://issues.apache.org/jira/browse/DATAFU-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
jian wang updated DATAFU-26:
----------------------------
Attachment: 0001-update-entropy-naming-conventions.patch
> Resolve entropy UDF naming conventions
> --------------------------------------
>
> Key: DATAFU-26
> URL: https://issues.apache.org/jira/browse/DATAFU-26
> Project: DataFu
> Issue Type: Task
> Reporter: Matthew Hayes
> Assignee: jian wang
> Fix For: 1.3.0
>
> Attachments: 0001-update-entropy-naming-conventions.patch
>
>
> There are a couple of issues with the naming of the entropy UDFs that we
> should work out before the next release.
> StreamingEntropy supports multiple estimation methods. Entropy, however, only
> supports empirical estimation. The supported constructors also differ as a
> result. Although Entropy's documentation states that it computes the empirical
> entropy, I think the name itself may lead to confusion.
> StreamingEntropy takes the data in sorted order. Using this sorted data
> it computes counts, which are then used to compute entropy. Entropy, on the
> other hand, takes counts directly and computes entropy. These counts need to
> be computed before calling it. Our convention in DataFu has been that
> "Streaming" implies that the data does not need to be sorted, so
> StreamingEntropy is in conflict with this.
> My proposal is:
> 1) Rename Entropy to EmpiricalEntropy
> 2) Rename StreamingEntropy to Entropy
> 3) Clearly document why you would use EmpiricalEntropy over Entropy. It will
> be more efficient in some scenarios and we should explain this.
> One open question I have is whether we should distinguish in the name somehow
> that EmpiricalEntropy accepts counts, not the actual items themselves.
> EmpiricalCountBasedEntropy seems verbose.
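
The two input conventions described in the issue (items in sorted order versus precomputed counts) can be illustrated with a minimal sketch. This is not DataFu's implementation or API: it is written in Python rather than as a Pig UDF, the function names entropy_from_counts and entropy_from_sorted are hypothetical, and it assumes the plain empirical (plug-in) estimator with the natural logarithm by default.

```python
import math
from itertools import groupby


def entropy_from_counts(counts, base=math.e):
    """Empirical entropy from precomputed occurrence counts.

    Hypothetical helper: mirrors the 'takes counts directly' convention.
    """
    counts = [c for c in counts if c > 0]  # zero counts contribute nothing
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total, base) for c in counts)


def entropy_from_sorted(items, base=math.e):
    """Empirical entropy from raw items supplied in sorted order.

    Hypothetical helper: because equal items are adjacent in sorted input,
    the count of each distinct item is the length of its run, so counts can
    be derived in a single pass without holding all items in memory.
    """
    run_lengths = (sum(1 for _ in group) for _, group in groupby(items))
    return entropy_from_counts(run_lengths, base)


if __name__ == "__main__":
    raw = sorted(["a", "b", "a", "c", "a", "b"])
    print(entropy_from_sorted(raw))
    print(entropy_from_counts([3, 2, 1]))
```

Both calls operate on the same distribution, so they should agree. The count-based form can be cheaper when the counts already exist, for example as the output of an earlier grouping step, since the raw items then never need to be sorted or rescanned.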
--
This message was sent by Atlassian JIRA
(v6.2#6252)