[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Makoto Yui closed HIVEMALL-184.
-------------------------------
Resolution: Abandoned
> Add an optimizer rule to filter out columns by using Mutual Information
> -----------------------------------------------------------------------
>
> Key: HIVEMALL-184
> URL: https://issues.apache.org/jira/browse/HIVEMALL-184
> Project: Hivemall
> Issue Type: Sub-task
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark
>
> Mutual Information (MI) quantifies the dependency between variables, so it
> is useful for filtering out redundant columns during feature selection.
> Nearest-neighbor distances are frequently used to estimate MI [1], so we
> could use them to compute the MI between the columns of each relation when
> running an ANALYZE command. The optimizer could then filter out "similar"
> columns by referring to a new threshold (e.g.
> `spark.sql.optimizer.featureSelection.mutualInfoThreshold`).
> As a separate task, we need a lightweight way to update MI when an ANALYZE
> command is re-run. A recent study [2] proposed a sophisticated technique for
> computing MI on dynamic data.
> [1] Dafydd Evans, A computationally efficient estimator for mutual
> information, Proceedings of the Royal Society of London A: Mathematical,
> Physical and Engineering Sciences, Vol. 464, The Royal Society, 1203–1215, 2008.
> [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual
> Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018.
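
For illustration, the column-filtering idea in the description could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not Hivemall's or Spark's implementation: it swaps the nearest-neighbor estimator of [1] for a simple equal-width histogram (plug-in) estimator to stay short, the function names `binned`, `mutual_info`, and `select_columns` are hypothetical, and the `threshold` argument only stands in for the proposed `spark.sql.optimizer.featureSelection.mutualInfoThreshold` setting.

```python
import math
from collections import Counter


def binned(values, bins=4):
    """Map numeric values to equal-width bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant column: a single bin
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_info(xs, ys, bins=4):
    """Plug-in MI estimate (in nats) from the joint histogram of xs and ys."""
    n = len(xs)
    bx, by = binned(xs, bins), binned(ys, bins)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    # Sum over non-empty cells of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in pxy.items()
    )


def select_columns(table, threshold, bins=4):
    """Greedily keep a column unless its MI with a kept column exceeds threshold."""
    kept = []
    for name in table:
        if all(mutual_info(table[name], table[k], bins) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy relation: "b" is a linear function of "a"; "c" is independent of both.
a = [i % 4 for i in range(16)]
table = {"a": a, "b": [2 * v + 1 for v in a], "c": [i // 4 for i in range(16)]}
kept = select_columns(table, threshold=0.5)  # kept == ["a", "c"]
```

The greedy pass keeps the first column of each "similar" group, so the result depends on column order; an actual optimizer rule would presumably read precomputed MI values from the ANALYZE statistics rather than recompute them per query.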