[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Labels: spark  (was: )

> Plan rewriting rules to filter out meaningless columns before feature selection
> ------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique
> for choosing a subset of relevant features during model construction, which
> simplifies models and shortens training times.
> scikit-learn has some APIs for feature selection
> (http://scikit-learn.org/stable/modules/feature_selection.html), but
> this selection is too time-consuming if the training data have a large
> number of columns
> (the number can frequently exceed 1,000 in business use cases).
> An objective of this ticket is to add new optimizer rules in Spark to filter
> out meaningless columns before feature selection.
> As a simple example, Spark might be able to filter out columns with low
> variances (this process corresponds to `VarianceThreshold` in
> scikit-learn)
> by implicitly adding a `Project` node on top of a user plan.
> Then, the Spark optimizer might push this `Project` node down into leaf nodes
> (e.g., `LogicalRelation`), and
> the plan execution could be significantly faster.
> Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and put relevant activities (papers
> and other OSS functionalities)
> in this ticket to track them.
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join
> or Not to Join?: Thinking Twice about Joins before Feature Selection,
> Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to
> avoid when learning high-capacity classifiers?, Proceedings of the VLDB
> Endowment, Volume 11 Issue 3, Pages 366-379, 2017.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
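
For illustration only, the low-variance filtering idea in the ticket can be sketched in plain Python. This is not the proposed Spark optimizer rule and uses no Spark or Hivemall API. The function names (`low_variance_columns`, `project`, `prune_low_variance`) and the sample rows are hypothetical. The sketch assumes every column is numeric and, like `VarianceThreshold`, drops columns whose population variance does not exceed the threshold.

```python
from statistics import pvariance


def low_variance_columns(rows, threshold=0.0):
    """Return names of columns whose population variance is <= threshold."""
    if not rows:
        return []
    return [c for c in rows[0] if pvariance([r[c] for r in rows]) <= threshold]


def project(rows, keep):
    """Keep only the named columns, loosely analogous to a Project node."""
    return [{c: r[c] for c in keep} for r in rows]


def prune_low_variance(rows, threshold=0.0):
    """Drop low-variance columns; return the pruned rows and the kept names."""
    dropped = set(low_variance_columns(rows, threshold))
    keep = [c for c in rows[0] if c not in dropped] if rows else []
    return project(rows, keep), keep


# Hypothetical sample data: "constant" never varies, so it carries no signal.
rows = [
    {"age": 31.0, "constant": 1.0, "income": 52.0},
    {"age": 45.0, "constant": 1.0, "income": 61.0},
    {"age": 28.0, "constant": 1.0, "income": 48.0},
]
pruned, kept = prune_low_variance(rows)
print(kept)  # ['age', 'income']
```

In an actual optimizer rule, the variances would presumably come from column statistics rather than a full scan, and the projection would be a plan node that can be pushed down toward the data source. Both points are assumptions here, not details given in the ticket.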