[ 
https://issues.apache.org/jira/browse/SPARK-26172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-26172:
---------------------------------
    Description: 
For now, there are three ways of handling case-insensitivity in ML:

1. support case-insensitivity, e.g. {{LogisticRegression}};

2. support case-insensitivity, but with the getter returning the lower-case 
value (not the value passed to the setter), e.g. {{ALS}}, 
{{DecisionTreeClassifier}};

3. do not support case-insensitivity, e.g. {{NaiveBayes}}.

This inconsistency causes confusion in usage.
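
As a rough illustration of the three behaviours listed above, here is a 
Spark-free toy model. The object and class names ({{ThreeBehavioursSketch}}, 
{{PreservesCase}}, {{LowersCase}}, {{CaseSensitive}}) and the set of supported 
values are hypothetical; they only mimic what the estimators above are 
described as doing and are not taken from the Spark code base.

```scala
import java.util.Locale

// Toy model only: these classes are hypothetical and do not use Spark.
object ThreeBehavioursSketch {
  // Illustrative set of supported values, stored in lower case.
  private val supported = Set("gini", "entropy")

  private def lower(s: String): String = s.toLowerCase(Locale.ROOT)

  // Way 1: any case is accepted and the getter returns the value as passed.
  class PreservesCase {
    private var value = "gini"
    def set(v: String): this.type = {
      require(supported.contains(lower(v)), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }

  // Way 2: any case is accepted, but the getter returns the lower-cased value.
  class LowersCase {
    private var value = "gini"
    def set(v: String): this.type = {
      require(supported.contains(lower(v)), s"unsupported value: $v")
      value = lower(v)
      this
    }
    def get: String = value
  }

  // Way 3: only the exact (lower-case) spelling is accepted.
  class CaseSensitive {
    private var value = "gini"
    def set(v: String): this.type = {
      require(supported.contains(v), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }
}
```

In this model, {{new PreservesCase().set("Gini").get}} returns {{"Gini"}}, 
{{new LowersCase().set("Gini").get}} returns {{"gini"}}, and 
{{new CaseSensitive().set("Gini")}} fails with an {{IllegalArgumentException}}.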

I think we should choose the *first* way and support case-insensitivity for all 
non-columnName string params, including:
 * LogisticRegression: family
 * MultilayerPerceptronClassifier: solver
 * NaiveBayes: modelType
 * DecisionTreeClassifier: impurity
 * RandomForestClassifier: featureSubsetStrategy, impurity
 * GBTClassifier: featureSubsetStrategy, impurity, lossType

 * LinearRegression: solver, loss
 * GeneralizedLinearRegression: family, link, solver
 * DecisionTreeRegressor: impurity
 * RandomForestRegressor: featureSubsetStrategy, impurity
 * GBTRegressor: featureSubsetStrategy, impurity, lossType

 * KMeans: initMode
 * LDA: optimizer
 * PowerIterationClustering: initMode

 * ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel

 * Bucketizer: handleInvalid
 * ChiSqSelector: selectorType
 * Imputer: strategy
 * QuantileDiscretizer: handleInvalid
 * RFormula: handleInvalid, stringIndexerOrderType
 * StringIndexer: handleInvalid, stringOrderType
 * VectorAssembler: handleInvalid
 * VectorIndexer: handleInvalid
 * VectorSizeHint: handleInvalid
 * OneHotEncoderEstimator: handleInvalid (*this will be left alone until the 
breaking change*)

 * BinaryClassificationEvaluator: metricName
 * MulticlassClassificationEvaluator: metricName
 * RegressionEvaluator: metricName
 * ClusteringEvaluator: metricName, distanceMeasure

To do this:
 * methods {{lowerCaseInArray}} and {{upperCaseInArray}} are added to 
{{ParamValidators}} to validate values case-insensitively;
 * methods {{$$(param: Param[String])}} and {{%%(param: Param[String])}} 
are added to trait {{Params}} to conveniently get the lower-cased/upper-cased 
param value. This minimizes the changes to existing code, since in many cases 
we only need to change {{$(param)}} to {{$$(param)}};
 * in {{SharedParamsCodeGen}}, {{handleInvalid}} and {{distanceMeasure}} are 
updated to use {{lowerCaseInArray}}.
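
The helpers described in the list above could look roughly like the following. 
This is a minimal, Spark-free sketch: the names {{lowerCaseInArray}}, 
{{upperCaseInArray}}, {{$$}} and {{%%}} come from the description, but their 
bodies, the {{ToyParams}} class and its map-based storage keyed by param name 
are assumptions made for illustration, not the actual patch.

```scala
import java.util.Locale

// Sketch under stated assumptions; nothing here depends on Spark.
object CaseInsensitiveParamSketch {

  // Validator factory: accepts a value if its lower-cased form is allowed.
  // Assumes `allowed` already contains lower-case strings.
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Same idea for params whose canonical values are upper-case.
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))

  // Hypothetical stand-in for a class that mixes in the `Params` trait.
  class ToyParams(isValid: String => Boolean) {
    private var values = Map.empty[String, String]

    def set(name: String, value: String): this.type = {
      require(isValid(value), s"invalid value for $name: $value")
      // Store the value exactly as passed, so the getter preserves its case.
      values = values.updated(name, value)
      this
    }

    // Plays the role of `$(param)`: returns the stored value unchanged.
    def $(name: String): String = values(name)

    // Plays the role of `$$(param)`: the lower-cased value for internal use.
    def $$(name: String): String = $(name).toLowerCase(Locale.ROOT)

    // Plays the role of `%%(param)`: the upper-cased value for internal use.
    def %%(name: String): String = $(name).toUpperCase(Locale.ROOT)
  }
}
```

With this sketch, a value set as {{"Auto"}} passes a validator built from 
{{Array("auto", "normal")}}, is returned unchanged by {{$}}, and is seen as 
{{"auto"}} by internal code that switches from {{$}} to {{$$}}.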

 



> Unify String Params' case-insensitivity in ML
> ---------------------------------------------
>
>                 Key: SPARK-26172
>                 URL: https://issues.apache.org/jira/browse/SPARK-26172
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major


