[ https://issues.apache.org/jira/browse/SPARK-26172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng updated SPARK-26172:
---------------------------------
    Description: 
For now, there are three ways to deal with case-insensitivity in ML:
1, support case-insensitivity, e.g. {{LogisticRegression}};
2, support case-insensitivity, but with the getter returning the lower-case value (not the value passed to the setter), e.g. {{ALS}}, {{DecisionTreeClassifier}};
3, do not support case-insensitivity, e.g. {{NaiveBayes}}.

This situation results in confusion in usage.
I think we should choose the *first* way and support case-insensitivity for all non-columnName string params, including:
 * LogisticRegression: family
 * MultilayerPerceptronClassifier: solver
 * NaiveBayes: modelType
 * DecisionTreeClassifier: impurity
 * RandomForestClassifier: featureSubsetStrategy, impurity
 * GBTClassifier: featureSubsetStrategy, impurity, lossType

 * LinearRegression: solver, loss
 * GeneralizedLinearRegression: family, link, solver
 * DecisionTreeRegressor: impurity
 * RandomForestRegressor: featureSubsetStrategy, impurity
 * GBTRegressor: featureSubsetStrategy, impurity, lossType

 * KMeans: initMode
 * LDA: optimizer
 * PowerIterationClustering: initMode

 * ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel

 * Bucketizer: handleInvalid
 * ChiSqSelector: selectorType
 * Imputer: strategy
 * QuantileDiscretizer: handleInvalid
 * RFormula: handleInvalid, stringIndexerOrderType
 * StringIndexer: handleInvalid, stringOrderType
 * VectorAssembler: handleInvalid
 * VectorIndexer: handleInvalid
 * VectorSizeHint: handleInvalid
 * OneHotEncoderEstimator: handleInvalid (*this will be left alone until the breaking change*)

 * BinaryClassificationEvaluator: metricName
 * MulticlassClassificationEvaluator: metricName
 * RegressionEvaluator: metricName
 * ClusteringEvaluator: metricName, distanceMeasure

To do this:
 * methods {{lowerCaseInArray}} and {{upperCaseInArray}} are created in {{ParamValidators}} to validate values case-insensitively;
 * methods {{$$(param: Param[String])}} and {{%%(param: Param[String])}} are created in trait {{Params}} to conveniently get the lower-case/upper-case param value. This minimizes the changes to existing code, since in many cases we only need to change {{$(param)}} to {{$$(param)}};
 * in {{SharedParamsCodeGen}}, {{handleInvalid}} and {{distanceMeasure}} are updated to use {{lowerCaseInArray}}.

> Unify String Params' case-insensitivity in ML
> ---------------------------------------------
>
>                 Key: SPARK-26172
>                 URL: https://issues.apache.org/jira/browse/SPARK-26172
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major
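
The helper methods named in the description might look roughly like the following. This is only a minimal, self-contained sketch that does not depend on Spark: {{ParamValidatorsSketch}}, {{StringParam}}, {{ParamsSketch}} and {{NaiveBayesSketch}} are hypothetical stand-ins, and the bodies of {{lowerCaseInArray}}, {{upperCaseInArray}}, {{$$}} and {{%%}} are assumptions about the intended behaviour, not the code from the actual patch.

```scala
import java.util.Locale
import scala.collection.mutable

// Hypothetical stand-in for the proposed additions to ParamValidators.
object ParamValidatorsSketch {
  // Accepts a value whose lower-cased form is in `allowed` (assumed lower case).
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    value => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Accepts a value whose upper-cased form is in `allowed` (assumed upper case).
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    value => allowed.contains(value.toUpperCase(Locale.ROOT))
}

// Hypothetical, simplified replacement for Param[String].
final class StringParam(val name: String, val isValid: String => Boolean)

// Hypothetical, simplified replacement for the Params trait.
trait ParamsSketch {
  private val values = mutable.Map.empty[String, String]

  // Validates case-insensitively but stores the value exactly as given.
  def set(param: StringParam, value: String): this.type = {
    require(param.isValid(value), s"${param.name} was given an invalid value: $value")
    values(param.name) = value
    this
  }

  // Returns the value as passed to the setter (the "first way" in the description).
  def $(param: StringParam): String = values(param.name)

  // Lower-cased view of the value, for internal comparisons.
  def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)

  // Upper-cased view of the value, for internal comparisons.
  def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
}

// Hypothetical estimator with one case-insensitive string param.
class NaiveBayesSketch extends ParamsSketch {
  val modelType = new StringParam(
    "modelType",
    ParamValidatorsSketch.lowerCaseInArray(Array("multinomial", "bernoulli")))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val nb = new NaiveBayesSketch
    nb.set(nb.modelType, "Multinomial")
    println(nb.$(nb.modelType))  // Multinomial
    println(nb.$$(nb.modelType)) // multinomial
  }
}
```

Under these assumptions the getter still returns what the user passed in, while internal code that previously matched on {{$(param)}} would match on {{$$(param)}} instead.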