Pablo J. Villacorta created SPARK-24712:
-------------------------------------------

             Summary: TrainValidationSplit ignores label column name and forces to be "label"
                 Key: SPARK-24712
                 URL: https://issues.apache.org/jira/browse/SPARK-24712
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Pablo J. Villacorta


When a TrainValidationSplit is fit on a Pipeline containing an ML model, the labelCol property of the model is ignored, and the call to fit() fails unless labelCol equals "label". As an example, the following PySpark code only works when the variable labelColumn is set to "label":

{code:java}
from pyspark.sql.functions import rand, randn
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# `spark` is the SparkSession provided by the pyspark shell
labelColumn = "target"  # CHANGE THIS TO "label" AND THE CODE WORKS

# Toy data: one uniform feature and a normally distributed label column
df = spark.range(0, 10).select(rand(seed=10).alias("uniform"), randn(seed=27).alias(labelColumn))

vectorAssembler = VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
mypipeline = Pipeline(stages=[vectorAssembler, lr])

paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.01, 0.1])\
    .build()

trainValidationSplit = TrainValidationSplit()\
    .setEstimator(mypipeline)\
    .setEvaluator(RegressionEvaluator())\
    .setEstimatorParamMaps(paramGrid)\
    .setTrainRatio(0.8)

trainValidationSplit.fit(df)  # FAILS UNLESS labelColumn IS SET TO "label"
{code}