[ https://issues.apache.org/jira/browse/SPARK-24712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marco Gaido resolved SPARK-24712.
---------------------------------
    Resolution: Not A Problem

> TrainValidationSplit ignores label column name and forces to be "label"
> -----------------------------------------------------------------------
>
>                 Key: SPARK-24712
>                 URL: https://issues.apache.org/jira/browse/SPARK-24712
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Pablo J. Villacorta
>            Priority: Major
>
> When a TrainValidationSplit is fit on a Pipeline containing an ML model, the
> labelCol property of the model is ignored, and the call to fit() fails
> unless labelCol equals "label". As an example, the following pyspark code
> only works when the variable labelColumn is set to "label":
> {code:java}
> from pyspark.sql.functions import rand, randn
> from pyspark.ml import Pipeline
> from pyspark.ml.evaluation import RegressionEvaluator
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.regression import LinearRegression
> from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
> labelColumn = "target" # CHANGE THIS TO "label" AND THE CODE WORKS
> df = spark.range(0, 10).select(rand(seed=10).alias("uniform"),
>                                randn(seed=27).alias(labelColumn))
> vectorAssembler = VectorAssembler().setInputCols(["uniform"]).setOutputCol("features")
> lr = LinearRegression().setFeaturesCol("features").setLabelCol(labelColumn)
> mypipeline = Pipeline(stages = [vectorAssembler, lr])
> paramGrid = ParamGridBuilder()\
>     .addGrid(lr.regParam, [0.01, 0.1])\
>     .build()
> trainValidationSplit = TrainValidationSplit()\
>     .setEstimator(mypipeline)\
>     .setEvaluator(RegressionEvaluator())\
>     .setEstimatorParamMaps(paramGrid)\
>     .setTrainRatio(0.8)
> trainValidationSplit.fit(df) # FAIL UNLESS labelColumn IS SET TO "label"
> {code}