Edward Ma created SPARK-16247:
---------------------------------

             Summary: Using pyspark dataframe with pipeline and cross validator
                 Key: SPARK-16247
                 URL: https://issues.apache.org/jira/browse/SPARK-16247
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 1.6.1
            Reporter: Edward Ma

I am using pyspark with DataFrames, and a Pipeline to train a model and predict results. This works for a single test run, but I run into a problem when I combine the Pipeline with CrossValidator.

I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and feature columns. Those columns are built by StringIndexer and VectorIndexer, so they should exist after the pipeline has been executed. However, when I dug into the pyspark library (line 222, the _fit function, and line 239, est.fit), I found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg".

Could you advise whether my usage is correct? Thanks.

Here is the code snippet:

# Indexing
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)

# Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])

# Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)
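
To make the expectation above concrete, here is a minimal pure-Python sketch of a k-fold loop over a pipeline. It models the behaviour the report expects: the pipeline is fitted on each training fold, and the columns added by its stages are present on the validation rows when they are scored. Everything in it (ToyIndexer, ToyMajorityClassifier, ToyPipeline, cross_validate, the sample rows) is a hypothetical stand-in written for illustration. None of it is pyspark code, and it is not a copy of pyspark's _fit.

```python
class ToyIndexer:
    """Hypothetical indexing stage: adds output_col to every row."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = {}

    def fit(self, rows):
        values = sorted({row[self.input_col] for row in rows})
        self.mapping = {value: i for i, value in enumerate(values)}
        return self

    def transform(self, rows):
        # Values not seen during fit map to -1 in this toy model.
        return [dict(row, **{self.output_col: self.mapping.get(row[self.input_col], -1)})
                for row in rows]


class ToyMajorityClassifier:
    """Hypothetical final stage: predicts the most frequent value of label_col."""

    def __init__(self, label_col):
        self.label_col = label_col
        self.majority = None

    def fit(self, rows):
        # Raises KeyError if an earlier stage did not create label_col.
        labels = [row[self.label_col] for row in rows]
        self.majority = max(set(labels), key=labels.count)
        return self

    def transform(self, rows):
        return [dict(row, prediction=self.majority) for row in rows]


class ToyPipeline:
    """Hypothetical pipeline: fits each stage on the output of the previous ones."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def cross_validate(make_pipeline, rows, num_folds, label_col):
    """Fit a fresh pipeline per fold; return mean accuracy and the last scored fold."""
    accuracies = []
    scored = []
    for fold in range(num_folds):
        validation = [r for i, r in enumerate(rows) if i % num_folds == fold]
        train = [r for i, r in enumerate(rows) if i % num_folds != fold]
        model = make_pipeline().fit(train)    # the stages run here, once per fold
        scored = model.transform(validation)  # indexed columns are added here
        correct = sum(1 for r in scored if r["prediction"] == r[label_col])
        accuracies.append(correct / len(scored))
    return sum(accuracies) / num_folds, scored


def make_pipeline():
    # Unfitted stages, so each fold fits its own indexer.
    return ToyPipeline([ToyIndexer("label", "indexedLabel"),
                        ToyMajorityClassifier("indexedLabel")])


rows = [{"label": v} for v in ["ham", "ham", "ham", "spam", "ham", "ham"]]
average_accuracy, last_scored = cross_validate(make_pipeline, rows, 3, "indexedLabel")
print(sorted(last_scored[0].keys()))
```

In this sketch the input rows never carry "indexedLabel"; the column only appears on the rows returned by the fitted pipeline's transform. The scoring step therefore has to be told to read "indexedLabel" rather than "label".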