[ https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Ma closed SPARK-16247.
-----------------------------

Usage error. Resolved.

> Using pyspark dataframe with pipeline and cross validator
> ---------------------------------------------------------
>
>                 Key: SPARK-16247
>                 URL: https://issues.apache.org/jira/browse/SPARK-16247
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.1
>            Reporter: Edward Ma
>
> I am using pyspark with DataFrames, and a Pipeline to train and predict.
> This works fine for a single test run.
> However, I hit an issue when combining Pipeline with CrossValidator. I
> expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label
> and feature columns. Those columns are built by StringIndexer and
> VectorIndexer, so they should exist after the pipeline has executed.
> I then dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222,
> the _fit function, and line 239, est.fit) and found that it does not execute
> the pipeline stages. As a result, I cannot get "indexedLabel" and
> "indexedMsg".
> Could you advise whether my usage is correct?
> Thanks.
> Here is the code snippet:
> {noformat}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> from pyspark.ml.feature import StringIndexer, VectorIndexer
> from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
>
> # extracted_data and trainingData are DataFrames prepared elsewhere.
>
> # Indexing
> labelIndexer = StringIndexer(
>     inputCol="label", outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(
>     inputCol="extracted_msg", outputCol="indexedMsg",
>     maxCategories=3000).fit(extracted_data)
>
> # Training
> classification_model = RandomForestClassifier(
>     labelCol="indexedLabel", featuresCol="indexedMsg",
>     numTrees=50, maxDepth=20)
> pipeline = Pipeline(
>     stages=[labelIndexer, featureIndexer, classification_model])
>
> # Cross validation
> paramGrid = ParamGridBuilder().addGrid(
>     classification_model.maxDepth, [10, 20, 30]).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
>                     evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
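
For readers following the question above, here is a minimal pure-Python sketch of the general fit/transform flow when a pipeline is passed as the estimator to a cross validator. It is only an illustration under stated assumptions: ToyIndexer, ToyPipeline, toy_cross_validate and the other Toy* names are hypothetical stand-ins invented for this example, not the pyspark classes, and the fold splitting is deliberately simplistic. The point it tries to show is that calling fit on the pipeline for each fold runs every stage in order, so columns produced by earlier stages (here "indexedLabel") are present in the transformed validation rows.

```python
# Toy stand-ins only: these classes mimic an estimator/model flow
# and are NOT the pyspark implementation.


class ToyIndexerModel:
    """Fitted stage: adds output_col using a learned value -> index mapping."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Values unseen during fit get -1 in this toy; that choice is an assumption.
        return [
            dict(row, **{self.output_col: self.mapping.get(row[self.input_col], -1)})
            for row in rows
        ]


class ToyIndexer:
    """Unfitted stage: learns one integer index per distinct input value."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = sorted({row[self.input_col] for row in rows})
        mapping = {value: index for index, value in enumerate(values)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyPipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""

    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows


class ToyPipeline:
    """Unfitted pipeline: fitting it fits and applies every stage in sequence."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        models = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)  # later stages see earlier outputs
            models.append(model)
        return ToyPipelineModel(models)


def toy_cross_validate(estimator, rows, num_folds):
    """Fit the estimator on each training split, transform the held-out split."""
    outputs = []
    for fold in range(num_folds):
        train = [r for i, r in enumerate(rows) if i % num_folds != fold]
        validation = [r for i, r in enumerate(rows) if i % num_folds == fold]
        model = estimator.fit(train)
        outputs.append(model.transform(validation))
    return outputs


data = [{"label": "spam"}, {"label": "spam"}, {"label": "ham"}, {"label": "ham"}]
pipeline = ToyPipeline([ToyIndexer("label", "indexedLabel")])
folds = toy_cross_validate(pipeline, data, num_folds=2)

# Every validation row carries the column produced by the pipeline stage.
all_indexed = all("indexedLabel" in row for fold in folds for row in fold)
print(all_indexed)
```

In this toy the cross validator never touches the stages directly; it only calls fit and transform on whatever estimator it is given, and the pipeline is responsible for running its own stages.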