[ https://issues.apache.org/jira/browse/SPARK-26738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753867#comment-16753867 ]
Hyukjin Kwon commented on SPARK-26738:
--------------------------------------

Questions should go to the mailing list. Let's ask the question there before filing an issue here. You could get a better answer there.

> Pyspark random forest classifier feature importance with column names
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26738
>                 URL: https://issues.apache.org/jira/browse/SPARK-26738
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.3.2
>            Reporter: Praveen
>            Priority: Major
>              Labels: RandomForest, pyspark
>
> I am trying to plot the feature importances of a random forest classifier with the column names. I am using Spark 2.3.2 and PySpark.
> The input X is sentences, and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors.
> I have included all the stages in a Pipeline:
>
> {code:java}
> regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col, outputCol="words",
>                                 pattern="[a-zA-Z_]+", toLowercase=True,
>                                 minTokenLength=minimum_token_size)
> hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature)
> idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
> indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
> converter = IndexToString(inputCol="prediction", outputCol="original_label",
>                           labels=indexer.fit(df).labels)
> feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
> estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100)
> pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
> model = pipeline.fit(df)
> {code}
> I generate the feature importances as:
> {code:java}
> rdc = model.stages[-2]
> print(rdc.featureImportances)
> {code}
> So far so good, but when I try to map the feature importances to the feature columns as below:
> {code:java}
> from itertools import chain
>
> attrs = sorted((attr["idx"], attr["name"]) for attr in
>                chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))
> [(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]
> {code}
> I get a KeyError on ml_attr:
> {code:java}
> KeyError: 'ml_attr'
> {code}
> I printed the metadata dictionary,
> {code:java}
> print(df_pred.schema["featurescol"].metadata)
> {code}
> and it's empty: {}
> Any thoughts on what I am doing wrong? How can I map the feature importances to the column names?
> Thanks
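As a side note on the KeyError itself: HashingTF maps terms to vector indices by hashing and does not attach per-feature "ml_attr" attributes to its output column, so the features column carries no name metadata to join against, which would explain the empty dict. Below is a minimal sketch of one possible workaround, not a verified fix: it swaps HashingTF + IDF for CountVectorizer + IDF so the fitted vocabulary supplies the index-to-term mapping. A DataFrame `df` with a text column "text" and a string label column "label" is assumed for illustration.

{code:java}
# Sketch only: CountVectorizerModel keeps its vocabulary (unlike HashingTF),
# so feature index i can be mapped back to vocabulary[i].
# Assumes a DataFrame `df` with a text column "text" and a string label column "label".
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

tokenizer = RegexTokenizer(gaps=False, inputCol="text", outputCol="words",
                           pattern="[a-zA-Z_]+", toLowercase=True)
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)

model = Pipeline(stages=[tokenizer, cv, idf, indexer, rf]).fit(df)

# The CountVectorizerModel is the second fitted stage; its vocabulary gives the term for each index.
vocabulary = model.stages[1].vocabulary
importances = model.stages[-1].featureImportances  # SparseVector of per-feature importances

# Pair each non-zero importance with its term and sort descending.
named = sorted(((vocabulary[int(i)], importances[int(i)]) for i in importances.indices),
               key=lambda kv: -kv[1])
print(named[:20])
{code}

If HashingTF has to stay (e.g. for memory reasons), the hash is not invertible, so there is no general way to map an importance index back to a term without keeping the term-to-index mapping yourself.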