Praveen created SPARK-26738:
-------------------------------

Summary: Pyspark random forest classifier feature importance with column names
Key: SPARK-26738
URL: https://issues.apache.org/jira/browse/SPARK-26738
Project: Spark
Issue Type: Question
Components: ML
Affects Versions: 2.3.2
Environment: {code:java} {code}
Reporter: Praveen

I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark. The input X is sentences, and I am using TF-IDF (HashingTF + IDF) plus StringIndexer to generate the feature vectors. I have included all the stages in a Pipeline:

{code:java}
regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col, outputCol="words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature)
idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels)
feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100)
pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
model = pipeline.fit(df)
{code}

I generate the feature importances as

{code:java}
rdc = model.stages[-2]
print(rdc.featureImportances)
{code}

So far so good, but when I try to map the feature importances to the feature columns as below

{code:java}
attrs = sorted((attr["idx"], attr["name"]) for attr in chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))
[(name, rdc.featureImportances[idx]) for idx, name in attrs if dtModel_1.featureImportances[idx]]
{code}

I get a KeyError on ml_attr:

{code:java}
KeyError: 'ml_attr'
{code}

I then printed the metadata dictionary,

{code:java}
print(df_pred.schema["featurescol"].metadata)
{code}

and it is empty: {}

Any thoughts on what I am doing wrong? How can I map the feature importances to the column names?

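The lookup attempted above can be sketched in plain Python so that it runs without a Spark session. This is only an illustration under stated assumptions: the helper name `importances_by_name` and the sample metadata and scores are hypothetical, and the dictionary is assumed to follow the `ml_attr` / `attrs` layout that the snippet above indexes into. The only change from the original lookup is that a missing `ml_attr` entry yields an empty result instead of a KeyError.

```python
from itertools import chain


def importances_by_name(metadata, importances):
    """Pair feature names with their importances.

    `metadata` is assumed to be shaped like the features column metadata
    indexed in the question: {"ml_attr": {"attrs": {<type>: [{"idx", "name"}, ...]}}}.
    Returns an empty list when no attribute names are present.
    """
    # Use .get so an empty metadata dict does not raise KeyError.
    attr_groups = metadata.get("ml_attr", {}).get("attrs", {})
    # Flatten the per-type attribute lists and order them by vector index.
    attrs = sorted((a["idx"], a["name"]) for a in chain(*attr_groups.values()))
    # Keep only features with a non-zero importance, as in the original snippet.
    return [(name, importances[idx]) for idx, name in attrs if importances[idx]]


# Hypothetical metadata and scores, for illustration only.
meta = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
            "binary": [{"idx": 1, "name": "is_member"}],
        },
        "num_attrs": 3,
    }
}
scores = [0.5, 0.0, 0.25]

named = importances_by_name(meta, scores)
empty = importances_by_name({}, scores)  # mirrors the empty {} reported above
print(named)
print(empty)
```

With an empty metadata dictionary, as printed in the report, there are no names to attach, so the sketch returns an empty list. It does not recover names that the features column never carried.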
Thanks