
[jira] [Updated] (SPARK-26738) Pyspark random forest classifier feature importance with column names

2019-01-28 Thread Hyukjin Kwon (JIRA)



 [ 
https://issues.apache.org/jira/browse/SPARK-26738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26738:
-
Environment: (was: {code:java}
 {code})

> Pyspark random forest classifier feature importance with column names
> -
>
> Key: SPARK-26738
> URL: https://issues.apache.org/jira/browse/SPARK-26738
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.2
>Reporter: Praveen
>Priority: Major
>  Labels: RandomForest, pyspark
>
> I am trying to plot the feature importances of a random forest classifier 
> with column names. I am using Spark 2.3.2 and PySpark.
> The input X is sentences, and I am using TF-IDF (HashingTF + IDF) plus 
> StringIndexer to generate the feature vectors.
> I have included all the stages in a Pipeline:
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.feature import (HashingTF, IDF, IndexToString,
>                                 RegexTokenizer, StringIndexer)
>
> # Tokenize the raw sentences, then build TF-IDF features from hashed terms.
> regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col,
>                                 outputCol="words", pattern="[a-zA-Z_]+",
>                                 toLowercase=True,
>                                 minTokenLength=minimum_token_size)
> hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures",
>                       numFeatures=number_of_feature)
> idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
> # Index the string labels and map predictions back to the original labels.
> indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
> converter = IndexToString(inputCol='prediction', outputCol="original_label",
>                           labels=indexer.fit(df).labels)
> feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
> estimator = RandomForestClassifier(labelCol=label_col,
>                                    featuresCol=features_col, numTrees=100)
> pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
> model = pipeline.fit(df)
> {code}
> I generate the feature importances as follows:
> {code:java}
> # stages[-2] is the fitted random forest; stages[-1] is the IndexToString converter.
> rdc = model.stages[-2]
> print(rdc.featureImportances)
> {code}
> So far so good, but when I try to map the feature importances to the feature 
> columns as below
> {code:java}
> from itertools import chain
>
> attrs = sorted((attr["idx"], attr["name"]) for attr in
>                chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))
> [(name, rdc.featureImportances[idx])
>  for idx, name in attrs
>  if rdc.featureImportances[idx]]{code}
>  
> I get a KeyError on ml_attr:
> {code:java}
> KeyError: 'ml_attr'{code}
> I then printed the metadata dictionary,
> {code:java}
> print(df_pred.schema["featurescol"].metadata){code}
> and it is empty: {}
> Any thoughts on what I am doing wrong? How can I map the feature importances 
> to the column names?
> Thanks
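
A note on the KeyError above, offered as a sketch rather than a confirmed diagnosis: as far as I know, the ml_attr metadata only carries per-feature names when a stage such as VectorAssembler writes them, and hashed TF-IDF features have no column names to record. The helper below is hypothetical (name_importances and the feature_<idx> fallback labels are not part of PySpark). It reads the names defensively and falls back to index labels when the metadata is empty. It works on plain Python values, so the assumption is that you pass it df_pred.schema["featurescol"].metadata and rdc.featureImportances.toArray().

```python
from itertools import chain


def name_importances(metadata, importances):
    """Pair feature names with non-zero importances, largest first.

    metadata    -- dict such as df.schema[col].metadata (may be empty)
    importances -- sequence of floats, one per feature index
    """
    # .get() avoids the KeyError when "ml_attr" or "attrs" is absent.
    attrs_by_type = metadata.get("ml_attr", {}).get("attrs", {})
    names = {attr["idx"]: attr["name"]
             for attr in chain(*attrs_by_type.values())}
    # Hypothetical fallback label: "feature_<idx>" when no name is recorded.
    pairs = [(names.get(idx, "feature_%d" % idx), float(score))
             for idx, score in enumerate(importances)]
    # Drop unused features and sort by descending importance.
    return sorted((p for p in pairs if p[1] > 0),
                  key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    meta = {"ml_attr": {"attrs": {"numeric": [{"idx": 0, "name": "a"},
                                              {"idx": 1, "name": "b"}]},
                        "num_attrs": 2}}
    print(name_importances(meta, [0.25, 0.75]))  # names taken from metadata
    print(name_importances({}, [0.0, 1.0]))      # empty metadata: index labels
```

With empty metadata, as in the report, the result is still usable for plotting, but the labels are bucket indices rather than words.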



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-26738) Pyspark random forest classifier feature importance with column names

2019-01-26 Thread Praveen (JIRA)



 [ 
https://issues.apache.org/jira/browse/SPARK-26738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen updated SPARK-26738:

Description: 
I am trying to plot the feature importances of a random forest classifier 
with column names. I am using Spark 2.3.2 and PySpark.

The input X is sentences, and I am using TF-IDF (HashingTF + IDF) plus 
StringIndexer to generate the feature vectors.

I have included all the stages in a Pipeline:

 
{code:java}
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import (HashingTF, IDF, IndexToString,
                                RegexTokenizer, StringIndexer)

# Tokenize the raw sentences, then build TF-IDF features from hashed terms.
regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col,
                                outputCol="words", pattern="[a-zA-Z_]+",
                                toLowercase=True,
                                minTokenLength=minimum_token_size)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures",
                      numFeatures=number_of_feature)
idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
# Index the string labels and map predictions back to the original labels.
indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
converter = IndexToString(inputCol='prediction', outputCol="original_label",
                          labels=indexer.fit(df).labels)
feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
estimator = RandomForestClassifier(labelCol=label_col,
                                   featuresCol=features_col, numTrees=100)
pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
model = pipeline.fit(df)
{code}
I generate the feature importances as follows:
{code:java}
# stages[-2] is the fitted random forest; stages[-1] is the IndexToString converter.
rdc = model.stages[-2]
print(rdc.featureImportances)
{code}
So far so good, but when I try to map the feature importances to the feature 
columns as below
{code:java}
from itertools import chain

attrs = sorted((attr["idx"], attr["name"]) for attr in
               chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))

[(name, rdc.featureImportances[idx])
 for idx, name in attrs
 if rdc.featureImportances[idx]]{code}
 

I get a KeyError on ml_attr:
{code:java}
KeyError: 'ml_attr'{code}
I then printed the metadata dictionary,
{code:java}
print(df_pred.schema["featurescol"].metadata){code}
and it is empty: {}

Any thoughts on what I am doing wrong? How can I map the feature importances 
to the column names?

Thanks

  was:
I am trying to plot the feature importances of a random forest classifier 
with column names. I am using Spark 2.3.2 and PySpark.

The input X is sentences, and I am using TF-IDF (HashingTF + IDF) plus 
StringIndexer to generate the feature vectors.

I have included all the stages in a Pipeline:

 

 

{{regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, 
outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, 
minTokenLength=minimum_token_size) 

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", 
numFeatures=number_of_feature) 

idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) 

indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) 

converter = IndexToString(inputCol='prediction', outputCol="original_label", 
labels=indexer.fit(df).labels) 

feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) 

estimator = RandomForestClassifier(labelCol=label_col, 
featuresCol=features_col, numTrees=100) 

pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])

model = pipeline.fit(df)}}{{}}

 

 

I generate the feature importances as follows:

{code:java}
# stages[-2] is the fitted random forest; stages[-1] is the IndexToString converter.
rdc = model.stages[-2]
print(rdc.featureImportances)
{code}
So far so good, but when I try to map the feature importances to the feature 
columns as below
{code:java}
from itertools import chain

attrs = sorted((attr["idx"], attr["name"]) for attr in
               chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))

[(name, rdc.featureImportances[idx])
 for idx, name in attrs
 if rdc.featureImportances[idx]]{code}
 

I get a KeyError on ml_attr:
{code:java}
KeyError: 'ml_attr'{code}
I then printed the metadata dictionary,
{code:java}
print(df_pred.schema["featurescol"].metadata){code}
and it is empty: {}

Any thoughts on what I am doing wrong? How can I map the feature importances 
to the column names?

Thanks
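
On the HashingTF stage in the description: the hashing trick maps each token to a bucket index, so a bucket has no single column name, and several tokens can collide in one bucket. A rough way to label buckets is to rebuild a reverse index from the training tokens. The sketch below is an illustration under stated assumptions: bucket_to_terms is a hypothetical helper, and zlib.crc32 is a stand-in hash. Spark's HashingTF uses its own hash function (MurmurHash3, as far as I know), so real bucket indices would have to come from Spark itself.

```python
import zlib
from collections import defaultdict


def crc32_hash(token):
    """Stand-in hash; NOT the hash function Spark's HashingTF uses."""
    return zlib.crc32(token.encode("utf-8"))


def bucket_to_terms(tokens, num_features, hash_fn=crc32_hash):
    """Map each bucket index to the sorted distinct tokens hashed into it."""
    buckets = defaultdict(set)
    for token in tokens:
        # Same idea as the hashing trick: index = hash(token) mod numFeatures.
        buckets[hash_fn(token) % num_features].add(token)
    return {idx: sorted(terms) for idx, terms in buckets.items()}


if __name__ == "__main__":
    words = ["spark", "random", "forest", "feature", "importance"]
    for idx, terms in sorted(bucket_to_terms(words, 8).items()):
        # A bucket listing more than one term is a hash collision.
        print(idx, terms)
```

With a reverse index like this, rdc.featureImportances[idx] could be reported against the candidate terms for bucket idx. If exact term names are needed, CountVectorizer keeps a vocabulary on its fitted model, which may be a simpler route than hashing.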


