[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218279#comment-16218279 ]
Weichen Xu commented on SPARK-22277:
------------------------------------

[~cheburakshu]

Q1. If I use numFeatures=5 and numFeatures=3 and examine the selectedFeatures indices, the three features are not a subset of the five features.

Can you post the code that reproduces this, so I can help check what is wrong?

Q2. If I use VectorIndexer/StringIndexer+OneHotEncoder before using ChiSqSelector, the selectedFeatures indices of the model go out of bounds.

Why use VectorIndexer + OneHotEncoder? I think you can use "VectorIndexer + ChiSqSelector + DecisionTreeClassifier" or "VectorIndexer + DecisionTreeClassifier" here. I think both will run successfully. (If they still fail, there may really be a bug.)

> Chi Square selector garbling Vector content.
> --------------------------------------------
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.1.1
> Reporter: Cheburakshu
>
> There is a difference in behavior when the chi-square selector is used versus
> direct feature use in the decision tree classifier.
> In the code below, I have used the chi-square selector as a pass-through, but
> the decision tree classifier is unable to process its output. It is able to
> process the features when they are used directly.
> The example is pulled directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
>     (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
>     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
>     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the input
> # will be in the output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>                          outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> try:
>     # Fails
>     dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except:
>     print(sys.exc_info())
> # Works
> dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
> model = dt.fit(df)
>
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (<class 'pyspark.sql.utils.IllegalArgumentException'>,
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.',
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
> at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t
> at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t
> at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
> at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
> at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t
> at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
> at java.lang.reflect.Method.invoke(Method.java:498)\n\t
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t
> at py4j.Gateway.invoke(Gateway.java:280)\n\t
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t
> at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t
> at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t
> at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0A87D878>)
> +----------+-------+------------------+
> |prediction|clicked|          features|
> +----------+-------+------------------+
> |       1.0|    1.0|[0.0,0.0,18.0,1.0]|
> |       0.0|    0.0|[0.0,1.0,12.0,0.0]|
> |       0.0|    0.0|[1.0,0.0,15.0,0.1]|
> +----------+-------+------------------+
> Test Error = 0

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org