Hi Tobi,

The MLlib RDD-based API does support applying the transformation to both a single Vector and an RDD of Vectors, but it was not called the right way here. Suppose you have an RDD with a LabeledPoint in each row; you can use the following code snippet to train a ChiSqSelectorModel and run the transformation:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import ChiSqSelector

data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
        LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
        LabeledPoint(1.0, [0.0, 9.0, 8.0]),
        LabeledPoint(2.0, [8.0, 9.0, 5.0])]
rdd = sc.parallelize(data)

# Select the single most relevant feature according to the chi-squared test
model = ChiSqSelector(1).fit(rdd)

# Apply the selector to the whole RDD of feature vectors on the driver side,
# instead of calling model.transform inside a map closure
filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
filteredRDD.collect()

However, we strongly recommend you migrate to the DataFrame-based API, since the RDD-based API has been switched to maintenance mode (a rough sketch of the DataFrame-based version is at the bottom of this mail).

Thanks
Yanbo

2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:

> Hi everyone,
>
> I am trying to filter my features based on the spark.mllib ChiSqSelector.
>
> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
>     model.transform(lp.features)))
>
> However, when I do the following I get the error below. Is there any other
> way to filter my data to avoid this error?
>
> filteredDataDF = filteredData.toDF()
>
> Exception: It appears that you are attempting to reference SparkContext from
> a broadcast variable, action, or transformation. SparkContext can only be
> used on the driver, not in code that it run on workers. For more information,
> see SPARK-5063.
>
> I would directly use the spark.ml ChiSqSelector and work with DataFrames, but
> I am on Spark 1.4 and using PySpark, so spark.ml's ChiSqSelector is not
> available to me. filteredData is of type PipelinedRDD, if that helps; it is
> not a regular RDD. I think that may be part of why calling toDF() is not working.
>
> Thanks,
>
> Tobi
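For reference, here is the sketch mentioned above, showing roughly what the same selection looks like with the DataFrame-based API. It is only a minimal illustration, assuming Spark 2.0+ (where ChiSqSelector is available in pyspark.ml.feature), a SparkSession named spark as in the 2.x PySpark shell, and made-up column names ("label", "features", "selectedFeatures") that you would adjust to your own DataFrame:

from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

# Same toy data as above, but as a DataFrame with "label" and "features" columns
df = spark.createDataFrame([
    (0.0, Vectors.sparse(3, {0: 8.0, 1: 7.0})),
    (1.0, Vectors.sparse(3, {1: 9.0, 2: 6.0})),
    (1.0, Vectors.dense([0.0, 9.0, 8.0])),
    (2.0, Vectors.dense([8.0, 9.0, 5.0]))], ["label", "features"])

# Keep the single best feature by the chi-squared test
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         labelCol="label", outputCol="selectedFeatures")
model = selector.fit(df)
model.transform(df).show()

Here transform appends a "selectedFeatures" column while keeping the existing label column, so there is no need to rebuild LabeledPoints by hand.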