Hi Tobi,

Thanks for clarifying the question. It's very straightforward to convert the filtered RDD to a DataFrame; you can refer to the following code snippet:
from pyspark.sql import Row

rdd2 = filteredRDD.map(lambda v: Row(features=v))
df = rdd2.toDF()

Thanks
Yanbo

2016-07-16 14:51 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:

> Hi Yanbo,
>
> Appreciate the response. I might not have phrased this correctly, but I
> really wanted to know how to convert the pipelined RDD into a DataFrame.
> I have seen the example you posted. However, I need to transform all my
> data, not just one line. So I did successfully use map to apply the
> ChiSq selector and filter the chosen features of my data. I just want to
> convert the result to a DataFrame so I can apply a logistic regression
> model from spark.ml.
>
> Trust me, I would use the DataFrame API if I could, but the ChiSq
> functionality is not available to me in the Python Spark 1.4 API.
>
> Regards,
> Tobi
>
> On Jul 16, 2016 4:53 AM, "Yanbo Liang" <yblia...@gmail.com> wrote:
>
>> Hi Tobi,
>>
>> The MLlib RDD-based API does support applying a transformation to both
>> a Vector and an RDD, but you did not use it in the appropriate way.
>> Suppose you have an RDD with a LabeledPoint in each row; you can refer
>> to the following code snippet to train a ChiSqSelectorModel and apply
>> the transformation:
>>
>> from pyspark.mllib.linalg import SparseVector
>> from pyspark.mllib.regression import LabeledPoint
>> from pyspark.mllib.feature import ChiSqSelector
>>
>> data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
>>         LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
>>         LabeledPoint(1.0, [0.0, 9.0, 8.0]),
>>         LabeledPoint(2.0, [8.0, 9.0, 5.0])]
>>
>> rdd = sc.parallelize(data)
>> model = ChiSqSelector(1).fit(rdd)
>> filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
>> filteredRDD.collect()
>>
>> However, we strongly recommend you migrate to the DataFrame-based API,
>> since the RDD-based API has been switched to maintenance mode.
>>
>> Thanks
>> Yanbo
>>
>> 2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
>>
>>> Hi everyone,
>>>
>>> I am trying to filter my features based on the spark.mllib
>>> ChiSqSelector:
>>>
>>> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
>>>     model.transform(lp.features)))
>>>
>>> However, when I do the following I get the error below. Is there any
>>> other way to filter my data so as to avoid this error?
>>>
>>> filteredDataDF = filteredData.toDF()
>>>
>>> Exception: It appears that you are attempting to reference
>>> SparkContext from a broadcast variable, action, or transformation.
>>> SparkContext can only be used on the driver, not in code that it run
>>> on workers. For more information, see SPARK-5063.
>>>
>>> I would directly use the spark.ml ChiSqSelector and work with
>>> DataFrames, but I am on Spark 1.4 and using PySpark, so spark.ml's
>>> ChiSqSelector is not available to me. filteredData is of type
>>> PipelinedRDD, if that helps; it is not a regular RDD. I think that may
>>> be part of why calling toDF() is not working.
>>>
>>> Thanks,
>>> Tobi