Thanks Yanbo, will try that!

On Sun, Jul 17, 2016 at 10:26 PM, Yanbo Liang <yblia...@gmail.com> wrote:
> Hi Tobi,
>
> Thanks for clarifying the question. It's very straightforward to convert
> the filtered RDD to a DataFrame; you can refer to the following code
> snippet:
>
>     from pyspark.sql import Row
>
>     rdd2 = filteredRDD.map(lambda v: Row(features=v))
>     df = rdd2.toDF()
>
> Thanks
> Yanbo
>
> 2016-07-16 14:51 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
>
>> Hi Yanbo,
>>
>> Appreciate the response. I might not have phrased this correctly, but I
>> really wanted to know how to convert the pipeline RDD into a DataFrame.
>> I have seen the example you posted. However, I need to transform all my
>> data, not just one line. I did successfully use map with the
>> ChiSqSelector to filter the chosen features of my data. I just want to
>> convert the result to a DataFrame so I can apply a logistic regression
>> model from spark.ml.
>>
>> Trust me, I would use the DataFrame API if I could, but the ChiSqSelector
>> functionality is not available in the Python Spark 1.4 API.
>>
>> Regards,
>> Tobi
>>
>> On Jul 16, 2016 4:53 AM, "Yanbo Liang" <yblia...@gmail.com> wrote:
>>
>>> Hi Tobi,
>>>
>>> The MLlib RDD-based API does support applying the transformation to
>>> both a Vector and an RDD, but you did not use it the appropriate way.
>>> Suppose you have an RDD with a LabeledPoint in each line; you can refer
>>> to the following code snippet to train a ChiSqSelectorModel and do the
>>> transformation:
>>>
>>>     from pyspark.mllib.linalg import SparseVector
>>>     from pyspark.mllib.regression import LabeledPoint
>>>     from pyspark.mllib.feature import ChiSqSelector
>>>
>>>     data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
>>>             LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
>>>             LabeledPoint(1.0, [0.0, 9.0, 8.0]),
>>>             LabeledPoint(2.0, [8.0, 9.0, 5.0])]
>>>     rdd = sc.parallelize(data)
>>>     model = ChiSqSelector(1).fit(rdd)
>>>     filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
>>>     filteredRDD.collect()
>>>
>>> However, we strongly recommend you migrate to the DataFrame-based API,
>>> since the RDD-based API has been switched to maintenance mode.
>>>
>>> Thanks
>>> Yanbo
>>>
>>> 2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
>>>
>>>> Hi everyone,
>>>>
>>>> I am trying to filter my features with the spark.mllib ChiSqSelector:
>>>>
>>>>     filteredData = vectorizedTestPar.map(lambda lp:
>>>>         LabeledPoint(lp.label, model.transform(lp.features)))
>>>>
>>>> However, when I do the following I get the error below. Is there any
>>>> other way to filter my data to avoid this error?
>>>>
>>>>     filteredDataDF = filteredData.toDF()
>>>>
>>>>     Exception: It appears that you are attempting to reference
>>>>     SparkContext from a broadcast variable, action, or transformation.
>>>>     SparkContext can only be used on the driver, not in code that it
>>>>     run on workers. For more information, see SPARK-5063.
>>>>
>>>> I would directly use the spark.ml ChiSqSelector and work with
>>>> DataFrames, but I am on Spark 1.4 using pyspark, so spark.ml's
>>>> ChiSqSelector is not available to me. filteredData is of type
>>>> PipelinedRDD, not a regular RDD, if that helps; I think that may be
>>>> part of why calling toDF() is not working.
>>>>
>>>> Thanks,
>>>> Tobi