Hi Yan,

Yes, SQL is a good option, but if we have to create an ML Pipeline, then wrapping the filtering logic in transformers and setting them as pipeline stages would be the better option.
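As a rough sketch of what "transformer as a pipeline stage" could look like, assuming Spark 2.x or later: the class name AllowedValuesFilter, the column name EMOTION_INDEX and the sample rows are made up for illustration and are not part of any Spark API.

```scala
import org.apache.spark.ml.{Pipeline, Transformer}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical stage: keeps only rows whose `column` value is in `allowed`.
class AllowedValuesFilter(override val uid: String, column: String, allowed: Seq[String])
    extends Transformer {

  def this(column: String, allowed: Seq[String]) =
    this(Identifiable.randomUID("allowedValuesFilter"), column, allowed)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF().filter(col(column).isin(allowed: _*))

  // Filtering drops rows but leaves the columns unchanged.
  override def transformSchema(schema: StructType): StructType = schema

  // No Params are defined on this stage, so `extra` is ignored here.
  override def copy(extra: ParamMap): AllowedValuesFilter =
    new AllowedValuesFilter(uid, column, allowed)
}

object CategoryPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("category-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; "BORED" is not an allowed value.
    val df = Seq("HAPPY", "SAD", "BORED", "NA").toDF("EMOTION")

    val pipeline = new Pipeline().setStages(Array(
      new AllowedValuesFilter("EMOTION", Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")),
      new StringIndexer().setInputCol("EMOTION").setOutputCol("EMOTION_INDEX")
    ))

    // The filter runs first, so StringIndexer only ever sees allowed values.
    pipeline.fit(df).transform(df).show()
    spark.stop()
  }
}
```

Because the stage takes plain constructor arguments rather than Params, it cannot be saved with the pipeline as written; that trade-off keeps the sketch short.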
Regards
Pralabh Kumar

On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:

> To filter data, how about using SQL?
>
> df.createOrReplaceTempView("df")
> // string literals in the IN list must be quoted, otherwise they are
> // parsed as column names
> val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>
> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar <pralabhku...@gmail.com>
> wrote:
>
>> Hi Saatvik
>>
>> You can write your own transformer to make sure that the column contains
>> only the values you provided, and to filter out rows which don't follow
>> that rule.
>>
>> Something like this:
>>
>> import org.apache.spark.ml.Transformer
>> import org.apache.spark.ml.param.ParamMap
>> import org.apache.spark.sql.{DataFrame, Dataset}
>> import org.apache.spark.sql.types.StructType
>>
>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>   // Spark 2.x passes a Dataset[_] here, not a DataFrame
>>   override def transform(dataset: Dataset[_]): DataFrame = {
>>     dataset.select("col1").filter("col1 in ('happy')")
>>   }
>>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>>   // filtering rows does not change the schema
>>   override def transformSchema(schema: StructType): StructType = {
>>     schema
>>   }
>> }
>>
>> Usage:
>>
>> // toDF needs: import spark.implicits._
>> val data = sc.parallelize(List("abce", "happy")).toDF("col1")
>> val trans = new CategoryTransformer("1")
>> data.show()
>> trans.transform(data).show()
>>
>> This transformer makes sure that col1 only ever contains the values
>> you specified.
>>
>> Regards
>> Pralabh Kumar
>>
>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <saatvikshah1...@gmail.com>
>> wrote:
>>
>>> Hi Pralabh,
>>>
>>> I want the ability to create a column such that its values are restricted
>>> to a specific set of predefined values.
>>> For example, suppose I have a column called EMOTION: I want to ensure
>>> each row value is one of HAPPY, SAD, ANGRY, NEUTRAL, NA.
>>>
>>> Thanks and Regards,
>>> Saatvik Shah
>>>
>>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <pralabhku...@gmail.com>
>>> wrote:
>>>
>>>> Hi Saatvik
>>>>
>>>> Can you please provide an example of what exactly you want?
>>>>
>>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Yan,
>>>>>
>>>>> Basically the reason I was looking for the categorical datatype is as
>>>>> given here
>>>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>>>> ability to fix column values to specific categories. Is it possible to
>>>>> create a user defined data type which could do so?
>>>>>
>>>>> Thanks and Regards,
>>>>> Saatvik Shah
>>>>>
>>>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You can use some Transformers to handle categorical data.
>>>>>> For example,
>>>>>> StringIndexer encodes a string column of labels to a column of label
>>>>>> indices:
>>>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>>>>
>>>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>>>>> saatvikshah1...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I'm trying to convert a Pandas -> Spark dataframe. One of the
>>>>>>> columns I have is of the Category type in Pandas. But there does
>>>>>>> not seem to be support for this same type in Spark. What is the
>>>>>>> best alternative?
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-Spark-Dataframe-tp28764.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>
>>>>> --
>>>>> *Saatvik Shah,*
>>>>> *1st Year,*
>>>>> *Masters in the School of Computer Science,*
>>>>> *Carnegie Mellon University*
>>>>>
>>>>> *https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*