Hi Saatvik,

You can write your own Transformer to make sure the column contains only the values you provide, and filter out the rows which don't.
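At its core this is just a whitelist filter. A minimal plain-Scala sketch of the idea (no Spark needed to run it; the `allowed` set and sample values are made up for illustration):

```scala
// Keep only values that belong to a fixed set of allowed categories.
// On a Spark DataFrame the same idea is expressed with Column.isin, e.g.
//   df.filter(col("EMOTION").isin(allowed.toSeq: _*))
val allowed = Set("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")
val rows = List("HAPPY", "abce", "SAD", "xyz")
val kept = rows.filter(allowed.contains)
// kept == List("HAPPY", "SAD")
```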
Something like this:

import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

case class CategoryTransformer(override val uid: String) extends Transformer {

  override def transform(inputData: DataFrame): DataFrame = {
    // Keep only the rows whose col1 value is in the allowed set.
    // Filtering (rather than select("col1").filter(...)) preserves all
    // columns, which matches transformSchema returning the schema unchanged.
    inputData.filter("col1 in ('happy')")
  }

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)

  @DeveloperApi
  override def transformSchema(schema: StructType): StructType = schema
}

Usage:

val data = sc.parallelize(List("abce", "happy")).toDF("col1")
val trans = new CategoryTransformer("1")
data.show()
trans.transform(data).show()

This transformer will make sure you always have values in col1 as provided by you.

Regards
Pralabh Kumar

On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <saatvikshah1...@gmail.com> wrote:

> Hi Pralabh,
>
> I want the ability to create a column such that its values are restricted
> to a specific set of predefined values.
> For example, suppose I have a column called EMOTION: I want to ensure each
> row value is one of HAPPY, SAD, ANGRY, NEUTRAL, NA.
>
> Thanks and Regards,
> Saatvik Shah
>
> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <pralabhku...@gmail.com> wrote:
>
>> Hi Saatvik,
>>
>> Can you please provide an example of what exactly you want.
>>
>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com> wrote:
>>
>>> Hi Yan,
>>>
>>> Basically the reason I was looking for the categorical datatype is as
>>> given here
>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>> the ability to fix column values to specific categories. Is it possible
>>> to create a user-defined data type which could do so?
>>>
>>> Thanks and Regards,
>>> Saatvik Shah
>>>
>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:
>>>
>>>> You can use some Transformers to handle categorical data.
>>>> For example, StringIndexer encodes a string column of labels to a
>>>> column of label indices:
>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>>
>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <saatvikshah1...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns
>>>>> I have is of the Category type in Pandas. But there does not seem to
>>>>> be support for this same type in Spark. What is the best alternative?
>>>>>
>>>>> --
>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-Spark-Dataframe-tp28764.html
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>
>>>
>>> --
>>> Saatvik Shah,
>>> 1st Year,
>>> Masters in the School of Computer Science,
>>> Carnegie Mellon University
>>> https://saatvikshah1994.github.io/
>>
>
> --
> Saatvik Shah,
> 1st Year,
> Masters in the School of Computer Science,
> Carnegie Mellon University
> https://saatvikshah1994.github.io/