Hi Yan

Yes, SQL is a good option, but if we have to build an ML Pipeline, then
writing a transformer and setting it as a pipeline stage would be a better
option.
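
A rough sketch of what that could look like. To keep it short it assumes the
built-in SQLTransformer as the filtering stage instead of a hand-written
transformer, followed by a StringIndexer. The toy data, the object name, and the
EMOTION_INDEX output column are made up for illustration, so treat it as a
starting point rather than a finished implementation.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{SQLTransformer, StringIndexer}
import org.apache.spark.sql.SparkSession

object EmotionPipelineSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("emotion-pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumed); "CONFUSED" is outside the allowed set.
    val df = Seq("HAPPY", "SAD", "CONFUSED", "NA").toDF("EMOTION")

    // Stage 1: keep only rows whose EMOTION is an allowed category.
    // __THIS__ is SQLTransformer's placeholder for the input dataset.
    val keepAllowed = new SQLTransformer().setStatement(
      "SELECT * FROM __THIS__ " +
        "WHERE EMOTION IN ('HAPPY', 'SAD', 'ANGRY', 'NEUTRAL', 'NA')")

    // Stage 2: encode the remaining labels as numeric indices.
    val indexer = new StringIndexer()
      .setInputCol("EMOTION")
      .setOutputCol("EMOTION_INDEX")

    // Both stages run in order when the pipeline is fitted and applied.
    val pipeline = new Pipeline().setStages(Array(keepAllowed, indexer))
    val model = pipeline.fit(df)
    model.transform(df).show()

    spark.stop()
  }
}
```

The same filter expressed as a custom Transformer could replace the first stage
if more control is needed, for example to log or count the rejected rows.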

Regards
Pralabh Kumar

On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:

> To filter data, how about using sql?
>
> df.createOrReplaceTempView("df")
> val sqlDF = spark.sql(
>   "SELECT * FROM df WHERE EMOTION IN ('HAPPY', 'SAD', 'ANGRY', 'NEUTRAL', 'NA')")
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>
>
>
> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar <pralabhku...@gmail.com>
> wrote:
>
>> Hi Saatvik
>>
>> You can write your own transformer to make sure that the column contains
>> only the values you provided, and to filter out the rows that don't.
>>
>> Something like this
>>
>>
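>> import org.apache.spark.annotation.DeveloperApi
>> import org.apache.spark.ml.Transformer
>> import org.apache.spark.ml.param.ParamMap
>> import org.apache.spark.sql.{DataFrame, Dataset}
>> import org.apache.spark.sql.types.StructType
>>
>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>   // Spark 2.x signature (Dataset[_]). Keeps only rows whose col1 is allowed;
>>   // all columns are retained so the output matches transformSchema below.
>>   override def transform(dataset: Dataset[_]): DataFrame =
>>     dataset.toDF().filter("col1 in ('happy')")
>>
>>   // defaultCopy rebuilds the stage from its uid and copies any params.
>>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>>
>>   // Filtering rows does not change the schema.
>>   @DeveloperApi
>>   override def transformSchema(schema: StructType): StructType = schema
>> }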
>>
>>
>> Usage
>>
>> val data = sc.parallelize(List("abce","happy")).toDF("col1")
>> val trans = new CategoryTransformer("1")
>> data.show()
>> trans.transform(data).show()
>>
>>
>> This transformer makes sure that col1 only ever contains the values you
>> specified.
>>
>>
>> Regards
>> Pralabh Kumar
>>
>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <saatvikshah1...@gmail.com>
>> wrote:
>>
>>> Hi Pralabh,
>>>
>>> I want the ability to create a column whose values are restricted
>>> to a specific set of predefined values.
>>> For example, suppose I have a column called EMOTION: I want to ensure
>>> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>>>
>>> Thanks and Regards,
>>> Saatvik Shah
>>>
>>>
>>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <pralabhku...@gmail.com>
>>> wrote:
>>>
>>>> Hi Saatvik
>>>>
>>>> Can you please provide an example of what exactly you want?
>>>>
>>>>
>>>>
>>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Yan,
>>>>>
>>>>> Basically the reason I was looking for the categorical datatype is as
>>>>> given here
>>>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>>>> ability to fix column values to specific categories. Is it possible to
>>>>> create a user defined data type which could do so?
>>>>>
>>>>> Thanks and Regards,
>>>>> Saatvik Shah
>>>>>
>>>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You can use Transformers to handle categorical data.
>>>>>> For example,
>>>>>> StringIndexer encodes a string column of labels to a column of label
>>>>>> indices:
>>>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>>>>> saatvikshah1...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I'm trying to convert a Pandas dataframe to a Spark dataframe. One of
>>>>>>> the columns I have is of the Category type in Pandas, but there does
>>>>>>> not seem to be support for this type in Spark. What is the best
>>>>>>> alternative?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context: http://apache-spark-user-list.
>>>>>>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>>>>>>> Spark-Dataframe-tp28764.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Saatvik Shah,*
>>> *1st  Year,*
>>> *Masters in the School of Computer Science,*
>>> *Carnegie Mellon University*
>>>
>>> *https://saatvikshah1994.github.io/ <https://saatvikshah1994.github.io/>*
>>>
>>
>>
>
