Re: Best alternative for Category Type in Spark Dataframe

Saatvik Shah Sat, 17 Jun 2017 21:36:07 -0700

Thanks guys,

You've all given me a number of options to work with.


The thing is that I'm working in a production environment where it might be
necessary to ensure that no one erroneously inserts new records in those
specific columns which should be the Category data type. The best
alternative there would be to have a Category-like dataframe column
datatype, without the additional overhead of running a transformer. Is that
possible?
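
A check of this kind can be sketched without a Transformer. This is only a
sketch, assuming Spark 2.x, a local SparkSession, and the EMOTION column from
earlier in the thread; requireCategories is a hypothetical helper name, not a
Spark API. It validates a DataFrame at one point in time (for example, just
before a write) and is not a column-level type constraint:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import scala.util.Try

object CategoryCheck {
  // Hypothetical helper: throws IllegalArgumentException if `column`
  // holds a non-null value outside `allowed`; nulls pass through unchecked.
  def requireCategories(df: DataFrame, column: String, allowed: Seq[String]): DataFrame = {
    val bad = df.filter(!col(column).isin(allowed: _*))
      .select(column).distinct().limit(5).collect().map(_.getString(0))
    require(bad.isEmpty, s"Unexpected values in $column: ${bad.mkString(", ")}")
    df
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("category-check").getOrCreate()
    import spark.implicits._
    val allowed = Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")

    // All values are in the allowed set, so the DataFrame is returned as-is.
    requireCategories(Seq("HAPPY", "NA").toDF("EMOTION"), "EMOTION", allowed)

    // BORED is not allowed, so the check fails.
    val result = Try(requireCategories(Seq("HAPPY", "BORED").toDF("EMOTION"), "EMOTION", allowed))
    println(result.isFailure)
    spark.stop()
  }
}
```

The helper returns the DataFrame unchanged, so it could be chained in front of
a write call; whether that overhead is acceptable in production is a separate
question.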

Thanks and Regards,
Saatvik

On Sat, Jun 17, 2017 at 11:15 PM, Pralabh Kumar <pralabhku...@gmail.com>
wrote:

> Makes sense :)
>
> On Sun, Jun 18, 2017 at 8:38 AM, 颜发才(Yan Facai) <facai....@gmail.com>
> wrote:
>
>> Yes, perhaps we could use SQLTransformer as well.
>>
>> http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
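>>
>> A minimal sketch of that idea, assuming Spark 2.x and the EMOTION column
>> from this thread; the object and value names are hypothetical.
>> SQLTransformer substitutes the input dataset for the __THIS__ placeholder,
>> so the same filter can also be set as a Pipeline stage:
>>
>> ```scala
>> import org.apache.spark.ml.feature.SQLTransformer
>> import org.apache.spark.sql.{DataFrame, SparkSession}
>>
>> object EmotionFilter {
>>   // Keeps only rows whose EMOTION is one of the five allowed values.
>>   val stage: SQLTransformer = new SQLTransformer().setStatement(
>>     "SELECT * FROM __THIS__ WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>>
>>   def main(args: Array[String]): Unit = {
>>     val spark = SparkSession.builder().master("local[1]").appName("emotion-filter").getOrCreate()
>>     import spark.implicits._
>>     val df = Seq("HAPPY", "BORED", "NA").toDF("EMOTION")
>>     val filtered: DataFrame = stage.transform(df)
>>     filtered.show() // the BORED row is dropped
>>     spark.stop()
>>   }
>> }
>> ```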
>>
>> On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar <pralabhku...@gmail.com>
>> wrote:
>>
>>> Hi Yan
>>>
>>> Yes, SQL is a good option, but if we have to create an ML Pipeline, then
>>> having transformers and setting them as pipeline stages would be a better
>>> option.
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>> On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai....@gmail.com>
>>> wrote:
>>>
>>>> To filter data, how about using SQL?
>>>>
>>>> df.createOrReplaceTempView("df")
>>>> val sqlDF = spark.sql(
>>>>   "SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>>>>
>>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>>>>
>>>>
>>>>
>>>> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar <pralabhku...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Saatvik
>>>>>
>>>>> You can write your own transformer to make sure that the column contains
>>>>> only the values you provide, and to filter out the rows that don't.
>>>>>
>>>>> Something like this
>>>>>
>>>>>
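>>>>> // Assumes Spark 2.x, where transform takes a Dataset[_].
>>>>> import org.apache.spark.annotation.DeveloperApi
>>>>> import org.apache.spark.ml.Transformer
>>>>> import org.apache.spark.ml.param.ParamMap
>>>>> import org.apache.spark.sql.{DataFrame, Dataset}
>>>>> import org.apache.spark.sql.types.StructType
>>>>>
>>>>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>>>>   // Select col1 and keep only the rows whose value is in the allowed set.
>>>>>   override def transform(inputData: Dataset[_]): DataFrame = {
>>>>>     inputData.select("col1").filter("col1 in ('happy')")
>>>>>   }
>>>>>   // defaultCopy re-creates the transformer through its uid constructor.
>>>>>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>>>>>   @DeveloperApi
>>>>>   override def transformSchema(schema: StructType): StructType = schema
>>>>> }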
>>>>>
>>>>>
>>>>> Usage
>>>>>
>>>>> val data = sc.parallelize(List("abce","happy")).toDF("col1")
>>>>> val trans = new CategoryTransformer("1")
>>>>> data.show()
>>>>> trans.transform(data).show()
>>>>>
>>>>>
>>>>> This transformer makes sure that col1 only ever contains the values you
>>>>> provided.
>>>>>
>>>>>
>>>>> Regards
>>>>> Pralabh Kumar
>>>>>
>>>>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <
>>>>> saatvikshah1...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pralabh,
>>>>>>
>>>>>> I want the ability to create a column such that its values be
>>>>>> restricted to a specific set of predefined values.
>>>>>> For example, suppose I have a column called EMOTION: I want to ensure
>>>>>> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Saatvik Shah
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <
>>>>>> pralabhku...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Saatvik
>>>>>>>
>>>>>>> Can you please provide an example of exactly what you want?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Yan,
>>>>>>>>
>>>>>>>> Basically the reason I was looking for the categorical datatype is
>>>>>>>> as given here
>>>>>>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>>>>>>> ability to fix column values to specific categories. Is it possible to
>>>>>>>> create a user defined data type which could do so?
>>>>>>>>
>>>>>>>> Thanks and Regards,
>>>>>>>> Saatvik Shah
>>>>>>>>
>>>>>>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <
>>>>>>>> facai....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> You can use some Transformers to handle categorical data,
>>>>>>>>> For example,
>>>>>>>>> StringIndexer encodes a string column of labels to a column of
>>>>>>>>> label indices:
>>>>>>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
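>>>>>>>>>
>>>>>>>>> A minimal sketch of StringIndexer on a hypothetical EMOTION column,
>>>>>>>>> assuming Spark 2.x; the object name, method name, and EMOTION_INDEX
>>>>>>>>> output column are made up for illustration:
>>>>>>>>>
>>>>>>>>> ```scala
>>>>>>>>> import org.apache.spark.ml.feature.StringIndexer
>>>>>>>>> import org.apache.spark.sql.{DataFrame, SparkSession}
>>>>>>>>>
>>>>>>>>> object IndexEmotion {
>>>>>>>>>   // fit learns the label-to-index mapping; by default the most
>>>>>>>>>   // frequent label gets index 0.0.
>>>>>>>>>   def indexEmotion(df: DataFrame): DataFrame =
>>>>>>>>>     new StringIndexer()
>>>>>>>>>       .setInputCol("EMOTION")
>>>>>>>>>       .setOutputCol("EMOTION_INDEX")
>>>>>>>>>       .fit(df)
>>>>>>>>>       .transform(df)
>>>>>>>>>
>>>>>>>>>   def main(args: Array[String]): Unit = {
>>>>>>>>>     val spark = SparkSession.builder().master("local[1]").appName("index-emotion").getOrCreate()
>>>>>>>>>     import spark.implicits._
>>>>>>>>>     val df = Seq("HAPPY", "SAD", "HAPPY", "NA").toDF("EMOTION")
>>>>>>>>>     indexEmotion(df).show()
>>>>>>>>>     spark.stop()
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> With the default handleInvalid setting ("error"), a fitted model
>>>>>>>>> raises an exception on labels it did not see during fit, which
>>>>>>>>> gives a rough category check at transform time.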
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>>>>>>>> saatvikshah1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I'm trying to convert a Pandas -> Spark dataframe. One of the
>>>>>>>>>> columns I have
>>>>>>>>>> is of the Category type in Pandas. But there does not seem to be
>>>>>>>>>> support for
>>>>>>>>>> this same type in Spark. What is the best alternative?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Saatvik Shah,*
>>>>>>>> *1st  Year,*
>>>>>>>> *Masters in the School of Computer Science,*
>>>>>>>> *Carnegie Mellon University*
>>>>>>>>
>>>>>>>> *https://saatvikshah1994.github.io/
>>>>>>>> <https://saatvikshah1994.github.io/>*
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

