[ 
https://issues.apache.org/jira/browse/SPARK-11003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949234#comment-14949234
 ] 

John Muller commented on SPARK-11003:
-------------------------------------

Linking to other UDT enhancements.

> Allowing UserDefinedTypes to extend primitives
> ----------------------------------------------
>
>                 Key: SPARK-11003
>                 URL: https://issues.apache.org/jira/browse/SPARK-11003
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: John Muller
>            Priority: Minor
>              Labels: DataType, UDT
>
> Currently, the classes and constructors of all the primitive DataTypes (of
> StructFields) are private:
> https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types
> This means that even for simple String-based UDTs, users will always have to
> implement serialize() and deserialize().  UDTs for something as simple as a
> Northwind database (products, orders, customers) would be very useful for
> pattern matching / validation.  For example:
> import org.apache.spark.sql.types._
> @SQLUserDefinedType(udt = classOf[ProductNameUDT])
> case class ProductName(name: String) extends StringType with Validator {
>   import scala.util.matching.Regex
>   // Must be a Regex (note the .r) to be usable as an extractor below.
>   private val pattern: Regex = """[A-Z][A-Za-z]*""".r
>   def validate(): Boolean = {
>     name match {
>       case pattern(_*) => true
>       case _ => false
>     }
>   }
> }
> class ProductNameUDT extends UserDefinedType[ProductName] {
>   // No need for this; ProductName is a StringType so we know how to
>   // serialize it.
>   override def serialize(p: Any): Any = {
>     p match {
>       case p: ProductName => Seq(p.name)
>     }
>   }
>
>   // Not sure why this override is needed at all; can't we always get this
>   // simply by the UDT type param?
>   override def userClass: Class[ProductName] = classOf[ProductName]
>
>   // Instead of the below, just infer the StructField name via reflection of
>   // the wrapper class' name.
>   override def sqlType: DataType =
>     StructType(Seq(StructField("ProductName", StringType)))
>
>   // Still needed.
>   override def deserialize(datum: Any): ProductName = {
>     datum match {
>       case values: Seq[_] =>
>         assert(values.length == 1)
>         ProductName(values.head.asInstanceOf[String])
>     }
>   }
> }
> This would simplify the process of creating "primitive extension" UDTs down
> to just 2 steps:
> 1. An annotated case class that extends a primitive DataType
> 2. The UDT itself, which just needs a deserializer
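
For illustration only, the validation and round-trip ideas in the description can be sketched in plain Scala with no Spark dependency. This is not the proposed Spark API: the Validator trait and the ProductNameCodec object below are hypothetical stand-ins, since the issue does not define Validator and the proposal itself (a case class extending StringType) does not compile against Spark 1.5.x.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for the Validator trait mentioned in the issue.
trait Validator {
  def validate(): Boolean
}

final case class ProductName(name: String) extends Validator {
  // A Regex (note the .r) is required to use the pattern as an extractor.
  private val pattern: Regex = """[A-Z][A-Za-z]*""".r

  // A Regex extractor only matches when the whole input matches the pattern.
  def validate(): Boolean = name match {
    case pattern(_*) => true
    case _           => false
  }
}

// Hypothetical, Spark-free stand-in for a UDT's serialize/deserialize pair,
// storing the value as a bare String rather than a one-field struct.
object ProductNameCodec {
  def serialize(p: ProductName): String = p.name

  def deserialize(datum: Any): ProductName = datum match {
    case s: String => ProductName(s)
    case other =>
      throw new IllegalArgumentException(s"Cannot deserialize: $other")
  }
}
```

Under these assumptions, ProductName("Chai").validate() holds while ProductName("chai").validate() does not, and deserialize(serialize(p)) returns a value equal to p. Note that the pattern from the issue also rejects names containing spaces.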



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
