[ https://issues.apache.org/jira/browse/SPARK-11003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949234#comment-14949234 ]
John Muller commented on SPARK-11003:
-------------------------------------

Linking to other UDT enhancements

> Allowing UserDefinedTypes to extend primitives
> ----------------------------------------------
>
>                 Key: SPARK-11003
>                 URL: https://issues.apache.org/jira/browse/SPARK-11003
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: John Muller
>            Priority: Minor
>              Labels: DataType, UDT
>
> Currently, the classes and constructors of all the primitive DataTypes (of
> StructFields) are private:
> https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types
> This means that even for simple String-based UDTs, users always have to
> implement serialize() and deserialize(). UDTs for something as simple as a
> Northwind database (products, orders, customers) would be very useful for
> pattern matching / validation. For example:
>
> import org.apache.spark.sql.types._
>
> @SQLUserDefinedType(udt = classOf[ProductNameUDT])
> case class ProductName(name: String) extends StringType with Validator {
>   import scala.util.matching.Regex
>   // .r compiles the string into a Regex so it can be used as an extractor
>   private val pattern: Regex = """[A-Z][A-Za-z]*""".r
>   def validate(): Boolean = {
>     name match {
>       case pattern(_*) => true
>       case _ => false
>     }
>   }
> }
>
> class ProductNameUDT extends UserDefinedType[ProductName] {
>   // No need for this; ProductName is a StringType so we know how to
>   // serialize it
>   override def serialize(p: Any): Any = {
>     p match {
>       case p: ProductName => Seq(p.name)
>     }
>   }
>
>   // Not sure why this override is needed at all; can't we always get this
>   // simply from the UDT type param?
>   override def userClass: Class[ProductName] = classOf[ProductName]
>
>   // Instead of the below, just infer the StructField name via reflection of
>   // the wrapper class' name
>   override def sqlType: DataType =
>     StructType(Seq(StructField("ProductName", StringType)))
>
>   // Still needed.
>   override def deserialize(datum: Any): ProductName = {
>     datum match {
>       case values: Seq[_] =>
>         assert(values.length == 1)
>         ProductName(values.head.asInstanceOf[String])
>     }
>   }
> }
>
> This would simplify the process of creating "primitive extension" UDTs down
> to just 2 steps:
> 1. An annotated case class that extends a primitive DataType
> 2. The UDT itself, which just needs a deserializer

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
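
For reference, the validation and round-trip logic from the quoted proposal can be tried without Spark. The sketch below is a minimal, Spark-free approximation and not the proposed API: the Validator trait is a hypothetical stand-in (the issue does not define it), ProductNameCodec is an illustrative name, and ProductName is a plain case class because extending StringType is exactly what is not possible today.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for the Validator trait mentioned in the issue.
trait Validator {
  def validate(): Boolean
}

// Plain case class; it does NOT extend any Spark DataType.
final case class ProductName(name: String) extends Validator {
  // A Regex extractor in a pattern match requires the whole string to match.
  def validate(): Boolean = name match {
    case ProductName.Pattern(_*) => true
    case _                       => false
  }
}

object ProductName {
  // `.r` turns the string literal into a scala.util.matching.Regex.
  val Pattern: Regex = """[A-Z][A-Za-z]*""".r
}

// Spark-free mirror of the serialize/deserialize pair shown in the issue.
object ProductNameCodec {
  def serialize(p: ProductName): Seq[Any] = Seq(p.name)

  def deserialize(datum: Any): ProductName = datum match {
    case values: Seq[_] if values.length == 1 =>
      ProductName(values.head.asInstanceOf[String])
    case other =>
      throw new IllegalArgumentException(s"Unexpected datum: $other")
  }
}
```

Under these assumptions, ProductName("Chai").validate() accepts the name, while a lowercase or multi-word name is rejected because the pattern must cover the entire string. A real UDT would still need sqlType and userClass as in the quoted code.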