John Muller created SPARK-11003:
-----------------------------------

             Summary: Allowing UserDefinedTypes to extend primitives
                 Key: SPARK-11003
                 URL: https://issues.apache.org/jira/browse/SPARK-11003
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.5.1, 1.5.0
            Reporter: John Muller
            Priority: Minor


Currently, the classes and constructors of all the primitive DataTypes (of 
StructFields) are private:

https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types

This means that even for simple String-based UDTs, users always have to 
implement serialize() and deserialize().  UDTs for something as simple as a 
Northwind database (products, orders, customers) would be very useful for 
pattern matching / validation.  For example:

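import org.apache.spark.sql.types._
import scala.util.matching.Regex

// Proposed syntax: extending StringType does not compile today because its
// constructor is private.
@SQLUserDefinedType(udt = classOf[ProductNameUDT])
case class ProductName(name: String) extends StringType with Validator {
  // Compile the pattern to a Regex so it can be used as an extractor
  private val pattern: Regex = """[A-Z][A-Za-z]*""".r
  def validate(): Boolean = {
    name match {
      case pattern(_*) => true
      case _ => false
    }
  }
}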

class ProductNameUDT extends UserDefinedType[ProductName] {
  // No need for this; ProductName is a StringType so we know how to serialize
  override def serialize(p: Any): Any = {
    p match {
      case p: ProductName => Seq(p.name)
    }
  }
  
  // Not sure why this override is needed at all; can't we always get this
  // simply by the UDT type param?
  override def userClass: Class[ProductName] = classOf[ProductName]

  // Instead of the below, just infer the StructField name via reflection of
  // the wrapper class' name
  override def sqlType: DataType =
    StructType(Seq(StructField("ProductName", StringType)))

  // Still needed.
  override def deserialize(datum: Any): ProductName = {
    datum match {
      case values: Seq[_] =>
        assert(values.length == 1)
        ProductName(values.head.asInstanceOf[String])
    }
  }
}

This would simplify the process of creating "primitive extension" UDTs down to 
just 2 steps:
1. An annotated case class that extends a primitive DataType
2. A UDT that only needs to implement a deserializer
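
As a rough illustration of the two steps above, here is a minimal sketch in 
plain Scala with no Spark dependency.  Everything in it is hypothetical and 
not part of Spark: the Validator and StringBacked traits and the 
StringBackedUDT base class are stand-ins invented for this sketch.  It only 
shows that userClass can be derived from the type parameter (via a ClassTag), 
that the field name can be inferred from the wrapper class' name, and that a 
default serializer leaves deserialize as the only method to write.

```scala
import scala.reflect.ClassTag
import scala.util.matching.Regex

// Hypothetical stand-in for the Validator trait used in the example.
trait Validator {
  def validate(): Boolean
}

// Hypothetical marker for wrappers around a single String value.
trait StringBacked {
  def value: String
}

// Step 1: the wrapper case class (no Spark annotation in this sketch).
final case class ProductName(name: String) extends StringBacked with Validator {
  private val pattern: Regex = """[A-Z][A-Za-z]*""".r

  def value: String = name

  // A Regex extractor only matches when the whole string matches.
  def validate(): Boolean = name match {
    case pattern(_*) => true
    case _           => false
  }
}

// Hypothetical base class, not a Spark API: supplies the parts the issue
// says could be derived, leaving only deserialize abstract.
abstract class StringBackedUDT[T <: StringBacked](implicit tag: ClassTag[T]) {
  // Derived from the type parameter instead of a hand-written override.
  def userClass: Class[T] = tag.runtimeClass.asInstanceOf[Class[T]]

  // Field name inferred from the wrapper class' simple name.
  def fieldName: String = userClass.getSimpleName

  // Default serializer: a one-element sequence holding the wrapped string.
  def serialize(obj: T): Seq[Any] = Seq(obj.value)

  def deserialize(datum: Any): T
}

// Step 2: the UDT itself only implements the deserializer.
object ProductNameUDT extends StringBackedUDT[ProductName] {
  override def deserialize(datum: Any): ProductName = datum match {
    case Seq(name: String) => ProductName(name)
    case other =>
      throw new IllegalArgumentException(s"Unexpected datum: $other")
  }
}

object UdtSketch {
  def main(args: Array[String]): Unit = {
    val chai = ProductName("Chai")
    val roundTripped = ProductNameUDT.deserialize(ProductNameUDT.serialize(chai))
    println(chai.validate())                  // true
    println(ProductName("chai").validate())   // false: no leading capital
    println(roundTripped == chai)             // true
  }
}
```

Whether Spark could adopt something like this depends on how UserDefinedType 
is used internally, so the sketch should be read as a shape for the proposal 
rather than a drop-in change.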


