Hi Joseph,

Thank you for your feedback. I've managed to define an image type by following the VectorUDT implementation.
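In case it helps anyone else who finds this thread, here is roughly what the two overrides inside ByteImageUDT come down to. This is a simplified sketch of my code, assuming the Spark 1.2 locations of Row and GenericMutableRow; the field order matches the sqlType in my first mail quoted below.

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

// Copy the four fields of a ByteImage into a SQL Row, in sqlType order.
override def serialize(obj: Any): Row = {
  val img = obj.asInstanceOf[ByteImage]
  val row = new GenericMutableRow(4)
  row.setInt(0, img.channels)
  row.setInt(1, img.width)
  row.setInt(2, img.height)
  row.update(3, img.data) // the BinaryType column holds the raw pixel bytes
  row
}

// Rebuild a ByteImage from the Row produced by serialize.
override def deserialize(datum: Any): ByteImage = datum match {
  case row: Row =>
    require(row.length == 4, s"ByteImageUDT.deserialize given row with length ${row.length}")
    new ByteImage(row.getInt(0), row.getInt(1), row.getInt(2), row.getAs[Array[Byte]](3))
}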
I have another question, about defining a custom transformer: UnaryTransformer is private to spark.ml. Do you plan to provide a developer API for transformers? (To make the question concrete, I've put a sketch of the kind of transformer I'd like to write at the bottom of this mail.)

On Sun, Jan 25, 2015 at 2:26 AM, Joseph Bradley <jos...@databricks.com> wrote:

> Hi Jao,
>
> You're right that defining serialize and deserialize is the main task in
> implementing a UDT. They are basically translating between your native
> representation (ByteImage) and SQL DataTypes. The sqlType you defined
> looks correct, and you're correct to use a row of length 4. Other than
> that, it should just require copying data to and from SQL Rows. There are
> quite a few examples of that in the codebase; I'd recommend searching based
> on the particular DataTypes you're using.
>
> Are there particular issues you're running into?
>
> Joseph
>
> On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm trying to implement a pipeline for computer vision based on the
>> latest ML package in Spark. The first step of my pipeline is to decode
>> images (jpeg, for instance) stored in a parquet file.
>> For this, I began by creating a UserDefinedType that represents a decoded
>> image stored in an array of bytes. Here is my first attempt:
>>
>> @SQLUserDefinedType(udt = classOf[ByteImageUDT])
>> class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])
>>
>> private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {
>>
>>   override def sqlType: StructType = {
>>     // One field per image attribute: channels, width, height, raw bytes.
>>     StructType(Seq(
>>       StructField("channels", IntegerType, nullable = false),
>>       StructField("width", IntegerType, nullable = false),
>>       StructField("height", IntegerType, nullable = false),
>>       StructField("data", BinaryType, nullable = false)))
>>   }
>>
>>   override def serialize(obj: Any): Row = {
>>     val row = new GenericMutableRow(4)
>>     val img = obj.asInstanceOf[ByteImage]
>>     ...
>>   }
>>
>>   override def deserialize(datum: Any): ByteImage = {
>>     ...
>>   }
>>
>>   override def userClass: Class[ByteImage] = classOf[ByteImage]
>> }
>>
>> I took the VectorUDT as a starting point, but there are a lot of things
>> that I don't really understand. So any help on defining the serialize and
>> deserialize methods would be appreciated.
>>
>> Best Regards,
>>
>> Jao
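P.S. To make the transformer question concrete, here is a sketch of what I would like to be able to write. ImageDecoder and the decodeJpeg helper are hypothetical names of mine, the UnaryTransformer signatures are the ones in the Spark 1.2 sources, and putting the class under the org.apache.spark.ml package is only a workaround for the private[ml] restriction, which is exactly what I would like to avoid.

package org.apache.spark.ml.image

import java.awt.image.DataBufferByte
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataType // Spark 1.2 location; moved to org.apache.spark.sql.types in 1.3

// ByteImage and ByteImageUDT are the classes from my first mail.
// Decodes a column of compressed image bytes (e.g. jpeg stored in parquet)
// into ByteImage values.
class ImageDecoder extends UnaryTransformer[Array[Byte], ByteImage, ImageDecoder] {

  override protected def createTransformFunc(paramMap: ParamMap): Array[Byte] => ByteImage =
    decodeJpeg

  override protected def outputDataType: DataType = new ByteImageUDT

  // Placeholder decoder based on javax.imageio; assumes a byte-backed raster.
  private def decodeJpeg(bytes: Array[Byte]): ByteImage = {
    val img = ImageIO.read(new ByteArrayInputStream(bytes))
    val pixels = img.getRaster.getDataBuffer.asInstanceOf[DataBufferByte].getData
    new ByteImage(img.getRaster.getNumBands, img.getWidth, img.getHeight, pixels)
  }
}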