Hi Joseph,

Thank you for your feedback. I've managed to define an image type by following the VectorUDT implementation.
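In case it helps anyone else who finds this thread, here is roughly what the two overrides inside ByteImageUDT come down to. This is a simplified sketch of my code, assuming the Spark 1.2 locations of Row and GenericMutableRow; the field order matches the sqlType in my first mail quoted below.

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

// Copy the four fields of a ByteImage into a SQL Row, in sqlType order.
override def serialize(obj: Any): Row = {
  val img = obj.asInstanceOf[ByteImage]
  val row = new GenericMutableRow(4)
  row.setInt(0, img.channels)
  row.setInt(1, img.width)
  row.setInt(2, img.height)
  row.update(3, img.data) // the BinaryType column holds the raw pixel bytes
  row
}

// Rebuild a ByteImage from the Row produced by serialize.
override def deserialize(datum: Any): ByteImage = datum match {
  case row: Row =>
    require(row.length == 4, s"ByteImageUDT.deserialize given row with length ${row.length}")
    new ByteImage(row.getInt(0), row.getInt(1), row.getInt(2), row.getAs[Array[Byte]](3))
}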
I have another question, about defining a custom transformer: UnaryTransformer is private to spark.ml. Do you plan to provide a developer API for transformers? (To make the question concrete, I've put a sketch of the kind of transformer I'd like to write at the bottom of this mail.)

On Sun, Jan 25, 2015 at 2:26 AM, Joseph Bradley <jos...@databricks.com> wrote:

> Hi Jao,
>
> You're right that defining serialize and deserialize is the main task in
> implementing a UDT. They are basically translating between your native
> representation (ByteImage) and SQL DataTypes. The sqlType you defined
> looks correct, and you're correct to use a row of length 4. Other than
> that, it should just require copying data to and from SQL Rows. There are
> quite a few examples of that in the codebase; I'd recommend searching based
> on the particular DataTypes you're using.
>
> Are there particular issues you're running into?
>
> Joseph
>
> On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm trying to implement a pipeline for computer vision based on the
>> latest ML package in Spark. The first step of my pipeline is to decode
>> images (jpeg, for instance) stored in a parquet file.
>> For this, I began by creating a UserDefinedType that represents a decoded
>> image stored in an array of bytes. Here is my first attempt:
>>
>> @SQLUserDefinedType(udt = classOf[ByteImageUDT])
>> class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])
>>
>> private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {
>>
>>   override def sqlType: StructType = {
>>     // One field per image attribute: channels, width, height, raw bytes.
>>     StructType(Seq(
>>       StructField("channels", IntegerType, nullable = false),
>>       StructField("width", IntegerType, nullable = false),
>>       StructField("height", IntegerType, nullable = false),
>>       StructField("data", BinaryType, nullable = false)))
>>   }
>>
>>   override def serialize(obj: Any): Row = {
>>     val row = new GenericMutableRow(4)
>>     val img = obj.asInstanceOf[ByteImage]
>>     ...
>>   }
>>
>>   override def deserialize(datum: Any): ByteImage = {
>>     ...
>>   }
>>
>>   override def userClass: Class[ByteImage] = classOf[ByteImage]
>> }
>>
>> I took the VectorUDT as a starting point, but there are a lot of things
>> that I don't really understand. So any help on defining the serialize and
>> deserialize methods would be appreciated.
>>
>> Best Regards,
>>
>> Jao
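P.S. To make the transformer question concrete, here is a sketch of what I would like to be able to write. ImageDecoder and the decodeJpeg helper are hypothetical names of mine, the UnaryTransformer signatures are the ones in the Spark 1.2 sources, and putting the class under the org.apache.spark.ml package is only a workaround for the private[ml] restriction, which is exactly what I would like to avoid.

package org.apache.spark.ml.image

import java.awt.image.DataBufferByte
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataType // Spark 1.2 location; moved to org.apache.spark.sql.types in 1.3

// ByteImage and ByteImageUDT are the classes from my first mail.
// Decodes a column of compressed image bytes (e.g. jpeg stored in parquet)
// into ByteImage values.
class ImageDecoder extends UnaryTransformer[Array[Byte], ByteImage, ImageDecoder] {

  override protected def createTransformFunc(paramMap: ParamMap): Array[Byte] => ByteImage =
    decodeJpeg

  override protected def outputDataType: DataType = new ByteImageUDT

  // Placeholder decoder based on javax.imageio; assumes a byte-backed raster.
  private def decodeJpeg(bytes: Array[Byte]): ByteImage = {
    val img = ImageIO.read(new ByteArrayInputStream(bytes))
    val pixels = img.getRaster.getDataBuffer.asInstanceOf[DataBufferByte].getData
    new ByteImage(img.getRaster.getNumBands, img.getWidth, img.getHeight, pixels)
  }
}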