Actually, I think UDTs can translate an object directly into Spark's internal format via ScalaReflection and the encoder, without the intermediate generic Row. You can directly create a Dataset of objects of a UDT-annotated class.
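For a concrete sketch of what I mean, here is roughly what the UDT path looks like, modeled on Spark's own ExamplePoint/ExamplePointUDT test classes (a simplified sketch, not the exact source; note that the UserDefinedType API itself is not public in Spark 2.x, so user code normally reaches it only indirectly):

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// The user class, annotated so ScalaReflection can discover its UDT.
@SQLUserDefinedType(udt = classOf[ExamplePointUDT])
class ExamplePoint(val x: Double, val y: Double) extends Serializable

class ExamplePointUDT extends UserDefinedType[ExamplePoint] {
  // The catalyst type the object is stored as internally.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  // Object -> catalyst value, with no generic Row in between.
  override def serialize(p: ExamplePoint): GenericArrayData =
    new GenericArrayData(Array[Any](p.x, p.y))

  // Catalyst value -> object.
  override def deserialize(datum: Any): ExamplePoint = datum match {
    case values: ArrayData =>
      new ExamplePoint(values.getDouble(0), values.getDouble(1))
  }

  override def userClass: Class[ExamplePoint] = classOf[ExamplePoint]
}

Because of the annotation, ScalaReflection resolves ExamplePoint fields to this UDT when an encoder is derived, so something like Seq(Tuple1(new ExamplePoint(1.0, 2.0))).toDS() (with spark.implicits._ in scope) should go straight to the internal format via serialize, with no generic Row in between.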
If you don't convert the Dataset to a DataFrame, I don't think RowEncoder steps in.

Michael Armbrust wrote:

> An encoder uses reflection
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala>
> to generate expressions that can extract data out of an object (by calling
> methods on the object) and encode its contents directly into the tungsten
> binary row format (and vice versa). We codegenerate bytecode that
> evaluates these expressions in the same way that we code generate code for
> normal expression evaluation in query processing. However, this reflection
> only works for simple ADTs
> <https://en.wikipedia.org/wiki/Algebraic_data_type>. Another key thing to
> realize is that we do this reflection / code generation at runtime, so we
> aren't constrained by binary compatibility across versions of Spark.
>
> UDTs let you write custom code that translates an object into a
> generic row, which we can then translate into Spark's internal format
> (using a RowEncoder). Unlike expressions and tungsten binary encoding, the
> Row type that you return here is a stable public API that hasn't changed
> since Spark 1.3.
>
> So to summarize, if encoders don't work for your specific types you can use
> UDTs, but they probably won't be as efficient. I'd love to unify these code
> paths more, but it's actually a fair amount of work to come up with a good
> stable public API that doesn't sacrifice performance.
>
> On Tue, Dec 27, 2016 at 6:32 AM, dragonly <liyilongko@> wrote:
>
>> I'm recently reading the source code of the SparkSQL project, and found
>> some interesting Databricks blogs about the tungsten project. I've
>> roughly read through the encoder and unsafe-representation parts of the
>> tungsten project (I haven't read the algorithm parts, such as the
>> cache-friendly hashmap algorithms).
>> Now there's a big puzzle in front of me about the codegen of SparkSQL
>> and how the codegen uses the tungsten encoding between JVM objects and
>> unsafe bits.
>> So can anyone tell me what's the main difference between situations
>> where I write a UDT like ExamplePointUDT in SparkSQL and where I just
>> create an ArrayType, which can be handled by the tungsten encoder? I'll
>> really appreciate it if you can go through some concrete code examples.
>> Thanks a lot!

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
