Actually, I think UDTs can translate an object directly into Spark's internal format via ScalaReflection and the encoder, without the intermediate generic Row. You can directly create a Dataset of objects of a UDT-annotated class.
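For a concrete sketch of what I mean, here is roughly what the UDT path looks like, modeled on Spark's own ExamplePoint/ExamplePointUDT test classes (a simplified sketch, not the exact source; note that the UserDefinedType API itself is not public in Spark 2.x, so user code normally reaches it only indirectly):

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// The user class, annotated so ScalaReflection can discover its UDT.
@SQLUserDefinedType(udt = classOf[ExamplePointUDT])
class ExamplePoint(val x: Double, val y: Double) extends Serializable

class ExamplePointUDT extends UserDefinedType[ExamplePoint] {
  // The catalyst type the object is stored as internally.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  // Object -> catalyst value, with no generic Row in between.
  override def serialize(p: ExamplePoint): GenericArrayData =
    new GenericArrayData(Array[Any](p.x, p.y))

  // Catalyst value -> object.
  override def deserialize(datum: Any): ExamplePoint = datum match {
    case values: ArrayData =>
      new ExamplePoint(values.getDouble(0), values.getDouble(1))
  }

  override def userClass: Class[ExamplePoint] = classOf[ExamplePoint]
}

Because of the annotation, ScalaReflection resolves ExamplePoint fields to this UDT when an encoder is derived, so something like Seq(Tuple1(new ExamplePoint(1.0, 2.0))).toDS() (with spark.implicits._ in scope) should go straight to the internal format via serialize, with no generic Row in between.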
If you don't convert the Dataset to a DataFrame, I don't think RowEncoder steps in.

Michael Armbrust wrote:

> An encoder uses reflection
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala>
> to generate expressions that can extract data out of an object (by calling
> methods on the object) and encode its contents directly into the tungsten
> binary row format (and vice versa). We codegenerate bytecode that
> evaluates these expressions in the same way that we code generate code for
> normal expression evaluation in query processing. However, this reflection
> only works for simple ADTs
> <https://en.wikipedia.org/wiki/Algebraic_data_type>. Another key thing to
> realize is that we do this reflection / code generation at runtime, so we
> aren't constrained by binary compatibility across versions of Spark.
>
> UDTs let you write custom code that translates an object into a
> generic row, which we can then translate into Spark's internal format
> (using a RowEncoder). Unlike expressions and tungsten binary encoding, the
> Row type that you return here is a stable public API that hasn't changed
> since Spark 1.3.
>
> So to summarize, if encoders don't work for your specific types you can use
> UDTs, but they probably won't be as efficient. I'd love to unify these code
> paths more, but it's actually a fair amount of work to come up with a good
> stable public API that doesn't sacrifice performance.
>
> On Tue, Dec 27, 2016 at 6:32 AM, dragonly <liyilongko@> wrote:
>
>> I'm recently reading the source code of the SparkSQL project, and found
>> some interesting Databricks blogs about the tungsten project. I've
>> roughly read through the encoder and unsafe-representation parts of the
>> tungsten project (I haven't read the algorithm parts, such as the
>> cache-friendly hashmap algorithms).
>> Now there's a big puzzle in front of me about the codegen of SparkSQL
>> and how the codegen uses the tungsten encoding between JVM objects and
>> unsafe bits.
>> So can anyone tell me what's the main difference between situations
>> where I write a UDT like ExamplePointUDT in SparkSQL and where I just
>> create an ArrayType, which can be handled by the tungsten encoder? I'll
>> really appreciate it if you can go through some concrete code examples.
>> Thanks a lot!

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
