Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Liang-Chi Hsieh
Actually, I think a UDT can translate an object directly into Spark's internal format via ScalaReflection and the encoder, without an intermediate generic row. You can directly create a Dataset of objects of the UDT. As long as you don't convert the Dataset to a DataFrame, I think RowEncoder won't step in.

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Jacek Laskowski
Thanks Herman for the explanation. I silently assume that the other points were OK since you did not object. Correct? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Herman van Hövell tot Westerflier
@Jacek The maximum output of 200 fields for whole-stage code generation was chosen to prevent the generated method from exceeding the 64KB code limit. There is absolutely no relation between this value and the number of partitions after a shuffle (if there were they should have used the

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Jacek Laskowski
Hi Shuai, Disclaimer: I'm not a Spark guru, and what's written below are some notes I took while reading the Spark source code, so I could be wrong, in which case I'd appreciate it a lot if someone could correct me. (Yes, I did copy your disclaimer since it applies to me too. Sorry for the duplication :))

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-02 Thread Shuai Lin
Disclaimer: I'm not a Spark guru, and what's written below are some notes I took while reading the Spark source code, so I could be wrong, in which case I'd appreciate it a lot if someone could correct me. > > Let me rephrase this. How does the SparkSQL engine call the codegen APIs to do the job of

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread dragonly
Thanks for your reply! Here's my *understanding*: basic types that ScalaReflection understands are encoded into the Tungsten binary format, while UDTs are encoded into a GenericInternalRow, which stores the JVM objects in an Array[Any] under the hood, and thus lose the memory-footprint efficiency and

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread Michael Armbrust
An encoder uses reflection to generate expressions that can extract data out of an object (by calling methods on the object) and encode its contents directly into the
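
The idea described above (inspect a class once, then pull field values out of each object and write them straight into a compact binary layout) can be sketched roughly as follows. This is only an analogy in plain Python, not Spark's API or its actual row format: the `Point` class, the `make_encoder` helper, and the type-to-format table are hypothetical names invented for illustration, and Python's `struct` module stands in for the binary representation.

```python
import struct
from dataclasses import dataclass, fields

# Hypothetical mapping from field types to fixed-width binary format codes.
_FORMATS = {int: "q", float: "d", bool: "?"}


@dataclass
class Point:
    x: float
    y: float
    label: int


def make_encoder(cls):
    """Inspect cls once via reflection; return (encode, decode) for a fixed layout."""
    names = [f.name for f in fields(cls)]
    # "<" means little-endian with no padding, so the layout is fully determined
    # by the field types: two 8-byte doubles and one 8-byte integer for Point.
    layout = struct.Struct("<" + "".join(_FORMATS[f.type] for f in fields(cls)))

    def encode(obj):
        # Extract each field from the object and pack it directly into bytes,
        # without building an intermediate generic container of boxed values.
        return layout.pack(*(getattr(obj, name) for name in names))

    def decode(buf):
        return cls(*layout.unpack(buf))

    return encode, decode


encode, decode = make_encoder(Point)
buf = encode(Point(1.5, -2.0, 7))
restored = decode(buf)
```

The reflection step runs once per class rather than once per object, which is the part of the design that makes per-record encoding cheap in this sketch.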

What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread dragonly
I've recently been reading the source code of the SparkSQL project, and found some interesting Databricks blog posts about the Tungsten project. I've roughly read through the encoder and unsafe representation parts of the Tungsten project (haven't read the algorithm part such as cache friendly hashmap