Thanks Herman for the explanation.

I assume the other points were OK since you did not object?
Correct?


Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Jan 3, 2017 at 3:06 PM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> @Jacek The maximum of 200 output fields for whole-stage code generation
> was chosen to prevent the generated method from exceeding the JVM's 64KB
> method code limit. There is absolutely no relation between this value and
> the number of partitions after a shuffle (if there were, they would have
> used the same configuration).
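>
> For example, in spark-shell (a quick sketch; the config keys below are the
> ones I know from the 2.x sources, so please double-check them against your
> version):
>
>   // number of post-shuffle partitions, defaults to 200
>   spark.conf.set("spark.sql.shuffle.partitions", "50")
>   // internal cap on the number of fields before whole-stage codegen is
>   // deactivated, also 200 by default -- changing one setting does not
>   // affect the other
>   spark.conf.set("spark.sql.codegen.maxFields", "100")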
>
> On Tue, Jan 3, 2017 at 1:55 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Shuai,
>>
>> Disclaimer: I'm not a Spark guru, and what's written below are some
>> notes I took when reading the Spark source code, so I could be wrong, in
>> which case I'd appreciate it a lot if someone could correct me.
>>
>> (Yes, I did copy your disclaimer since it applies to me too. Sorry for
>> duplication :))
>>
>> I'd say that the description is very well-written and clear. I'd only add
>> that:
>>
>> 1. CodegenSupport allows custom implementations to optionally disable
>> codegen using the supportCodegen predicate (which is enabled, i.e.
>> true, by default).
>> 2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation
>> of a SparkPlan into another SparkPlan, that searches for sub-plans (aka
>> stages) for which supportCodegen is enabled and collapses them together
>> into a WholeStageCodegen.
>> 3. It is assumed that all Expression instances except CodegenFallback
>> ones support codegen.
>> 4. CollapseCodegenStages uses the internal setting
>> spark.sql.codegen.maxFields (default: 200) to limit the number of
>> fields in the input and output schemas before whole-stage codegen is
>> deactivated (see the snippet after this list). See also
>> https://issues.apache.org/jira/browse/SPARK-14554.
>>
>> NOTE: The magic number 200 (!) again. I asked about it a few days ago in
>> http://stackoverflow.com/questions/41359344/why-is-the-number-of-partitions-after-groupby-200
>>
>> 5. There are side-effecting logical commands (executed for their side
>> effects) that are translated to ExecutedCommandExec in the
>> BasicOperators strategy and do not take part in codegen.
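>>
>> Here is the snippet referenced in point 4 (just a sketch for
>> spark-shell; the exact threshold and the plan markers shown by explain
>> may differ between versions):
>>
>>   // a projection with more than 200 output fields should not be
>>   // wrapped in WholeStageCodegen, a narrow one should be
>>   val wide = spark.range(1).selectExpr((1 to 250).map(i => s"id as c$i"): _*)
>>   wide.explain()
>>   val narrow = spark.range(1).selectExpr((1 to 10).map(i => s"id as c$i"): _*)
>>   narrow.explain()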
>>
>> Thanks for sharing your notes! I'm going to merge yours with mine.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
>> > Disclaimer: I'm not a Spark guru, and what's written below are some
>> > notes I took when reading the Spark source code, so I could be wrong,
>> > in which case I'd appreciate it a lot if someone could correct me.
>> >
>> >>
>> >> > Let me rephrase this. How does the SparkSQL engine call the codegen
>> >> > APIs to do the job of producing RDDs?
>> >
>> >
>> > IIUC, physical operators like `ProjectExec` implement
>> > doProduce/doConsume to support codegen, and when whole-stage codegen
>> > is enabled, a subtree is collapsed into a WholeStageCodegenExec
>> > wrapper tree, and the root node of the wrapper tree calls the
>> > doProduce/doConsume method of each operator to generate the Java
>> > source code, which is then compiled into Java bytecode by Janino.
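>> >
>> > A quick way to see this in spark-shell (a sketch; class and method
>> > names as I found them in the Spark 2.x sources):
>> >
>> >   import org.apache.spark.sql.execution.WholeStageCodegenExec
>> >   val qe = spark.range(10).filter("id > 5").queryExecution
>> >   // the root of the executed plan is the collapsed wrapper tree
>> >   qe.executedPlan.isInstanceOf[WholeStageCodegenExec]
>> >   // dumps the Java source generated for the collapsed stages
>> >   qe.debug.codegen()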
>> >
>> > In contrast, when whole-stage codegen is disabled (e.g. by passing
>> > "--conf spark.sql.codegen.wholeStage=false" to spark-submit), the
>> > doExecute methods of the physical operators are called, so no code
>> > generation happens.
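>> >
>> > The same switch can also be flipped at runtime (a sketch; same config
>> > key as above):
>> >
>> >   spark.conf.set("spark.sql.codegen.wholeStage", "false")
>> >   spark.range(10).filter("id > 5").explain()  // no WholeStageCodegen wrapper
>> >   spark.conf.set("spark.sql.codegen.wholeStage", "true")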
>> >
>> > Producing the RDDs is a post-order evaluation of the SparkPlan tree.
>> > A leaf node is some data source: a file-based HadoopFsRelation, an
>> > external data source like JdbcRelation, an in-memory LocalRelation
>> > built from a local collection, or the Range created by
>> > "spark.range(100)". In any case, the leaf nodes can produce rows on
>> > their own. The evaluation then proceeds in a bottom-up manner,
>> > applying filter/limit/project etc. along the way. Either the generated
>> > code or the various doExecute methods are called, depending on whether
>> > codegen is enabled (the default) or not.
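>> >
>> > For instance (a sketch; toRdd is the QueryExecution entry point I'm
>> > aware of):
>> >
>> >   val qe = spark.range(100).selectExpr("id * 2 as x").queryExecution
>> >   qe.executedPlan       // the physical tree, with the Range leaf at the bottom
>> >   val rdd = qe.toRdd    // RDD[InternalRow] built by executing the tree bottom up
>> >   rdd.getNumPartitions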
>> >
>> >> > What are those eval methods in Expressions for, given there's
>> >> > already a doGenCode next to it?
>> >
>> >
>> > AFAIK the `eval` method of Expression is used to do static evaluation
>> > when the expression is foldable, e.g.:
>> >
>> >    select map('a', 1, 'b', 2, 'a', 3) as m
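>> >
>> > The interpreted path can also be exercised directly on a Catalyst
>> > expression (a sketch; class names from the catalyst.expressions
>> > package):
>> >
>> >   import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
>> >   val e = Add(Literal(1), Literal(2))
>> >   e.foldable   // true
>> >   e.eval()     // 3 -- evaluated without any code generation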
>> >
>> > Regards,
>> > Shuai
>> >
>> >
>> > On Wed, Dec 28, 2016 at 1:05 PM, dragonly <liyilon...@gmail.com> wrote:
>> >>
>> >> Thanks for your reply!
>> >>
>> >> Here's my *understanding*:
>> >> basic types that ScalaReflection understands are encoded into the
>> >> tungsten binary format, while UDTs are encoded into
>> >> GenericInternalRow, which stores the JVM objects in an Array[Any]
>> >> under the hood, and thus lose the memory-footprint and CPU-cache
>> >> efficiency provided by the tungsten encoding.
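>> >>
>> >> A quick way to check the first half of that in spark-shell (a sketch;
>> >> UnsafeRow is the tungsten binary row format class):
>> >>
>> >>   // rows of basic types should come back as
>> >>   // org.apache.spark.sql.catalyst.expressions.UnsafeRow
>> >>   spark.range(3).queryExecution.toRdd.first.getClass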
>> >>
>> >> If the above is correct, then here are my *further questions*:
>> >> Are SparkPlan nodes (those ending with Exec) all code-generated before
>> >> actually running the toRdd logic? I know there are some
>> >> non-codegenable nodes which implement the trait CodegenFallback, but
>> >> there's also a doGenCode method in the trait, so the actual calling
>> >> convention really puzzles me. I've tried to trace the calling flow for
>> >> a few days but found it scattered everywhere. I cannot make a big
>> >> graph of the method calling order even with the help of IntelliJ.
>> >>
>> >> Let me rephrase this. How does the SparkSQL engine call the codegen
>> >> APIs to do the job of producing RDDs? What are those eval methods in
>> >> Expressions for, given there's already a doGenCode next to it?
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >> http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-mainly-different-from-a-UDT-and-a-spark-internal-type-that-ExpressionEncoder-recognized-tp20370p20376.html
>> >> Sent from the Apache Spark Developers List mailing list archive at
>> >> Nabble.com.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
>
