Disclaimer: I'm not a Spark guru, and what's written below are some notes I
took while reading the Spark source code, so I could be wrong; if so, I'd
appreciate it a lot if someone could correct me.


> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
> to do the job of producing RDDs?


IIUC, physical operators like `ProjectExec` implement doProduce/doConsume
to support codegen. When whole-stage codegen is enabled, a subtree of the
physical plan is collapsed into a WholeStageCodegenExec wrapper, and the
root node of that wrapper calls the doProduce/doConsume methods of each
operator to generate Java source code, which is then compiled into Java
bytecode by Janino.
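
For example (a minimal sketch; the local SparkSession and the query are my
own, and I'm assuming the debug helpers in
org.apache.spark.sql.execution.debug), the '*' prefix in the explain output
marks operators collapsed into a single WholeStageCodegenExec:

   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   val df = spark.range(100).filter("id > 10").selectExpr("id * 2 AS x")

   // Operators prefixed with '*' run inside one WholeStageCodegenExec.
   df.explain()

   // Dump the Java source that Janino compiles for this plan.
   import org.apache.spark.sql.execution.debug._
   df.debugCodegen()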

In contrast, when whole-stage codegen is disabled (e.g. by passing "--conf
spark.sql.codegen.wholeStage=false" to spark-submit), the doExecute method
of each physical operator is called instead, so no code generation happens.
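
As a sketch of the same switch at runtime (I'm assuming the key is settable
per session via SQLConf, which matches how I've seen it used):

   // Disable whole-stage codegen for this session only.
   spark.conf.set("spark.sql.codegen.wholeStage", "false")

   // No '*' prefixes in the plan now; each operator's doExecute runs.
   df.explain()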

Producing the RDDs is a post-order evaluation of the SparkPlan tree. The
leaf nodes are the data sources: a file-based HadoopFsRelation, an external
source like JdbcRelation, or an in-memory LocalRelation built from a local
collection ("spark.range(100)" is similar, producing its rows locally). In
all cases the leaf nodes can produce rows on their own. The evaluation then
proceeds bottom-up, applying filter/limit/project etc. along the way, and
either the generated code or the various doExecute methods are invoked,
depending on whether codegen is enabled (the default) or not.
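
You can see that tree, and the hand-off to RDDs, like this (a sketch reusing
the `spark` session from above; `queryExecution.toRdd` is, as far as I can
tell, where the physical plan turns into an RDD[InternalRow]):

   val q = spark.range(100).filter("id > 10").select("id")

   // The physical (executed) plan; the leaf here is the Range source.
   println(q.queryExecution.executedPlan)

   // Post-order evaluation of that tree yields an RDD[InternalRow].
   val rdd = q.queryExecution.toRdd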

> What are those eval methods in Expressions for given there's already a
> doGenCode next to it?


AFAIK the `eval` method of Expression does interpreted (non-codegen)
evaluation; one place it is used is static evaluation of foldable
expressions at optimization time, e.g.:

   select map('a', 1, 'b', 2, 'a', 3) as m
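
Here the whole map(...) call can be evaluated before execution. A tiny
sketch of interpreted evaluation (Add and Literal below are the real
Catalyst expression classes; I believe the optimizer's ConstantFolding rule
does essentially this):

   import org.apache.spark.sql.catalyst.expressions.{Add, Literal}

   val e = Add(Literal(1), Literal(2))
   assert(e.foldable)  // no attribute references, so it can be folded
   println(e.eval())   // 3, computed with no code generation at all

IIRC `eval` is also what a CodegenFallback expression ends up executing at
runtime: its doGenCode simply emits a call back into the interpreted eval,
which is why the two methods live side by side.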

Regards,
Shuai


On Wed, Dec 28, 2016 at 1:05 PM, dragonly <liyilon...@gmail.com> wrote:

> Thanks for your reply!
>
> Here's my *understanding*:
> basic types that ScalaReflection understands are encoded into the Tungsten
> binary format, while UDTs are encoded into GenericInternalRow, which stores
> the JVM objects in an Array[Any] under the hood, and thus lose the memory
> footprint and CPU cache efficiency provided by the Tungsten encoding.
>
> If the above is correct, then here are my *further questions*:
> Are SparkPlan nodes (those ending with Exec) all code-generated before
> actually running the toRdd logic? I know there are some non-codegenable
> nodes which implement trait CodegenFallback, but there's also a doGenCode
> method in the trait, so the actual calling convention really puzzles me.
> I've tried to trace the calling flow for a few days but found it scattered
> everywhere; I cannot make a big graph of the method-calling order even with
> the help of IntelliJ.
>
> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
> to do the job of producing RDDs? What are those eval methods in Expressions
> for, given there's already a doGenCode next to them?
