Thanks Herman for the explanation. I take it the other points were OK since you did not object to them. Correct?

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Tue, Jan 3, 2017 at 3:06 PM, Herman van Hövell tot Westerflier
<hvanhov...@databricks.com> wrote:

> @Jacek The maximum output of 200 fields for whole-stage code generation
> was chosen to prevent the generated method from exceeding the JVM's 64KB
> bytecode limit per method. There is absolutely no relation between this
> value and the number of partitions after a shuffle (if there were, they
> would have used the same configuration).
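For readers following the thread: one way to convince yourself that these are two independent knobs is to put the two configuration keys side by side. A minimal sketch, assuming Spark 2.x and a SparkSession named `spark`; note that spark.sql.codegen.maxFields is an internal setting, so treat this as illustrative only:

    // Cap on the number of input/output fields before whole-stage codegen
    // is deactivated -- a safeguard against the JVM's 64KB-per-method
    // bytecode limit.
    spark.conf.set("spark.sql.codegen.maxFields", "200")

    // Number of partitions used after a shuffle -- a completely separate
    // setting that merely shares the default value of 200.
    spark.conf.set("spark.sql.shuffle.partitions", "200")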
> On Tue, Jan 3, 2017 at 1:55 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Shuai,
>>
>> Disclaimer: I'm not a Spark guru, and what's written below are some
>> notes I took when reading the Spark source code, so I could be wrong, in
>> which case I'd appreciate it a lot if someone could correct me.
>>
>> (Yes, I did copy your disclaimer since it applies to me too. Sorry for
>> the duplication :))
>>
>> I'd say that the description is very well written and clear. I'd only
>> add that:
>>
>> 1. CodegenSupport allows custom implementations to optionally disable
>> codegen using the supportCodegen predicate (which is enabled by default,
>> i.e. true).
>> 2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation of
>> one SparkPlan into another, that searches for sub-plans (aka stages)
>> that support codegen and collapses them together into a
>> WholeStageCodegen for which supportCodegen is enabled.
>> 3. It is assumed that all Expression instances except CodegenFallback
>> support codegen.
>> 4. CollapseCodegenStages uses the internal setting
>> spark.sql.codegen.maxFields (default: 200) to cap the number of fields
>> in the input and output schemas beyond which whole-stage codegen is
>> deactivated. See https://issues.apache.org/jira/browse/SPARK-14554.
>>
>> NOTE: The magic number 200 (!) again. I asked about it a few days ago in
>> http://stackoverflow.com/questions/41359344/why-is-the-number-of-partitions-after-groupby-200
>>
>> 5. There are side-effecting logical commands that are executed for their
>> side effects; they are translated to ExecutedCommandExec by the
>> BasicOperators strategy and take no part in codegen.
>>
>> Thanks for sharing your notes! Gonna merge yours with mine. Thanks.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
>> > Disclaimer: I'm not a Spark guru, and what's written below are some
>> > notes I took when reading the Spark source code, so I could be wrong,
>> > in which case I'd appreciate it a lot if someone could correct me.
>> >
>> >> Let me rephrase this. How does the SparkSQL engine call the codegen
>> >> APIs to do the job of producing RDDs?
>> >
>> > IIUC, physical operators like `ProjectExec` implement
>> > doProduce/doConsume to support codegen, and when whole-stage codegen
>> > is enabled, a subtree is collapsed into a WholeStageCodegenExec
>> > wrapper tree, and the root node of the wrapper tree calls the
>> > doProduce/doConsume method of each operator to generate the Java
>> > source code, which is compiled into Java bytecode by Janino.
>> >
>> > In contrast, when whole-stage codegen is disabled (e.g. by passing
>> > "--conf spark.sql.codegen.wholeStage=false" to spark-submit), the
>> > doExecute method of each physical operator is called instead, so no
>> > code generation happens.
>> >
>> > Producing the RDDs is a post-order evaluation of the SparkPlan tree.
>> > The leaf node is some data source: either a file-based
>> > HadoopFsRelation, an external data source like JdbcRelation, or an
>> > in-memory LocalRelation created by "spark.range(100)". In any case,
>> > the leaf nodes can produce rows on their own. The evaluation then
>> > proceeds bottom-up, applying filter/limit/project etc. along the way.
>> > Either the generated code or the various doExecute methods are called,
>> > depending on whether codegen is enabled (the default) or not.
>> >
>> >> What are those eval methods in Expressions for, given there's already
>> >> a doGenCode next to it?
>> >
>> > AFAIK the `eval` method of Expression is used to do static evaluation
>> > when the expression is foldable, e.g.:
>> >
>> > select map('a', 1, 'b', 2, 'a', 3) as m
>> >
>> > Regards,
>> > Shuai
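To see the collapsing Shuai describes in action, here is a minimal sketch (assuming Spark 2.x and a SparkSession named `spark`; the exact plan text varies by version). Operators prefixed with `*` in the explain() output are the ones collapsed into a WholeStageCodegenExec:

    import spark.implicits._

    val df = spark.range(100).filter($"id" > 10).selectExpr("id * 2 AS doubled")
    df.explain()
    // Prints something like:
    // == Physical Plan ==
    // *Project [(id#0L * 2) AS doubled#2L]
    // +- *Filter (id#0L > 10)
    //    +- *Range (0, 100, step=1, splits=Some(8))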
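And on the `eval` point: the optimizer's ConstantFolding rule is one place where interpreted evaluation happens at planning time — it calls eval on foldable expressions and replaces them with literals. A sketch with a trivially foldable expression (again assuming a SparkSession named `spark`; whether Shuai's map(...) example folds the same way depends on the Spark version):

    // The optimizer evaluates 1 + 2 once via eval() and plans a literal.
    val q = spark.sql("SELECT 1 + 2 AS three")
    q.explain(extended = true)
    // The "Optimized Logical Plan" section shows something like:
    // Project [3 AS three#0]
    // +- OneRowRelation$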
>> > On Wed, Dec 28, 2016 at 1:05 PM, dragonly <liyilon...@gmail.com> wrote:
>> >>
>> >> Thanks for your reply!
>> >>
>> >> Here's my *understanding*:
>> >> basic types that ScalaReflection understands are encoded into the
>> >> Tungsten binary format, while UDTs are encoded into
>> >> GenericInternalRow, which stores the JVM objects in an Array[Any]
>> >> under the hood, and thus loses the memory-footprint and CPU-cache
>> >> efficiency provided by the Tungsten encoding.
>> >>
>> >> If the above is correct, then here are my *further questions*:
>> >> Are SparkPlan nodes (those ending with Exec) all code-generated before
>> >> actually running the toRdd logic? I know there are some
>> >> non-codegenable nodes which implement the trait CodegenFallback, but
>> >> there's also a doGenCode method in the trait, so the actual calling
>> >> convention really puzzles me. I've tried to trace the calling flow for
>> >> a few days but found it scattered everywhere; I cannot make a big
>> >> graph of the method calling order even with the help of IntelliJ.
>> >>
>> >> Let me rephrase this. How does the SparkSQL engine call the codegen
>> >> APIs to do the job of producing RDDs? What are those eval methods in
>> >> Expressions for, given there's already a doGenCode next to them?
>
> --
> Herman van Hövell
> Software Engineer
> Databricks Inc.
> hvanhov...@databricks.com
> +31 6 420 590 27
> databricks.com
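PS. On the UDT question at the bottom of the thread: below is a minimal sketch of a UDT for a hypothetical Point class, to make the GenericInternalRow angle concrete. Note the UserDefinedType API was made private in Spark 2.0, so outside the org.apache.spark namespace this is illustrative only; the point is that serialize produces a generic Catalyst value rather than a specialized Tungsten encoding:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

    // A hypothetical 2-D point tied to its UDT via the annotation.
    @SQLUserDefinedType(udt = classOf[PointUDT])
    class Point(val x: Double, val y: Double) extends Serializable

    class PointUDT extends UserDefinedType[Point] {
      // The Catalyst type this UDT serializes to.
      override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

      // JVM object -> Catalyst value (what actually gets stored).
      override def serialize(p: Point): ArrayData =
        new GenericArrayData(Array[Any](p.x, p.y))

      // Catalyst value -> JVM object.
      override def deserialize(datum: Any): Point = datum match {
        case a: ArrayData => new Point(a.getDouble(0), a.getDouble(1))
      }

      override def userClass: Class[Point] = classOf[Point]
    }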