Github user bdrillard commented on a diff in the pull request:

https://github.com/apache/spark/pull/19480#discussion_r144564741

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -277,13 +292,25 @@ class CodegenContext {
       funcName: String,
       funcCode: String,
       inlineToOuterClass: Boolean = false): String = {
+    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
+    newFunction match {
+      case NewFunction(functionName, None, None) => functionName
+      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
+        subclassInstance + "." + functionName
+    }
+  }
+
+  private[this] def addNewFunctionInternal(
+      funcName: String,
+      funcCode: String,
+      inlineToOuterClass: Boolean): NewFunction = {
     // The number of named constants that can exist in the class is limited by the Constant Pool
     // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
-    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
+    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
     // sub-class.
     val (className, classInstance) = if (inlineToOuterClass) {
       outerClassName -> ""
-    } else if (currClassSize > 1600000) {
+    } else if (currClassSize > 1000000) {
--- End diff --

@gatorsmile It's a byte threshold, similar to the 1024-byte threshold set in `splitExpressions`. We can't know exactly how much a block of code will contribute to the constant pool; that is, there's no easy static analysis we can perform to say "this code will contribute `n` entries to the constant pool". We only know that the size of the code is strongly correlated with the number of entries in the constant pool. We're trying to keep the number of generated classes as low as possible while still grouping the code finely enough to avoid the constant pool error.

In short, I chose the value empirically, by testing different types of schemas with many columns. There's no particular harm in setting the value lower, as is done here, if it helps us avoid a known constant pool error case. Doing so reduces the number of expressions each nested class holds, and so increases the total number of nested classes.
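
To make the trade-off concrete, here is a minimal, self-contained sketch of size-based placement. It is not the code in `CodegenContext`: `FunctionPlacer`, `Placement`, `PlacementDemo`, and the generated class and instance names are all hypothetical, and it assumes code size is approximated by the character length of the function body.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the placement decision discussed above.
// `instance` is None for the outer class and Some(name) for a nested class.
final case class Placement(className: String, instance: Option[String])

final class FunctionPlacer(thresholdBytes: Int, outerClassName: String = "OuterClass") {
  // (class name, instance name, accumulated code size); the first entry is the outer class.
  private val classes =
    mutable.ArrayBuffer[(String, Option[String], Int)]((outerClassName, None, 0))

  def nestedClassCount: Int = classes.size - 1

  // Put the function in the current class, opening a new nested class first
  // if the current one has already grown past the threshold.
  def place(funcCode: String): Placement = {
    val (_, _, currentSize) = classes.last
    if (currentSize > thresholdBytes) {
      val id = classes.size
      classes += ((s"NestedClass$id", Some(s"nestedClassInstance$id"), 0))
    }
    val (name, instance, size) = classes.last
    // Assumption: character length stands in for the byte size of the code.
    classes(classes.size - 1) = (name, instance, size + funcCode.length)
    Placement(name, instance)
  }

  // Functions in a nested class are called through that class's instance.
  def qualifiedName(funcName: String, placement: Placement): String =
    placement.instance.fold(funcName)(inst => s"$inst.$funcName")
}

object PlacementDemo {
  // Ten hypothetical function bodies of 400 characters each.
  private val funcs: Seq[String] = Seq.fill(10)("x" * 400)

  def nestedClassesFor(thresholdBytes: Int): Int = {
    val placer = new FunctionPlacer(thresholdBytes)
    funcs.foreach(f => placer.place(f))
    placer.nestedClassCount
  }
}
```

With these toy inputs, `PlacementDemo.nestedClassesFor(1600)` opens one nested class and `PlacementDemo.nestedClassesFor(1000)` opens three: the lower threshold puts fewer functions in each class and produces more classes overall.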