Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144564741
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
    @@ -277,13 +292,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
         // sub-class.
         val (className, classInstance) = if (inlineToOuterClass) {
           outerClassName -> ""
    -    } else if (currClassSize > 1600000) {
    +    } else if (currClassSize > 1000000) {
    --- End diff --
    
    @gatorsmile it's a byte threshold, similar to the 1024-byte threshold set in `splitExpressions`. We can't know exactly how much code will contribute to the constant pool; that is, there's no easy static analysis we can perform on a block of code to say "this code will contribute `n` entries to the constant pool". We only know that the size of the code is strongly correlated with the number of entries in the constant pool. We're trying to keep the number of generated classes as low as possible while also grouping enough code into each class to avoid the constant pool error.
    
    In short, I tested different types of schemas with many columns to find, empirically, what the value could be set to.
    
    There's no particular harm in setting the value lower, as is done here, if it helps us avoid a known constant pool error case. Doing so would effectively reduce the number of expressions each nested class holds, and correspondingly increase the total number of nested classes.
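    To make the mechanism concrete, here is a minimal, self-contained sketch of the split-and-qualify pattern the diff implements. The names `PlacementSketch` and `place`, the nested class names, and the signature that takes the class size as a parameter are all illustrative assumptions for this sketch; Spark's actual `addNewFunctionInternal` tracks `currClassSize` as mutable codegen state rather than receiving it as an argument.

    ```scala
    object PlacementSketch {
      // Illustrative byte threshold, matching the 1000k value in the diff.
      val classSizeThreshold: Int = 1000000

      // Mirrors the NewFunction shape in the diff: a function name plus the
      // optional nested subclass (name, instance) the function was placed in.
      case class NewFunction(
          functionName: String,
          subclassName: Option[String],
          subclassInstance: Option[String])

      // Hypothetical placement decision: keep the function in the outer class
      // until the current class's generated code passes the byte threshold,
      // then spill it into a nested subclass (names here are made up).
      def place(
          funcName: String,
          currClassSize: Int,
          inlineToOuterClass: Boolean): NewFunction =
        if (inlineToOuterClass || currClassSize <= classSizeThreshold) {
          NewFunction(funcName, None, None)
        } else {
          NewFunction(funcName, Some("NestedClass1"), Some("nestedClassInstance1"))
        }

      // Build the call-site name, as in the diff's match expression: a bare
      // name for outer-class functions, instance-qualified for nested ones.
      def qualify(f: NewFunction): String = f match {
        case NewFunction(name, None, None) => name
        case NewFunction(name, Some(_), Some(instance)) => instance + "." + name
      }
    }
    ```

    With this sketch, `qualify(place("apply_1", 500000, false))` yields the bare name `"apply_1"`, while a class already over the threshold yields an instance-qualified call like `"nestedClassInstance1.apply_2"`, which is why lowering the threshold grows the count of nested classes rather than breaking any call sites.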

