Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Thanks for your comment @kiszk. It makes sense, actually, that a very small
number like 10000 can cause exceptions. I will explain later why this happens;
first, a brief introduction to the problem, summarizing my knowledge of the
topic.
    
    Code generation can create too many entries in the constant pool of the
generated classes. This happens when we have thousands of columns. Many things
can add new entries to the constant pool. After the PR you referenced for
SPARK-18016, we have many constant pools: one for the outer class and one for
each of the inner classes. In particular, in Spark code generation, the problem
is that the constant pool contains entries for:
    
     1 - each attribute declared in a class: all the attributes are added to
the outer class, so there is a limit here too. We should split them as well,
but that is out of my focus for now.
     2 - each method declared in a class: the number of methods per class is
limited by splitting them among the inner classes. This is controlled by the
threshold which is the subject of this PR.
     3 - each method referenced in a class: at the moment, even though the
methods are declared in the inner classes, they are all invoked from the
outer class. This is also a limit.
    
    Point 1 explains why a very low number like 10000 can create the problem.
Since the outer class has an attribute for each nested class, if we have a lot
of them we can reach the limit. With the current value or similar ones (in the
use case I am working on I set it to 1000000), even with thousands of columns
the number of inner classes stays very small, so this is not a problem. But
since we have multiple fields per column, the number of columns is still
limited by this.
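    Some back-of-the-envelope arithmetic makes this concrete. The sizes below are assumptions for illustration, not measured Spark values; the JVM's per-class limit of 65536 constant pool entries is real:

```java
// Illustrative arithmetic: a low split threshold means many inner classes,
// each adding a field (and thus several constant pool entries) to the outer
// class. The code size and entries-per-field figures are rough assumptions.
public class PoolArithmetic {
    public static void main(String[] args) {
        final long POOL_LIMIT = 65536;          // JVM limit: u2 constant_pool_count
        long generatedCodeSize = 200_000_000L;  // total generated code, bytes (assumed)
        long entriesPerField = 4;               // Fieldref + NameAndType + Utf8s (rough)

        for (long threshold : new long[]{10_000, 1_000_000}) {
            long innerClasses = generatedCodeSize / threshold;
            long outerPoolEntries = innerClasses * entriesPerField;
            System.out.println("threshold=" + threshold
                + " innerClasses=" + innerClasses
                + " outerPoolEntries~" + outerPoolEntries
                + " overflow=" + (outerPoolEntries > POOL_LIMIT));
        }
    }
}
```

    With these assumed figures, a threshold of 10000 yields 20000 inner classes and roughly 80000 outer-pool entries (overflow), while 1000000 yields only 200 inner classes (no overflow).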
    
    In the use case I am working on, I hit problems 2 and 3, actually in the
reverse order. I have a fix for 3 too, but I have not yet opened a PR because
I was unable to reproduce that problem in a simple way usable in a UT, and I
was not sure what the right approach for such a PR would be (if someone has a
suggestion here, I'd really appreciate the advice). As I said, the use case is
rather complex, and all my attempts to reproduce it in an easy way were
unsuccessful, since I always hit issue 1 first.
    
    So, after this rather long introduction, I hope it is clear that this PR's
goal is to address the cases in which we have a lot of very small functions,
and that is what drives us to the constant pool's limit. So far I have been
unable to reproduce this issue in an easy way, I am sorry.
    
    What I would be able to do in a test case is check the number of inner
classes which are created (`CodegenContext.classes`): with a lower threshold
there should be more of them than with a bigger one. If this is acceptable as
a UT, I can do that.
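    The property such a UT would check can be sketched as follows. This is a runnable simulation, not Spark's actual splitting code: a packer that starts a new inner class once the accumulated function size would exceed the threshold, so the same set of functions must produce more classes under a lower threshold:

```java
import java.util.Arrays;

// Sketch of the property the proposed UT would assert: lower split
// threshold => more inner classes (cf. CodegenContext.classes).
// This is a simplified model, not Spark's actual implementation.
public class SplitterSketch {
    // Pack function bodies into classes, starting a new class once the
    // accumulated size would exceed the threshold.
    static int classCount(int[] functionSizes, int threshold) {
        int classes = 1, current = 0;
        for (int size : functionSizes) {
            if (current + size > threshold) { classes++; current = 0; }
            current += size;
        }
        return classes;
    }

    public static void main(String[] args) {
        int[] sizes = new int[1000];
        Arrays.fill(sizes, 300); // 1000 small generated functions, ~300 bytes each
        int low = classCount(sizes, 10_000);     // low threshold: many classes
        int high = classCount(sizes, 1_000_000); // high threshold: few classes
        System.out.println(low + " vs " + high);
    }
}
```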
    
    I hope this was clear enough. Please don't hesitate to ask for
clarifications if needed.
    
    Thanks.


