Josh Rosen created SPARK-15680:
----------------------------------

             Summary: Disable comments in generated code in order to avoid performance issues
                 Key: SPARK-15680
                 URL: https://issues.apache.org/jira/browse/SPARK-15680
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Josh Rosen
            Assignee: Josh Rosen


In benchmarks involving tables with very wide and complex schemas (thousands of 
columns, deep nesting), I noticed that a significant amount of time (on the 
order of tens of seconds per task) was being spent generating comments during 
the code generation phase.

The root cause of the performance problem is that calling {{toString()}} on a 
complex expression can involve thousands of string concatenations, resulting 
in huge amounts (tens of gigabytes) of character array allocation and copying 
(see attached profiler screenshots).
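
To illustrate why nested concatenation gets expensive, here is a minimal Java 
sketch. It is not Spark's code: {{NaiveExpr}}, {{render()}}, {{chain()}} and the 
{{charsCopied}} counter are hypothetical names, and the counter is only a rough 
proxy for character array copying. Each level concatenates the full string of 
its child, so text produced deep in the tree is copied again by every ancestor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an expression tree node; not Spark's Expression class.
final class NaiveExpr {
    // Rough proxy for copying work: sums the length of every intermediate string built.
    static long charsCopied = 0;

    final String name;
    final List<NaiveExpr> children = new ArrayList<>();

    NaiveExpr(String name) { this.name = name; }

    NaiveExpr add(NaiveExpr child) { children.add(child); return this; }

    // Each level concatenates the children's complete strings, so the text of a
    // deeply nested node is re-copied once per ancestor.
    String render() {
        String out = name + "(";
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) out = out + ", ";
            out = out + children.get(i).render();
            charsCopied += out.length(); // the concatenation copied the whole prefix
        }
        out = out + ")";
        charsCopied += out.length();
        return out;
    }

    // Builds a nested chain f(f(...x()...)) of the given depth.
    static NaiveExpr chain(int depth) {
        NaiveExpr e = new NaiveExpr("x");
        for (int i = 0; i < depth; i++) e = new NaiveExpr("f").add(e);
        return e;
    }
}
```

For a chain of depth n the final string grows linearly in n, while the counter 
grows roughly quadratically, which is the pattern that becomes painful for 
schemas with thousands of columns and deep nesting.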

In the long term, we can avoid this problem by passing StringBuilders down the 
tree and using them to accumulate output. In the short term, however, I think 
that we should just disable comments in the generated code by default since 
very long comments are typically not useful debugging aids (since they're 
truncated for display anyway).
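
Both ideas can be sketched in a few lines of Java. This is an illustration 
under stated assumptions, not the actual patch: {{BuilderExpr}}, {{appendTo()}}, 
{{CodegenComments}} and the {{commentsEnabled}} flag are hypothetical names. The 
first class threads a single StringBuilder through the tree; the second takes 
the comment text as a {{Supplier}} so that the expensive string is never built 
while comments are disabled.

```java
import java.util.function.Supplier;

// Hypothetical sketch; class, method, and flag names are illustrative, not Spark's API.
final class BuilderExpr {
    final String name;
    final BuilderExpr[] children;

    BuilderExpr(String name, BuilderExpr... children) {
        this.name = name;
        this.children = children;
    }

    // Appends this node's text to a builder shared by the whole tree, so each
    // piece of text is appended once instead of being re-copied by every ancestor.
    void appendTo(StringBuilder sb) {
        sb.append(name).append('(');
        for (int i = 0; i < children.length; i++) {
            if (i > 0) sb.append(", ");
            children[i].appendTo(sb);
        }
        sb.append(')');
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        appendTo(sb);
        return sb.toString();
    }
}

final class CodegenComments {
    // Assumed flag; off by default, mirroring the short-term proposal.
    static boolean commentsEnabled = false;

    // Accepts a Supplier so the comment text is only computed when comments are on.
    static String comment(Supplier<String> text) {
        if (!commentsEnabled) return "";
        // Break up any "*/" in the text so it cannot close the comment early.
        return "/* " + text.get().replace("*/", "* /") + " */";
    }
}
```

With the flag off, {{CodegenComments.comment(expr::toString)}} returns an empty 
string without ever calling {{toString()}}, so the cost is only paid by users 
who opt in to comments for debugging.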



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


