Josh Rosen created SPARK-15680:
----------------------------------

             Summary: Disable comments in generated code in order to avoid performance issues
                 Key: SPARK-15680
                 URL: https://issues.apache.org/jira/browse/SPARK-15680
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Josh Rosen
            Assignee: Josh Rosen

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that a significant amount of time (on the order of tens of seconds per task) was being spent generating comments during the code generation phase.

The root cause of the performance problem is that calling {{toString()}} on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying (see attached profiler screenshots).

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. In the short term, however, I think that we should just disable comments in the generated code by default, since very long comments are typically not useful debugging aids (they're truncated for display anyway).
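
To illustrate the long-term idea of passing a StringBuilder down the tree, here is a minimal sketch in plain Java. The {{ExprNode}} class and its method names are hypothetical and are not Spark's actual expression classes. It only contrasts per-node string concatenation with appending into one shared builder.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical expression tree node, for illustration only (not a Spark class). */
final class ExprNode {
    private final String name;
    private final List<ExprNode> children = new ArrayList<>();

    ExprNode(String name) {
        this.name = name;
    }

    ExprNode add(ExprNode child) {
        children.add(child);
        return this;
    }

    /** Concatenation-based rendering: every node builds and copies new Strings. */
    String renderByConcat() {
        String out = name;
        if (!children.isEmpty()) {
            out += "(";
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    out += ", ";
                }
                // The child's full text is copied again into this node's string.
                out += children.get(i).renderByConcat();
            }
            out += ")";
        }
        return out;
    }

    /** Builder-based rendering: all nodes append into the same StringBuilder. */
    void renderInto(StringBuilder sb) {
        sb.append(name);
        if (!children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                children.get(i).renderInto(sb);
            }
            sb.append(')');
        }
    }

    /** Convenience entry point that allocates the single shared builder. */
    String render() {
        StringBuilder sb = new StringBuilder();
        renderInto(sb);
        return sb.toString();
    }
}
```

Both methods produce the same text. In the concatenation version, a node's text is copied again at each ancestor level, so the copying cost grows with the depth and width of the tree. In the builder version, each piece of text is appended once, apart from the builder's own internal resizing.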