Matthew Livesey created SPARK-16191:
---------------------------------------
             Summary: Code-Generated SpecificColumnarIterator fails for wide pivot with caching
                 Key: SPARK-16191
                 URL: https://issues.apache.org/jira/browse/SPARK-16191
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Matthew Livesey

When caching a pivot of more than 2260 columns, the SpecificColumnarIterator instance produced by code generation fails to compile with:

bq. failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()Z" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator" grows beyond 64 KB

This can be reproduced in PySpark with the following (it took some trial and error to find that 2261 is the magic number at which the generated class breaks the 64KB limit):

{code}
def build_pivot(width):
    categories = ["cat_%s" % i for i in range(0, width)]
    customers = ["cust_%s" % i for i in range(0, 10)]
    rows = []
    for cust in customers:
        for cat in categories:
            for i in range(0, 4):
                row = (cust, cat, i, 7.0)
                rows.append(row)
    rdd = sc.parallelize(rows)
    df = sqlContext.createDataFrame(rdd, ["customer", "category", "instance", "value"])
    pivot_value_rows = df.select("category").distinct().orderBy("category").collect()
    pivot_values = [r.category for r in pivot_value_rows]
    import pyspark.sql.functions as func
    # Caching the pivoted DataFrame is what triggers generation of SpecificColumnarIterator.
    pivot = df.groupBy('customer').pivot("category", pivot_values).agg(func.sum(df.value)).cache()
    pivot.write.save('my_pivot', mode='overwrite')

# 2260 columns succeeds; 2261 pushes the generated class past the 64KB limit.
for i in [2260, 2261]:
    try:
        build_pivot(i)
        print "Succeeded for %s" % i
    except:
        print "Failed for %s" % i
{code}

Removing the `cache()` call avoids the problem and allows wider pivots; since ColumnarIterator is generated specifically for caching, it is not produced when caching is not used. This could be symptomatic of a general problem that generated code can break the 64KB bytecode limit, and so could occur in other cases as well.
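For reference, a minimal sketch of the workaround implied above: build the same pivot without `cache()` and round-trip it through a saved table instead. The `build_pivot_no_cache` helper and the reuse of `sqlContext` and `df` from the reproduction are illustrative assumptions, not part of the original report.

{code}
import pyspark.sql.functions as func

# Hypothetical helper (not in the original report): same pivot as build_pivot,
# but without cache(), so no wide SpecificColumnarIterator is generated.
def build_pivot_no_cache(df, pivot_values, path):
    pivot = df.groupBy('customer').pivot("category", pivot_values).agg(func.sum(df.value))
    pivot.write.save(path, mode='overwrite')
    # Reading the saved result back is a crude stand-in for caching.
    return sqlContext.read.load(path)
{code}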