Matthew Livesey created SPARK-16191:
---------------------------------------

             Summary: Code-Generated SpecificColumnarIterator fails for wide 
pivot with caching
                 Key: SPARK-16191
                 URL: https://issues.apache.org/jira/browse/SPARK-16191
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Matthew Livesey


When caching a pivot of more than 2260 columns, the SpecificColumnarIterator 
instance produced by code generation fails to compile with:

bq. failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of 
method "()Z" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator" 
grows beyond 64 KB

This can be reproduced in PySpark with the following (it took some trial and 
error to find that 2261 is the smallest width at which the generated class 
exceeds the 64KB limit):

{code}
def build_pivot(width):
    # Build a small dataset with `width` distinct categories and 10 customers.
    categories = ["cat_%s" % i for i in range(0, width)]
    customers = ["cust_%s" % i for i in range(0, 10)]
    rows = []
    for cust in customers:
        for cat in categories:
            for i in range(0, 4):
                row = (cust, cat, i, 7.0)
                rows.append(row)
    rdd = sc.parallelize(rows)
    df = sqlContext.createDataFrame(rdd, ["customer", "category", "instance", "value"])
    # Collect the distinct pivot values explicitly so the pivot width is deterministic.
    pivot_value_rows = df.select("category").distinct().orderBy("category").collect()
    pivot_values = [r.category for r in pivot_value_rows]
    import pyspark.sql.functions as func
    # cache() is what triggers generation of SpecificColumnarIterator.
    pivot = df.groupBy('customer').pivot("category", pivot_values).agg(func.sum(df.value)).cache()
    pivot.write.save('my_pivot', mode='overwrite')

for i in [2260, 2261]:
    try:
        build_pivot(i)
        print "Succeeded for %s" % i
    except:
        print "Failed for %s" % i
{code}

Removing the `cache()` call avoids the problem and allows wider pivots; since 
SpecificColumnarIterator is generated specifically to support caching, it is 
not generated at all when caching is not used.
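As a workaround until the generated-code size issue is addressed, the same wide pivot can be computed and written without caching. The sketch below is hypothetical (not part of the original report); it assumes the same {{sc}} / {{sqlContext}} shell globals as the repro above, and the function name `build_pivot_uncached` is invented for illustration:

{code}
import pyspark.sql.functions as func

def build_pivot_uncached(width):
    # Same data shape as the repro above.
    categories = ["cat_%s" % i for i in range(0, width)]
    customers = ["cust_%s" % i for i in range(0, 10)]
    rows = []
    for cust in customers:
        for cat in categories:
            for instance in range(0, 4):
                rows.append((cust, cat, instance, 7.0))
    df = sqlContext.createDataFrame(sc.parallelize(rows),
                                    ["customer", "category", "instance", "value"])
    pivot_values = [r.category for r in
                    df.select("category").distinct().orderBy("category").collect()]
    # No .cache() here, so SpecificColumnarIterator is never generated
    # and the wide pivot can be written directly.
    pivot = df.groupBy("customer").pivot("category", pivot_values) \
              .agg(func.sum(df.value))
    pivot.write.save("my_pivot_uncached", mode="overwrite")
{code}

The trade-off is that the pivot is recomputed on each action instead of being served from the in-memory columnar cache.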

This could be symptomatic of a general problem: generated code can exceed the 
JVM's 64KB-per-method bytecode limit, so similar failures may occur in other 
cases as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
