[
https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ben updated SPARK-22284:
------------------------
Attachment: 64KB Error.log
Sure, I just added it as an attachment.
> Code of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-22284
> URL: https://issues.apache.org/jira/browse/SPARK-22284
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, PySpark, SQL
> Affects Versions: 2.1.0
> Reporter: Ben
> Attachments: 64KB Error.log
>
>
> I am using PySpark 2.1.0 in a production environment and am trying to join two
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as
> direct columns.
> Finally, I join on these three columns with the smaller DataFrame.
> A condensed version of the code looks like this:
> {code}
> dataFrame.read......cache()
> dataFrameSmall.read.......cache()
> dataFrame = dataFrame.selectExpr(['*', 'nested.Value1 AS Value1',
>     'nested.Value2 AS Value2', 'nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall,
>     ['Value1', 'Value2', 'Value3'])
> dataFrame.count()
> {code}
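
The flatten-then-join steps described above can be sketched as a pair of small helpers. This is only an illustration of the pattern in the report, not the reporter's actual job: the function names `flatten_exprs` and `flatten_and_join` are hypothetical, and the DataFrames are assumed to be already loaded and cached elsewhere.

```python
def flatten_exprs(struct_col, fields):
    """Build selectExpr strings that keep every column ('*') and also
    expose the given struct fields as top-level columns."""
    return ["*"] + ["{0}.{1} AS {1}".format(struct_col, f) for f in fields]


def flatten_and_join(large_df, small_df, struct_col, fields):
    """Lift the nested fields out of `struct_col`, drop duplicate rows,
    and join with `small_df` on the lifted columns.

    `large_df` and `small_df` are assumed to be PySpark DataFrames.
    """
    flat = large_df.selectExpr(*flatten_exprs(struct_col, fields))
    return flat.dropDuplicates().join(small_df, list(fields))
```

With the names from the report, the call would be `flatten_and_join(dataFrame, dataFrameSmall, 'nested', ['Value1', 'Value2', 'Value3'])`, followed by `.count()` to trigger the job.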
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0
> (TID 11234, somehost.com, executor 10):
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method
> "apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
> of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution.
> Most of the fixes landed in releases up to Spark 2.1.0, so I don't know whether
> running it on Spark 2.2.0 would fix it. In any case, I cannot change the Spark
> version since it is in production.
> I have also tried setting
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
> but I still get the same error.
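
For reference, one way to apply the setting mentioned above at runtime and confirm that it took effect is sketched below. It assumes `spark` is an existing `SparkSession`; the helper name `disable_whole_stage_codegen` is hypothetical and not taken from the report.

```python
def disable_whole_stage_codegen(spark):
    """Set spark.sql.codegen.wholeStage to false on an existing session
    and return the value the session now reports for that key.

    `spark` is assumed to be a pyspark.sql.SparkSession.
    """
    key = "spark.sql.codegen.wholeStage"
    spark.conf.set(key, "false")
    # Read the value back so the caller can verify the session picked it up.
    return spark.conf.get(key)
```

Reading the value back only shows that the session accepted the setting; it says nothing about whether the setting influences the failing generated class.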
> The job worked well up to now, also with large datasets, but apparently this
> batch got larger, and that is the only thing that changed. Is there any
> workaround for this?