[ https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210089#comment-16210089 ]

Ben commented on SPARK-22284:
-----------------------------

[~hyukjin.kwon], thank you very much for your assistance, but I don't think I 
can reproduce it easily.
Everything had been working fine until now; it only got stuck on the last 
batch.

If I'm correct, you are also the developer of spark-xml, which is exactly what 
I use for reading the source files.
I read more than a million XML files per batch, and these files have complex 
nested structures, so a lot of data is parsed into a DataFrame.
In some cases the XML files have very deeply nested tags, so I'm not sure 
whether that may be the cause, which would link this issue to spark-xml rather 
than to the actual join step where the error happens.
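For context, the read step looks roughly like this (just a sketch; the rowTag 
value and the path are placeholders, not the actual production settings):
{code}
# Rough sketch of the read step; "record" and the path below are
# placeholders, not the actual production values.
dataFrame = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("/path/to/xml/batch/") \
    .cache()
{code}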

Furthermore, I tried splitting the batch where it got stuck into smaller 
batches to process. The first half of around 700 thousand files worked, but the 
second half hit the same error. Then I split the second half into batches of 
100 thousand files each, and that worked. I cannot make the XML files 
available, which means I can't give you a way to reproduce it. I can send you 
the generated code that is included in the error, but I'm guessing that 
doesn't help either. However, if you think the deeply nested levels may be a 
cause, maybe this can provide some insight.
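For completeness, the splitting workaround is essentially this (a sketch; 
list_input_files() and process_batch() are placeholders for the real listing 
and processing logic):
{code}
# Sketch of the splitting workaround; list_input_files() and
# process_batch() stand in for the real listing/processing logic.
paths = list_input_files()   # file paths of the stuck batch (over a million)
batch_size = 100000          # the batch size that eventually worked
for start in range(0, len(paths), batch_size):
    process_batch(paths[start:start + batch_size])  # read, join, count as in the original job
{code}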

> Code of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22284
>                 URL: https://issues.apache.org/jira/browse/SPARK-22284
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Ben
>
> I am using pySpark 2.1.0 in a production environment, and trying to join two 
> DataFrames, one of which is very large and has complex nested structures.
> Basically, I load both DataFrames and cache them.
> Then, in the large DataFrame, I extract 3 nested values and save them as 
> direct columns.
> Finally, I join with the smaller DataFrame on these three columns.
> In short, the code looks like this:
> {code}
> dataFrame = spark.read......cache()
> dataFrameSmall = spark.read.......cache()
> dataFrame = dataFrame.selectExpr(['*', 'nested.Value1 AS Value1', 'nested.Value2 AS Value2', 'nested.Value3 AS Value3'])
> dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, ['Value1', 'Value2', 'Value3'])
> dataFrame.count()
> {code}
> And this is the error I get when it gets to the count():
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in 
> stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 
> (TID 11234, somehost.com, executor 10): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
> of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB
> {code}
> I have seen many tickets with similar issues here, but no proper solution. 
> Most of the fixes target versions up to Spark 2.1.0, so I don't know whether 
> running it on Spark 2.2.0 would fix it. In any case, I cannot change the 
> Spark version since it is in production.
> I have also tried setting 
> {code:java}
> spark.sql.codegen.wholeStage=false
> {code}
> but I still get the same error.
> The job has worked well up to now, even with large datasets, but apparently 
> this batch got larger, and that is the only thing that changed. Is there any 
> workaround for this?


