[ https://issues.apache.org/jira/browse/SPARK-44679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774963#comment-17774963 ]

huangsheng edited comment on SPARK-44679 at 10/13/23 4:07 PM:
--------------------------------------------------------------

I think your problem is the same as SPARK-40622.

We have seen the same error before, and the exception was thrown while the 
executor was serializing the result data.
{code:java}
2023-06-09T14:59:02,752 ERROR [Executor task launch worker for task 9.0 in 
stage 22.0 (TID 218)] executor.Executor : Exception in task 9.0 in stage 22.0 
(TID 218)
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_25]
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) 
~[?:1.8.0_25]
        at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
~[?:1.8.0_25]
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) 
~[?:1.8.0_25]
        at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 ~[spark-core]
        at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
 ~[?:1.8.0_25]
        at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
 ~[?:1.8.0_25]
        at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) 
~[?:1.8.0_25]
        at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) 
~[?:1.8.0_25]
        at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
~[?:1.8.0_25]
        at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
~[?:1.8.0_25]
        at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
~[?:1.8.0_25]
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) 
~[?:1.8.0_25]
        at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
 ~[spark-core]
        at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
 ~[spark-core]
        at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) 
~[spark-core]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
~[?:1.8.0_25]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
~[?:1.8.0_25]
        at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25] {code}
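For context, the message means the JVM refused an array allocation, not that the heap ran out: ByteArrayOutputStream.grow eventually asks Arrays.copyOf for a buffer near Integer.MAX_VALUE bytes once the serialized data no longer fits in roughly 2 GB, and HotSpot caps array lengths just below that value. Below is a minimal, self-contained sketch of that cap (my own demo class, not Spark code; the exact limit can vary slightly by JVM):
{code:java}
public class ArraySizeLimitDemo {
    public static void main(String[] args) {
        // On most HotSpot JVMs the maximum array length is slightly below
        // Integer.MAX_VALUE, so this allocation fails with the same error as
        // in the stack trace above, regardless of how much heap is configured:
        //   java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        byte[] buffer = new byte[Integer.MAX_VALUE];
        System.out.println(buffer.length);
    }
}
{code}
This is why the failure shows up during result serialization: once a single task result pushed through ByteArrayOutputStream approaches ~2 GB, the stream has to request such an array.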
 

The version we were using at the time was 3.2.x. Since we could not upgrade to 
Spark 3.4 then, we decided to cherry-pick the 
[commit|https://github.com/apache/spark/pull/38064/files#diff-d7a989c491f3cb77cca02c701496a9e2a3443f70af73b0d1ab0899239f3a789d]
 from the PR for SPARK-40622 into our own Spark branch. We then retested the 
same scenario and found that the exception no longer occurred.
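Independent of the exact change in that PR, the general way to stay under the JVM array cap when buffering a large serialized output is to spread the bytes across several smaller arrays rather than one contiguous byte[]. The sketch below only illustrates that chunking idea; the class and field names are mine and it is not the code from PR #38064:
{code:java}
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: an OutputStream that buffers data in fixed-size chunks
// instead of one contiguous byte[], so no single allocation ever approaches
// the JVM's array-size limit.
public class ChunkedByteOutputStream extends OutputStream {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private int posInLastChunk;

    public ChunkedByteOutputStream(int chunkSize) {
        this.chunkSize = chunkSize;
        chunks.add(new byte[chunkSize]);
    }

    @Override
    public void write(int b) {
        if (posInLastChunk == chunkSize) {   // current chunk is full
            chunks.add(new byte[chunkSize]); // start a new chunk
            posInLastChunk = 0;
        }
        chunks.get(chunks.size() - 1)[posInLastChunk++] = (byte) b;
    }

    // Total bytes written; can exceed Integer.MAX_VALUE without any huge array.
    public long size() {
        return (long) (chunks.size() - 1) * chunkSize + posInLastChunk;
    }
}
{code}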



> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> -----------------------------------------------------------------
>
>                 Key: SPARK-44679
>                 URL: https://issues.apache.org/jira/browse/SPARK-44679
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, PySpark
>    Affects Versions: 3.2.1
>         Environment: We use Amazon EMR to run Pyspark jobs.
> Amazon EMR version : emr-6.7.0
> Installed applications : 
> Tez 0.9.2, Spark 3.2.1, Hive 3.1.3, Sqoop 1.4.7, Hadoop 3.2.1, Zookeeper 
> 3.5.7, HCatalog 3.1.3, Livy 0.7.1
>            Reporter: Haitham Eltaweel
>            Priority: Major
>         Attachments: code_sample.txt
>
>
> We get the following error from our PySpark application in the production
> environment:
> _java.lang.OutOfMemoryError: Requested array size exceeds VM limit_
> I simplified the code we used and shared it below so you can easily
> investigate the issue.
> We use PySpark to read a 900 MB text file that contains a single record. We
> use the foreach function to iterate over the DataFrame and apply a
> higher-order function. The error occurs once the foreach action is
> triggered. I think the issue is related to the integer-typed size of the
> byte array used to hold the serialized DataFrame: since the record was so
> big, the serialized result seems to have exceeded the maximum integer
> value, hence the error. Note that the same error happens when using the
> foreachBatch function with writeStream.
> Our production data has many records larger than 100 MB. We would
> appreciate your help with a fix or a workaround for this issue.
>  
> *Find below the code snippet:*
> from pyspark.sql import SparkSession, functions as f
>
> def check_file_name(row):
>     print("check_file_name called")
>
> def main():
>     spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>     inputPath = "s3://bucket-name/common/source/"
>     inputDF = spark.read.text(inputPath, wholetext=True)
>     inputDF = inputDF.select(
>         f.date_format(f.current_timestamp(), 'yyyyMMddHH').astype('string').alias('insert_hr'),
>         f.col("value").alias("raw_data"),
>         f.input_file_name().alias("input_file_name"))
>     inputDF.foreach(check_file_name)
>
> if __name__ == "__main__":
>     main()
> *Find below the spark-submit command used:*
> spark-submit --master yarn --conf 
> spark.serializer=org.apache.spark.serializer.KryoSerializer  --num-executors 
> 15 --executor-cores 4 --executor-memory 20g --driver-memory 20g --name 
> haitham_job --deploy-mode cluster big_file_process.py



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
