Ankit

Can you try reducing the number of cores or increasing memory? With your current configuration, each core effectively gets only ~3.5 GB of execution memory (roughly 36g heap × 0.6 default unified-memory fraction ÷ 6 cores). Otherwise, your data may be skewed, such that one core is getting too much data for a particular key.
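If it is skew, a quick way to check is to count rows per key before touching the executor sizing. This is only a rough sketch in PySpark; "your_table" and "key_col" are placeholders for your actual input and group-by/window key, not names from your job:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("your_table")  # placeholder for your input data

    # Count rows per key; heavy skew shows up as a handful of keys
    # with counts far above the rest.
    key_counts = (
        df.groupBy("key_col")
          .count()
          .orderBy(F.desc("count"))
    )
    key_counts.show(20)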
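Also, per Chris's suggestion below about shuffle partitions, you could combine a higher partition count with repartitioning by the window keys. Again only a sketch, with placeholder column names ("key_col", "ts_col"):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    # Go higher than the 3811 tasks mentioned below.
    spark.conf.set("spark.sql.shuffle.partitions", "4000")

    df = spark.table("your_table")  # placeholder for your input data

    w = Window.partitionBy("key_col").orderBy("ts_col")
    result = (
        df.repartition(4000, "key_col")  # lay data out by the window key
          .withColumn("rn", F.row_number().over(w))
    )

One caveat: a window's partitionBy forces all rows for a key onto a single task, so if one key is much hotter than the rest, repartitioning alone won't remove the skew.

For reference, the executor sizing I was referring to above: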
spark.executor.cores 6
spark.executor.memory 36g

On Sat, Sep 7, 2019 at 6:35 AM Chris Teoh <chris.t...@gmail.com> wrote:

> It says you have 3811 tasks in earlier stages and you're going down to
> 2001 partitions; that would make it more memory intensive. I'm guessing
> the default Spark shuffle partition count was 200, so that would have
> failed. Go for a higher number, maybe even higher than 3811. What were
> your shuffle write from stage 7 and shuffle read from stage 8?
>
> On Sat, 7 Sep 2019, 7:57 pm Ankit Khettry, <justankit2...@gmail.com>
> wrote:
>
>> Still unable to overcome the error. Attaching some screenshots for
>> reference. The following are the configs used:
>>
>> spark.yarn.max.executor.failures 1000
>> spark.yarn.driver.memoryOverhead 6g
>> spark.executor.cores 6
>> spark.executor.memory 36g
>> spark.sql.shuffle.partitions 2001
>> spark.memory.offHeap.size 8g
>> spark.memory.offHeap.enabled true
>> spark.executor.instances 10
>> spark.driver.memory 14g
>> spark.yarn.executor.memoryOverhead 10g
>>
>> Best Regards
>> Ankit Khettry
>>
>> On Sat, Sep 7, 2019 at 2:56 PM Chris Teoh <chris.t...@gmail.com> wrote:
>>
>>> You can try that. Consider processing each partition separately if
>>> your data is heavily skewed when you partition it.
>>>
>>> On Sat, 7 Sep 2019, 7:19 pm Ankit Khettry, <justankit2...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Chris
>>>>
>>>> Going to try it soon, maybe by setting spark.sql.shuffle.partitions
>>>> to 2001. Also, I was wondering if it would help to repartition the
>>>> data by the fields I am using in the group-by and window operations?
>>>>
>>>> Best Regards
>>>> Ankit Khettry
>>>>
>>>> On Sat, 7 Sep, 2019, 1:05 PM Chris Teoh, <chris.t...@gmail.com> wrote:
>>>>
>>>>> Hi Ankit,
>>>>>
>>>>> Without looking at the Spark UI and the stages/DAG, I'm guessing
>>>>> you're running on the default number of Spark shuffle partitions.
>>>>>
>>>>> If you're seeing a lot of shuffle spill, you likely have to increase
>>>>> the number of shuffle partitions to accommodate the huge shuffle size.
>>>>>
>>>>> I hope that helps.
>>>>> Chris
>>>>>
>>>>> On Sat, 7 Sep 2019, 4:18 pm Ankit Khettry, <justankit2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Nope, it's a batch job.
>>>>>>
>>>>>> Best Regards
>>>>>> Ankit Khettry
>>>>>>
>>>>>> On Sat, 7 Sep, 2019, 6:52 AM Upasana Sharma, <028upasana...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is it a streaming job?
>>>>>>>
>>>>>>> On Sat, Sep 7, 2019, 5:04 AM Ankit Khettry <justankit2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a Spark job that consists of a large number of window
>>>>>>>> operations and hence involves large shuffles. I have roughly
>>>>>>>> 900 GiB of data, and I am using a fairly large cluster
>>>>>>>> (10 × m5.4xlarge instances). I am using the following
>>>>>>>> configurations for the job, although I have tried various other
>>>>>>>> combinations without any success.
>>>>>>>>
>>>>>>>> spark.yarn.driver.memoryOverhead 6g
>>>>>>>> spark.storage.memoryFraction 0.1
>>>>>>>> spark.executor.cores 6
>>>>>>>> spark.executor.memory 36g
>>>>>>>> spark.memory.offHeap.size 8g
>>>>>>>> spark.memory.offHeap.enabled true
>>>>>>>> spark.executor.instances 10
>>>>>>>> spark.driver.memory 14g
>>>>>>>> spark.yarn.executor.memoryOverhead 10g
>>>>>>>>
>>>>>>>> I keep running into the following OOM error:
>>>>>>>>
>>>>>>>> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire
>>>>>>>> 16384 bytes of memory, got 0
>>>>>>>>   at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
>>>>>>>>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
>>>>>>>>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:128)
>>>>>>>>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:163)
>>>>>>>>
>>>>>>>> I see there are a large number of JIRAs in place for similar
>>>>>>>> issues, and a great many of them are even marked resolved.
>>>>>>>> Can someone guide me as to how to approach this problem? I am
>>>>>>>> using Databricks Spark 2.4.1.
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>> Ankit Khettry