Spark: Update record in partition

2020-06-07 Thread Sunil Kalra
Hi All,

If I have to update a record in a partition using Spark, do I have to read
the whole partition, update the row, and overwrite the partition?

Is there a way to update only a single row, as in a DBMS? Otherwise a one-row
update takes a long time, since the whole partition has to be rewritten.
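
(For reference, the whole-partition rewrite described above looks roughly like the sketch below. The path, the dt partition column, and the id/status columns are all hypothetical; plain Parquet has no row-level update, so the affected partition is read, modified, and written back out, here to a staging location because Spark will not overwrite a path it is still reading from.)

// Minimal sketch, assuming a Parquet dataset at /data/events partitioned by `dt`,
// with hypothetical `id` and `status` columns.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("update-one-partition").getOrCreate()

// Read only the affected partition (partition pruning keeps this to one directory).
val part = spark.read.parquet("/data/events").where(col("dt") === "2020-06-01")

// Change the single row; all other rows pass through unchanged.
val updated = part.withColumn(
  "status",
  when(col("id") === 42, lit("CANCELLED")).otherwise(col("status"))
)

// Rewrite just this partition's data. Writing to a staging path avoids
// overwriting the path that is still being read; the staged files would then
// replace the original dt=2020-06-01 directory.
updated.drop("dt").write.mode("overwrite").parquet("/staging/events/dt=2020-06-01")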

Thanks
Sunil


Re: OOM Error

2019-09-07 Thread Sunil Kalra
Ankit

Can you try reducing the number of cores or increasing memory? With the
configuration below, each core gets only ~3.5 GB. Otherwise your data is
skewed, and one of the cores is getting too much data for a particular key.

spark.executor.cores 6 spark.executor.memory 36g
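
(For reference, the ~3.5 GB figure follows from Spark's unified memory model: roughly executor memory minus ~300 MB reserved, times spark.memory.fraction (0.6 by default), split across the executor's cores. A rough back-of-the-envelope sketch, assuming the defaults:)

// Rough per-core share of execution/storage memory under the unified memory model.
// Assumes the defaults: spark.memory.fraction = 0.6 and ~300 MB reserved memory.
val executorMemoryGb = 36.0
val reservedGb       = 0.3
val memoryFraction   = 0.6
val cores            = 6

val usableGb  = (executorMemoryGb - reservedGb) * memoryFraction  // ~21.4 GB
val perCoreGb = usableGb / cores                                  // ~3.6 GB per concurrent task
println(f"usable: $usableGb%.1f GB, per core: $perCoreGb%.1f GB")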

On Sat, Sep 7, 2019 at 6:35 AM Chris Teoh  wrote:

> It says you have 3811 tasks in earlier stages and you're going down to
> 2001 partitions; that would make it more memory intensive. I'm guessing the
> default Spark shuffle partition count was 200, so that would have failed. Go
> for a higher number, maybe even higher than 3811. What was your shuffle write
> from stage 7 and shuffle read from stage 8?
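
(The concrete change being suggested here is the single setting below; the value of 4000 is only an illustrative guess chosen to exceed the ~3811 upstream tasks, and would need tuning against the actual shuffle size.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oom-job").getOrCreate()

// Hypothetical value: above the ~3811 upstream tasks, so each shuffle
// partition stays small enough to fit in per-task memory.
spark.conf.set("spark.sql.shuffle.partitions", "4000")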
>
> On Sat, 7 Sep 2019, 7:57 pm Ankit Khettry, 
> wrote:
>
>> Still unable to overcome the error. Attaching some screenshots for
>> reference.
>> Following are the configs used:
>> spark.yarn.max.executor.failures 1000
>> spark.yarn.driver.memoryOverhead 6g
>> spark.executor.cores 6
>> spark.executor.memory 36g
>> spark.sql.shuffle.partitions 2001
>> spark.memory.offHeap.size 8g
>> spark.memory.offHeap.enabled true
>> spark.executor.instances 10
>> spark.driver.memory 14g
>> spark.yarn.executor.memoryOverhead 10g
>>
>> Best Regards
>> Ankit Khettry
>>
>> On Sat, Sep 7, 2019 at 2:56 PM Chris Teoh  wrote:
>>
>>> You can try that. Consider processing each partition separately if your
>>> data is heavily skewed when you partition it.
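
(A rough illustration of the per-partition approach, assuming the data is partitioned by a hypothetical dt column; the input/output paths and the aggregation are placeholders only.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("per-partition-processing").getOrCreate()
import spark.implicits._

// Enumerate the partition values, then run the heavy work one slice at a time,
// so no single job has to shuffle the full dataset at once.
val dates = spark.read.parquet("/data/input").select("dt").distinct().as[String].collect()

dates.foreach { dt =>
  val slice = spark.read.parquet("/data/input").where(col("dt") === dt)
  slice.groupBy("userId").count()                      // placeholder aggregation
    .write.mode("overwrite").parquet(s"/data/output/dt=$dt")
}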
>>>
>>> On Sat, 7 Sep 2019, 7:19 pm Ankit Khettry, 
>>> wrote:
>>>
 Thanks Chris

Going to try it soon, maybe by setting spark.sql.shuffle.partitions to 2001.
Also, I was wondering whether it would help to repartition the data by the
fields I am using in the group by and window operations?
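
(What that repartition might look like, with hypothetical userId/eventTime columns standing in for the real keys; pre-partitioning by the same key the window uses lets the window reuse that distribution instead of shuffling again at the default partition count.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("repartition-before-window").getOrCreate()

val df = spark.read.parquet("/data/input")   // hypothetical input path

// Spread the shuffle over many partitions, keyed by the window's partition column.
val repartitioned = df.repartition(4000, col("userId"))

// The window needs the data clustered by userId, which the explicit repartition
// already provides, so this step does not trigger a second shuffle.
val w = Window.partitionBy("userId").orderBy("eventTime")
val result = repartitioned.withColumn("rowNum", row_number().over(w))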

 Best Regards
 Ankit Khettry

 On Sat, 7 Sep, 2019, 1:05 PM Chris Teoh,  wrote:

> Hi Ankit,
>
> Without looking at the Spark UI and the stages/DAG, I'm guessing
> you're running with the default number of Spark shuffle partitions.
>
> If you're seeing a lot of shuffle spill, you likely have to increase
> the number of shuffle partitions to accommodate the huge shuffle size.
>
> I hope that helps
> Chris
>
> On Sat, 7 Sep 2019, 4:18 pm Ankit Khettry, 
> wrote:
>
>> Nope, it's a batch job.
>>
>> Best Regards
>> Ankit Khettry
>>
>> On Sat, 7 Sep, 2019, 6:52 AM Upasana Sharma, <028upasana...@gmail.com>
>> wrote:
>>
>>> Is it a streaming job?
>>>
>>> On Sat, Sep 7, 2019, 5:04 AM Ankit Khettry 
>>> wrote:
>>>
I have a Spark job that consists of a large number of Window operations and
hence involves large shuffles. I have roughly 900 GiB of data, although I am
using what should be a large enough cluster (10 * m5.4xlarge instances). I am
using the following configurations for the job, although I have tried various
other combinations without any success.

 spark.yarn.driver.memoryOverhead 6g
 spark.storage.memoryFraction 0.1
 spark.executor.cores 6
 spark.executor.memory 36g
 spark.memory.offHeap.size 8g
 spark.memory.offHeap.enabled true
 spark.executor.instances 10
 spark.driver.memory 14g
 spark.yarn.executor.memoryOverhead 10g

 I keep running into the following OOM error:

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
  at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
  at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:128)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:163)

 I see there are a large number of JIRAs in place for similar issues
 and a great many of them are even marked resolved.
 Can someone guide me as to how to approach this problem? I am using
 Databricks Spark 2.4.1.

 Best Regards
 Ankit Khettry

>>>