[ 
https://issues.apache.org/jira/browse/CARBONDATA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuchuanyin resolved CARBONDATA-1373.
------------------------------------
    Resolution: Fixed

> Enhance update performance in carbondata
> ----------------------------------------
>
>                 Key: CARBONDATA-1373
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1373
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: data-load
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>             Fix For: 1.2.0
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> # Scenario
> Recently I tested the update feature provided in CarbonData and found 
> that it performs poorly.
> I had a table containing about 14 million records with about 370 columns (no 
> dictionary columns), and the data files were about 3.8 GB in total. All the 
> data files were in one segment.
> I ran an update statement that updates one column for every record; the 
> SQL looked like `UPDATE myTable SET (col1)=(col1+1000) WHERE TRUE`. In my 
> environment, the update job failed with 'executor lost' errors, and I found 
> 'spill data' related messages in the container logs.
> # Analysis
> I've read about the implementation of update-delete in CarbonData in 
> ISSUE#440. An update consists of a delete operation and an insert 
> operation, and the error occurred during the insert operation.
> After studying the code, I found that during the insert the updated 
> records are grouped by `segmentId`, which means all the records in one 
> segment are processed by a single task. That task fails when the amount of 
> input data is large.
> # Solution
> We should improve the parallelism of an update within a segment.
> I append a random key to the `segmentId` to increase the number of 
> partitions before the insert stage, and then remove the suffix when doing 
> the actual insert.
> I tested this on my example and the job finished successfully in about 13 
> minutes. The records were updated as expected.
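
The salting idea described in the solution can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not the code from the actual patch: the `#` separator, the function names, and the `num_salts` parameter are all hypothetical, and the separator is assumed never to appear in a real segment id.

```python
import random
from collections import defaultdict

SEPARATOR = "#"  # hypothetical separator; assumed never to occur in a segment id


def salt_key(segment_id, num_salts, rng):
    """Append a random suffix in [0, num_salts) to the segment id."""
    return f"{segment_id}{SEPARATOR}{rng.randrange(num_salts)}"


def unsalt_key(salted_key):
    """Strip the random suffix to recover the original segment id."""
    return salted_key.rsplit(SEPARATOR, 1)[0]


def group_by_salted_segment(records, num_salts, seed=0):
    """Group (segment_id, row) pairs by salted key; each group models one task."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for segment_id, row in records:
        groups[salt_key(segment_id, num_salts, rng)].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Every row lives in segment "0", as in the scenario above.
    records = [("0", i) for i in range(1000)]
    groups = group_by_salted_segment(records, num_salts=8)
    print(len(groups), "groups instead of 1")
    # Removing the suffix recovers the original segment id for the real insert.
    print({unsalt_key(key) for key in groups})
```

Grouping by the bare `segmentId` would put all rows of the single segment into one group. With the salted key the same rows are spread over up to `num_salts` groups, and `unsalt_key` restores the segment id before the rows are written.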



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
