[ https://issues.apache.org/jira/browse/CARBONDATA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
xuchuanyin resolved CARBONDATA-1373.
------------------------------------
    Resolution: Fixed

> Enhance update performance in carbondata
> ----------------------------------------
>
> Key: CARBONDATA-1373
> URL: https://issues.apache.org/jira/browse/CARBONDATA-1373
> Project: CarbonData
> Issue Type: Improvement
> Components: data-load
> Reporter: xuchuanyin
> Assignee: xuchuanyin
> Fix For: 1.2.0
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> # Scenario
> Recently I tested the update feature provided in Carbondata and found that
> it performs poorly.
> I had a table containing about 14 million records with about 370 columns (no
> dictionary columns); the data files are about 3.8 GB in total. All the
> data files were in one segment.
> I ran an update statement that updates one column for all the records. The
> SQL looked like `UPDATE myTable SET (col1)=(col1+1000) WHERE TRUE`. In my
> environment, the update job failed with 'executor lost' errors, and I found
> 'spill data' related messages in the container logs.
>
> # Analysis
> I've read about the implementation of update-delete in Carbondata in
> ISSUE#440. The update consists of a delete and an insert operation, and the
> error occurred during the insert operation.
> After studying the code, I found that during the insert the updated
> records are grouped by the `segmentId`, which means all the records in one
> segment are processed by a single task. This causes the task to fail when
> the amount of input data is large.
>
> # Solution
> We should improve the parallelism when doing an update for a segment.
> I append a random key to the `segmentId` to increase the number of partitions
> before the insertion stage, and then remove the suffix when doing the
> real insertion.
> I tested this with my example and the job finished successfully in about
> 13 minutes. The records were updated as expected.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
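
The idea in the Solution section (append a random key to the `segmentId` before grouping, strip it again before the real insertion) can be illustrated with a minimal sketch in plain Python. This is not the CarbonData code: the `#` separator, the helper names `salt_key`, `unsalt_key` and `group_by_salted_segment`, and the choice of 8 salt values are all assumptions made for illustration, and each group here only stands in for one Spark task.

```python
import random
from collections import defaultdict

SEPARATOR = "#"  # hypothetical delimiter; assumed never to occur in a segment id


def salt_key(segment_id, num_salts, rng):
    """Append a random suffix so rows of one segment spread over several groups."""
    return f"{segment_id}{SEPARATOR}{rng.randrange(num_salts)}"


def unsalt_key(salted_key):
    """Strip the suffix to recover the original segment id before insertion."""
    return salted_key.rsplit(SEPARATOR, 1)[0]


def group_by_salted_segment(records, num_salts, seed=0):
    """Group (segment_id, row) pairs by salted key; each group models one task."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for segment_id, row in records:
        groups[salt_key(segment_id, num_salts, rng)].append(row)
    return groups


# All rows live in a single segment, as in the scenario described above.
records = [("0", {"col1": i + 1000}) for i in range(10_000)]
groups = group_by_salted_segment(records, num_salts=8)

# Each group writes its rows back under the original (unsalted) segment id.
rows_per_segment = defaultdict(int)
for salted_key, rows in groups.items():
    rows_per_segment[unsalt_key(salted_key)] += len(rows)
```

Grouping by `segmentId` alone would put all 10,000 rows into one group. With the salted key they are split across at most `num_salts` groups, and every row still ends up attributed to segment `0` once the suffix is removed.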