GitHub user xuchuanyin opened a pull request: https://github.com/apache/carbondata/pull/1632
[CARBONDATA-1839] [DataLoad]Fix bugs in compressing sort temp files

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [X] Any interfaces changed?
  `YES, ONLY CHANGE INTERNAL INTERFACES`
- [X] Any backward compatibility impacted?
  `NO`
- [X] Document update required?
  `YES`
- [X] Testing done
  Please provide details on
  - Whether new unit test cases have been added or why no new tests are required?
    `ADDED TESTS`
  - How it is tested? Please attach test report.
    `TESTED IN LOCAL CLUSTER`
  - Is it a performance related change? Please attach the performance test report.
    `YES`
  - Any additional information to help reviewers in testing this change.
    `There is some duplicate code in writing sort temp files, found while fixing this bug; I plan to optimize it in a subsequent PR, not in this one.`
- [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
  `NOT RELATED`

RESOLVE
===

1. Fix bugs in compressing sort temp file

2. Reduce duplicate code in reading & writing sort temp file and make it more readable

3. Optimize sort procedure:

Before:

```flow
st=>start: raw row that has been converted (call it 'RawRow' for short)
e=>end: write 'PartedRow' to DataFile in write procedure
op1=>operation: read RawRow from temp sort file
op2=>operation: sort on RawRow
op3=>operation: write RawRow to temp sort file
cond=>condition: final sort?
op4=>operation: sort on RawRow
op5=>operation: convert each RawRow to 3 'PartedRow'

st->op1->op2->op3->cond
cond(no)->op1
cond(yes)->op4->op5->e
```

After:

```flow
st=>start: raw row that has been converted (call it 'RawRow' for short)
e=>end: write 'PartedRow' to DataFile in write procedure
op1=>operation: convert RawRow to 3 'PartedRow'
op2=>operation: read PartedRow from temp sort file
op3=>operation: sort on PartedRow
op4=>operation: write PartedRow to temp sort file
cond=>condition: final sort?
op5=>operation: sort on PartedRow

st->op1->op2->op3->op4->cond
cond(no)->op2
cond(yes)->op5->e
```

4. Add tests that enable sort_temp_file_compressed during data loading

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata bug_sort_temp_compress_1207

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1632.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1632

----
commit fb46e1288ae3150700a6508298f1ec9dcc8d37c2
Author: xuchuanyin <xuchuan...@huawei.com>
Date:   2017-12-07T08:31:58Z

    Fix bugs in compressing sort temp file

    1. fix bugs in compressing sort temp file

    2. reduce duplicate code in reading & writing sort temp file and make it more readable

    3. optimize sort procedure:

    Before: raw row that has been converted (call it 'RawRow' for short) ->
    sort on RawRow -> write RawRow to temp sort file ->
    read RawRow from temp sort file -> sort on RawRow -> ... ->
    at the final sort, sort on RawRow and convert the RawRow to 3 'PartedRow' ->
    write 'PartedRow' to DataFile in write procedure.

    After: raw row that has been converted (call it 'RawRow' for short) ->
    convert RawRow to 3 'PartedRow' -> sort on PartedRow ->
    write PartedRow to temp sort file -> read PartedRow from temp sort file ->
    sort on PartedRow -> ... -> at the final sort, sort on PartedRow ->
    write 'PartedRow' to DataFile in write procedure.

    4. add tests

----

---
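
For readers who want a concrete picture of the reordered sort procedure in item 3, here is a rough Java sketch. It is not code from this PR. `SortTempFileSketch`, `PartedRow`, `convert`, `writeSortedRun`, `readRun` and `finalSort` are all hypothetical names. The sketch assumes that a 'PartedRow' is a row already split into three parts, and it uses the JDK's gzip streams as a stand-in for whatever compressor the sort temp files actually use.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; none of these names come from CarbonData.
final class SortTempFileSketch {

    // Assumed shape of a 'PartedRow': the row already split into three parts.
    static final class PartedRow {
        final int sortKey;       // part 1: the column the rows are sorted on
        final String noDictDim;  // part 2: a plain string dimension
        final double measure;    // part 3: a measure value

        PartedRow(int sortKey, String noDictDim, double measure) {
            this.sortKey = sortKey;
            this.noDictDim = noDictDim;
            this.measure = measure;
        }
    }

    // Convert once, before the first sort, so every later pass handles PartedRow only.
    static PartedRow convert(Object[] rawRow) {
        return new PartedRow((Integer) rawRow[0], (String) rawRow[1], (Double) rawRow[2]);
    }

    // Sort one in-memory batch and write it to a gzip-compressed temp file.
    static Path writeSortedRun(List<PartedRow> batch) {
        List<PartedRow> sorted = new ArrayList<>(batch);
        sorted.sort(Comparator.comparingInt((PartedRow r) -> r.sortKey));
        try {
            Path file = Files.createTempFile("sorttemp", ".gz");
            file.toFile().deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new GZIPOutputStream(Files.newOutputStream(file)))) {
                out.writeInt(sorted.size());
                for (PartedRow r : sorted) {
                    out.writeInt(r.sortKey);
                    out.writeUTF(r.noDictDim);
                    out.writeDouble(r.measure);
                }
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a compressed run back; the rows come back as PartedRow, no conversion needed.
    static List<PartedRow> readRun(Path file) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(Files.newInputStream(file)))) {
            int count = in.readInt();
            List<PartedRow> rows = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                rows.add(new PartedRow(in.readInt(), in.readUTF(), in.readDouble()));
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Final sort: k-way merge of the sorted runs. For brevity each run is loaded
    // fully into memory; a real merge would stream from the files instead.
    static List<PartedRow> finalSort(List<Path> runFiles) {
        List<List<PartedRow>> runs = new ArrayList<>();
        for (Path f : runFiles) {
            runs.add(readRun(f));
        }
        int[] next = new int[runs.size()];
        // Heap entries are {sortKey of the run's current row, run index}.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {runs.get(i).get(0).sortKey, i});
            }
        }
        List<PartedRow> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int run = heap.poll()[1];
            List<PartedRow> rows = runs.get(run);
            merged.add(rows.get(next[run]++));
            if (next[run] < rows.size()) {
                heap.add(new int[] {rows.get(next[run]).sortKey, run});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Path run1 = writeSortedRun(Arrays.asList(
                convert(new Object[] {3, "c", 3.0}), convert(new Object[] {1, "a", 1.0})));
        Path run2 = writeSortedRun(Arrays.asList(
                convert(new Object[] {4, "d", 4.0}), convert(new Object[] {2, "b", 2.0})));
        for (PartedRow r : finalSort(Arrays.asList(run1, run2))) {
            System.out.println(r.sortKey + " " + r.noDictDim + " " + r.measure);
        }
    }
}
```

In this sketch `convert` is called once per row before the first sort, so the write, read and merge steps share a single row format. That matches the "After" flow, where the conversion no longer happens at the final sort.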