GitHub user rahulforallp opened a pull request: https://github.com/apache/carbondata/pull/2434
[CARBONDATA-2625] Optimize the performance of CarbonReader read many files

Ref: https://github.com/apache/carbondata/pull/2391

About the issue: reading more than 10 million rows across 140 files times out with no result within 8 minutes. Even though writing 200000 rows per CarbonWriter reduces the number of index and data files when the total row count is 13000000, with 1 billion rows or more the number of files is still large.

I checked the code and found that reading the 140 files can be optimized in three places:

1. In cache.getAll, there are more than 140 IO operations for 140 carbon files; in fact there are more than 70 * 140, which is slow and can be optimized.
2. There are some duplicate operations in getDataMaps that can be optimized.
3. The SDK spends a long time creating multiple CarbonRecordReaders: testing 150 files and 15 million rows takes more than 8 minutes because more than 16 CarbonRecordReaders are created on an 8-core machine. This can be optimized.

By optimizing these three points (cache.getAll, getDataMaps and CarbonRecordReader creation), the SDK can now read 150 files and 15 million rows well within 8 minutes: about 340 seconds in testing. (A sketch of the SDK read path involved is appended after the commit list.)

One case: 150 files, each file has 200000 rows, total rows is 15000000
Finished write data time: 449.102 s
Finished build reader time: 192.596 s
Read first row time: 192.597 s, including build reader
Read time: 341.556 s, including build reader

Another case: 15 files, each file has 2000000 rows, total rows is 15000000
Finished write data time: 286.907 s
Finished build reader time: 134.665 s
Read first row time: 134.666 s, including build reader
Finished read, the count of rows is: 15000000
Read time: 156.427 s, including build reader

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

 - Any interfaces changed? Yes, a new interface is added for optimizing performance.
 - Any backward compatibility impacted? NA
 - Document update required? No
 - Testing done: an example is added for it.
 - For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. No

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rahulforallp/incubator-carbondata xuboPRsynch2391

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2434.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2434

----

commit 28a0b0f40c45967e586d7a5e703dce3cfaa48c99
Author: xubo245 <601450868@...>
Date: 2018-06-21T04:25:27Z

    [CARBONDATA-2625] Optimize the performance of CarbonReader read many files

    optimize the build process, including cache.getAll, getDatamaps and create carbonRecordReader
    fix CI error
    add config to change the carbonreader thread number for SDKDetailQueryExecutor
    optimize
    optimize
    try to fix sdv error
    optimize
    optimize
    fix
    fix again
    optimize

commit 9d1c825768cce1ca7e5d0f0aa9eb354ef166e2c9
Author: xubo245 <xubo29@...>
Date: 2018-06-30T02:40:45Z

    optimize

commit 69210f8ac7e64ed8a5c6a0c0a586e0cf8fc95812
Author: xubo245 <xubo29@...>
Date: 2018-06-30T02:53:31Z

    remove unused import

commit ac3f70c081171eaab0163f6b89901117759d9fdf
Author: xubo245 <xubo29@...>
Date: 2018-06-30T08:33:52Z

    optimize

commit 9306daea8be158a6cdfed2387fd94100ceca13ca
Author: rahul <rahul.kumar@...>
Date: 2018-07-02T05:37:51Z

    removed unnecessary properties

----
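For context, this is a minimal sketch of the SDK read path that the build optimizations (cache.getAll, getDataMaps, CarbonRecordReader creation) affect. It is based on the public CarbonReader builder API; the table-name argument, builder options and exception signatures may vary between CarbonData versions, and the path is a placeholder.

    import java.io.IOException;
    import org.apache.carbondata.sdk.file.CarbonReader;

    public class ReadManyFilesSketch {
      public static void main(String[] args) throws IOException, InterruptedException {
        // Directory containing the many .carbondata/.carbonindex files written by the SDK.
        String path = "./testWriteFiles";

        // Building the reader is the expensive step this PR optimizes:
        // it loads the index files (cache.getAll), resolves the data maps
        // and creates the CarbonRecordReaders.
        CarbonReader reader = CarbonReader
            .builder(path, "_temp")
            .build();

        // Iterating the rows themselves is comparatively cheap.
        long count = 0;
        while (reader.hasNext()) {
          Object[] row = (Object[]) reader.readNextRow();
          count += (row == null) ? 0 : 1;
        }
        System.out.println("Finished read, the count of rows is: " + count);
        reader.close();
      }
    }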
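The first commit also mentions a config to change the CarbonReader thread number for SDKDetailQueryExecutor. Below is a hedged sketch of how such a property could be set through CarbonProperties; the key name used here is hypothetical and should be checked against the constant this PR actually adds.

    import org.apache.carbondata.core.util.CarbonProperties;

    public class ReaderThreadConfigSketch {
      public static void main(String[] args) {
        // Hypothetical property key -- the real key is defined by this PR;
        // the value here caps the reader thread number at 4.
        CarbonProperties.getInstance()
            .addProperty("carbon.sdk.reader.thread.count", "4");
      }
    }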