GitHub user rahulforallp opened a pull request:

    https://github.com/apache/carbondata/pull/2434

    [CARBONDATA-2625] Optimize the performance of CarbonReader read many files

    
    Ref: https://github.com/apache/carbondata/pull/2391
    
    About the issue: reading more than 10 million rows spread across 140 files times out with no result after 8 minutes. Increasing the rows per CarbonWriter to 200000 does reduce the number of index and data files when the row count is 13000000, but with 1 billion rows or more the number of files is still large. Checking the code shows that reading the 140 files can be optimized in three ways:
    
    First, in cache.getAll one might expect about 140 IO operations for 140 carbon files, but in fact there are more than 70 * 140 IO operations; this is slow and can be optimized.
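The batching idea behind the cache.getAll fix can be sketched with a generic cache (the class and method names below are hypothetical stand-ins, not the actual CarbonData Cache interface): collect all keys missing from the cache and resolve them with one bulk load, instead of issuing one IO operation per key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of a batched getAll: one bulk IO for all
// missing keys instead of one IO operation per key.
public class BatchedCache {
    private final Map<String, String> cache = new HashMap<>();
    private int ioCalls = 0; // counts bulk load operations, for illustration

    // Simulates a single bulk IO that loads many entries at once.
    private Map<String, String> bulkLoad(List<String> keys) {
        ioCalls++;
        return keys.stream().collect(Collectors.toMap(k -> k, k -> "meta-" + k));
    }

    public Map<String, String> getAll(List<String> keys) {
        List<String> missing = keys.stream()
                .filter(k -> !cache.containsKey(k))
                .collect(Collectors.toList());
        if (!missing.isEmpty()) {
            cache.putAll(bulkLoad(missing)); // single batched IO
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String k : keys) {
            result.put(k, cache.get(k));
        }
        return result;
    }

    public int getIoCalls() {
        return ioCalls;
    }

    public static void main(String[] args) {
        BatchedCache c = new BatchedCache();
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 140; i++) {
            files.add("part-" + i + ".carbonindex");
        }
        c.getAll(files);
        // All 140 index files resolved with a single bulk load;
        // a repeated getAll is served entirely from the cache.
        c.getAll(files);
        System.out.println("ioCalls=" + c.getIoCalls());
    }
}
```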
    
    Second, there are some duplicate operations in getDataMaps that can be eliminated.
    
    Third, the SDK spends a long time creating multiple CarbonRecordReaders: in a test with 150 files and 15 million rows, creating more than 16 CarbonRecordReaders on an 8-core machine took more than 8 minutes. This can also be optimized.
    
    By optimizing these three points (cache.getAll, getDataMaps and CarbonRecordReader creation), the SDK can now read 150 files with 15 million rows well within 8 minutes: about 340 seconds in testing.
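The third point, building many record readers concurrently instead of one by one, can be sketched with a fixed thread pool (Reader and buildReader below are hypothetical stand-ins for the real CarbonRecordReader construction, which this PR parallelizes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: submit each expensive per-file reader build
// to a fixed-size pool, then collect the results in order.
public class ParallelReaderBuild {

    static class Reader {
        final String file;
        Reader(String file) { this.file = file; }
    }

    // Stand-in for the expensive per-file reader construction.
    static Reader buildReader(String file) {
        return new Reader(file);
    }

    public static List<Reader> buildAll(List<String> files, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Reader>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> buildReader(f)));
            }
            List<Reader> readers = new ArrayList<>();
            for (Future<Reader> fu : futures) {
                readers.add(fu.get()); // preserves submission order
            }
            return readers;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 150; i++) {
            files.add("part-" + i + ".carbondata");
        }
        // Pool size would typically follow the core count, e.g. 8 on 8 cores;
        // the PR also adds a config to tune the reader thread number.
        List<Reader> readers = buildAll(files, 8);
        System.out.println("built " + readers.size() + " readers");
    }
}
```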
    
    One case: 150 files , each file has 200000 rows, total rows is 15000000
    Finished write data time: 449.102 s
    Finished build reader time:192.596 s
    Read first row time: 192.597 s, including build reader
    Read time: 341.556 s, including build reader
    
    Another case: 15 files , each file has 2000000 rows, total rows is 15000000
    Finished write data time: 286.907 s
    Finished build reader time: 134.665 s
    Read first row time: 134.666 s, including build reader
    Finished read, the count of rows is:15000000
    Read time: 156.427 s, including build reader
    
    Be sure to do all of the following checklist to help us incorporate
    your contribution quickly and easily:
    
        Any interfaces changed?
        Yes, a new one is added for optimizing performance
    
        Any backward compatibility impacted?
        NA
    
        Document update required?
        NO
    
        Testing done
        An example is added for it
    
        For large changes, please consider breaking it into sub-tasks under an 
umbrella JIRA.
        NO


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rahulforallp/incubator-carbondata 
xuboPRsynch2391

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2434.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2434
    
----
commit 28a0b0f40c45967e586d7a5e703dce3cfaa48c99
Author: xubo245 <601450868@...>
Date:   2018-06-21T04:25:27Z

    [CARBONDATA-2625] Optimize the performance of CarbonReader read many files
    
    optimize the build process, including cache.getAll, getDatamaps and create 
carbonRecordReader
    
    fix CI error
    
    add config to change the carbonreader thread number for 
SDKDetailQueryExecutor
    
    optimize
    
    optimize
    
    try to fix sdv error
    
    optimize
    
    optimize
    
    fix
    
    fix again
    
    optimize

commit 9d1c825768cce1ca7e5d0f0aa9eb354ef166e2c9
Author: xubo245 <xubo29@...>
Date:   2018-06-30T02:40:45Z

    optimize

commit 69210f8ac7e64ed8a5c6a0c0a586e0cf8fc95812
Author: xubo245 <xubo29@...>
Date:   2018-06-30T02:53:31Z

    remove unused import

commit ac3f70c081171eaab0163f6b89901117759d9fdf
Author: xubo245 <xubo29@...>
Date:   2018-06-30T08:33:52Z

    optimize

commit 9306daea8be158a6cdfed2387fd94100ceca13ca
Author: rahul <rahul.kumar@...>
Date:   2018-07-02T05:37:51Z

    removed unnecessary properties

----

