Hi dev:

Recently I compared full-scan performance between Parquet and CarbonData, and
found that CarbonData's full scan was slower than Parquet's.

My test:
  1. Spark 2.2 + Parquet vs. Spark 2.2 + CarbonData (master branch);
  2. Run in local[1] mode;
  3. There are 8 Parquet files in one folder, 47474456 records in total; each
     file is about 170 MB;
  4. There are 8 segments in one CarbonData table, 47474456 records in total;
     each segment has one file of about 220 MB, with 4 blocklets and 186 pages
     per CarbonData file;
  5. The data in each Parquet file and each CarbonData file is identical;
  6. Create table SQL:
  7. Test SQL:
     1) select count(chan), count(fcip), sum(size) from table;
     2) select chan, fcip, sum(size) from table group by chan, fcip
        order by chan, fcip;

Test result:
  SQL1:  Parquet:     4s   4s   4s
         CarbonData: 12s  11s  12s
  SQL2:  Parquet:    11s  10s  11s
         CarbonData: 18s  18s  19s

Analysis:
I added
some timing instrumentation to the code and changed the batch size of
CarbonVectorProxy from 4 * 1024 to 32 * 1024, using non-prefetch mode.
The timing stats (from one test run):
  1. BlockletFullScanner.readBlocklet: 169 ms;
  2. BlockletFullScanner.scanBlocklet: 176 ms;
  3. DictionaryBasedVectorResultCollector.collectResultInColumnarBatch:
     7958 ms. In this part it takes about 200-300 ms to handle each blocklet,
     so about 1 s in total to handle one CarbonData file; but the Carbon stat
     log shows about 1-2 s per file for SQL1 and 2-3 s per file for SQL2;
  4. In CarbonScanRDD.internalCompute, the iterator executes 1464 times; each
     iteration takes about 8-9 ms for SQL1 and 10-15 ms for SQL2;
  5. The total time of steps 1-3 is almost the same for SQL1 and SQL2.

Questions:
  1. Is there any optimization planned for
     DictionaryBasedVectorResultCollector.collectResultInColumnarBatch?
  2. In my timing stats it takes about 1 s to handle one CarbonData file, but
     in the Spark UI it actually takes about 1-2 s for SQL1 and 2-3 s for
     SQL2. Why? Shuffle?
     Compute?
  3. Can the batch size of CarbonVectorProxy be made configurable, to reduce
     the number of iterations? The default value is 4 * 1024, and with it the
     iterator executes 11616 times.

BTW, once the optimization discussed in this mailing thread
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html
is done, I will run this test case again.

Any feedback is welcome.
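As a back-of-the-envelope check on the iteration counts above (my own
arithmetic, assuming rows are spread evenly across the 8 files x 4 blocklets
= 32 blocklets, and that each blocklet starts a fresh vector batch):

```python
import math

total_rows = 47474456
blocklets = 8 * 4  # 8 files x 4 blocklets each
rows_per_blocklet = math.ceil(total_rows / blocklets)  # 1483577

for batch_size in (4 * 1024, 32 * 1024):
    iterations = blocklets * math.ceil(rows_per_blocklet / batch_size)
    print(batch_size, iterations)
# 4096  -> 11616, matching the observed 11616 iterations
# 32768 -> 1472, close to the observed 1464 (blocklets are not perfectly even)

# Per-iteration cost roughly accounts for the total query time:
print(round(1464 * 8.5 / 1000, 1))   # 12.4 -- vs. ~12 s observed for SQL1
print(round(1464 * 12.5 / 1000, 1))  # 18.3 -- vs. ~18 s observed for SQL2
```

If these assumptions hold, the per-iteration overhead multiplied by the
iteration count explains almost all of the wall time, which is why a larger
(configurable) batch size looks attractive to me.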



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
