xuchuanyin created CARBONDATA-1281:
--------------------------------------

             Summary: Disk hotspot found during data loading
                 Key: CARBONDATA-1281
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1281
             Project: CarbonData
          Issue Type: Improvement
          Components: core, data-load
    Affects Versions: 1.1.0
            Reporter: xuchuanyin


# Scenario

We are currently running a massive data load. The input data is about 71 GB in 
CSV format and contains about 88 million records. We do not use any dictionary 
encoding in CarbonData. Our test environment has three nodes, each with 11 
disks configured as YARN executor directories. We submit the load command 
through JDBCServer. The JDBCServer instance has three executors in total, one 
on each node. The load takes about 10 minutes (give or take 3 minutes between 
runs).

We observed the nmon output during the load and found:

1. many CPU waits in the first half of the load;

2. only a single disk receives many writes, and it almost reaches its 
bottleneck (Avg. 80M/s, Max. 150M/s on SAS Disk);

3. the other disks are quite idle.

# Analysis

During data loading, CarbonData reads and sorts the data locally (the default 
scope) and writes the temp files to local disk. In my case there is only one 
executor per node, so CarbonData writes all the temp files to one disk (the 
container directory or YARN local directory), which results in a single-disk 
hotspot.

# Modification

We should support writing temp files to multiple directories to avoid the disk 
hotspot.

PS: I have implemented this improvement in my environment and the result is 
promising: the load takes about 6 minutes (10 minutes before the improvement).
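
A minimal sketch of the idea, not the actual patch: temp files are spread over 
several directories by picking the target directory round-robin. The class 
name `TempDirSelector`, the method `nextDir`, and the comma-separated 
directory list are hypothetical and only illustrate the approach; how 
CarbonData would obtain the list of local directories is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of CarbonData's API): hands out temp
 * directories in round-robin order so that writes are spread across disks.
 */
class TempDirSelector {
    private final List<String> dirs;
    private final AtomicInteger next = new AtomicInteger(0);

    /** @param commaSeparatedDirs assumed format, e.g. "/data1/tmp,/data2/tmp" */
    TempDirSelector(String commaSeparatedDirs) {
        List<String> parsed = new ArrayList<>();
        for (String d : commaSeparatedDirs.split(",")) {
            String trimmed = d.trim();
            if (!trimmed.isEmpty()) {
                parsed.add(trimmed);
            }
        }
        if (parsed.isEmpty()) {
            throw new IllegalArgumentException("at least one temp directory is required");
        }
        this.dirs = parsed;
    }

    /** Returns the next directory; safe to call from several writer threads. */
    String nextDir() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), dirs.size());
        return dirs.get(i);
    }
}
```

With this sketch, each sort-temp writer would call `nextDir()` once when it 
creates a new temp file, so consecutive files land on different disks. 
Round-robin is only one possible policy; choosing a directory at random or by 
free space would serve the same purpose.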



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
