Hi community,

The partition feature was proposed by Cao Lu in the thread (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Implement-Partition-Table-Feature-td10938.html#a11321), and the implementation effort is ongoing.
After partition is implemented, point queries using sort columns are expected to be faster than the current B-Tree index approach. To further boost performance and achieve higher concurrency, I want to discuss providing a service for CarbonData. Following is the proposal:

CarbonData Storage Service

At the moment, the CarbonData project mainly defines a columnar format with index support. CarbonData files are read and written inside a processing framework (such as a Spark executor). They are efficient for OLAP/data-warehouse workloads, but there is overhead for simple queries like point queries: in Spark, for example, DAG breakdown, task scheduling, and task serialization/deserialization are unavoidable. Furthermore, executor memory is meant to be controlled by Spark core, while CarbonData needs its own memory cache. To improve on this, I suggest adding a Storage Service to the CarbonData project. The main goal of this service is to serve point queries and manage CarbonData storage.

1. Deployment

This service can be embedded in the processing framework (Spark executor) as it is today, or deployed as a new self-managed process on the HDFS data nodes. For the latter approach, we can implement a YARN application to manage these processes.

2. Communication

A service client will communicate with the service. One simple approach is to reuse the netty RPC framework we already have for dictionary generation in single-pass loading. We need to add configuration for the RPC ports of this service.

3. Functionality

I can think of a few functionalities this service can provide; you can suggest more.

1) Serving point queries

The query filter consists of PARTITION_COLUMN and SORT_COLUMN. The client sends an RPC request to the service; the service opens the requested file, locates the offset by SORT_COLUMN, and starts scanning. Reading of CarbonData files is unchanged from the current CarbonData RecordReader. Once the result data is collected, it is returned to the client in the RPC response. By optimizing the client- and service-side handling and the RPC payload, this should be more efficient than a Spark task. (A rough sketch of this interface is attached after the signature.)

2) Cache management

Currently, CarbonData caches the file-level index in the Spark executor, which is not desirable, especially when dynamic allocation is enabled in Spark. With this Storage Service, CarbonData can manage the cache better inside its own memory space. Besides the index cache, we can also consider adding a cache for hot blocks/blocklets to further reduce IO and latency.

3) Compaction management

The SORT_COLUMN keyword is planned for CarbonData 1.2; users can use it to force NO SORT for a table to make loading faster, and there is also a BATCH_SORT option. With this service, we can implement a policy in the service to trigger compaction that performs larger-scope sorting than the initial load did.

We may identify and add more functionality to this service in the future.

What do you think about this idea?

Regards,
Jacky
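
P.S. To make functionality 1) a bit more concrete, here is a minimal Java sketch of the point-query RPC messages and the service interface. All class names, fields, and method signatures below are hypothetical assumptions for discussion, not an existing CarbonData API; the transport would be the netty RPC framework mentioned in section 2.

import java.io.Serializable;
import java.util.List;

// Hypothetical request: carries the filter values on PARTITION_COLUMN and
// SORT_COLUMN so the service can prune to the target file and seek to the
// matching offset before scanning.
class PointQueryRequest implements Serializable {
  final String tableName;
  final Object partitionKey;          // value of PARTITION_COLUMN
  final Object sortKey;               // value of SORT_COLUMN
  final List<String> projectColumns;  // columns to return

  PointQueryRequest(String tableName, Object partitionKey, Object sortKey,
      List<String> projectColumns) {
    this.tableName = tableName;
    this.partitionKey = partitionKey;
    this.sortKey = sortKey;
    this.projectColumns = projectColumns;
  }
}

// Hypothetical response: the matching rows, already projected.
class PointQueryResponse implements Serializable {
  final List<Object[]> rows;

  PointQueryResponse(List<Object[]> rows) {
    this.rows = rows;
  }
}

// Hypothetical service interface, implemented by the Storage Service process
// (embedded in the executor, or standalone on the HDFS data node).
interface StorageService {
  // Locate the target file by partitionKey, find the start offset within the
  // file by sortKey, scan with the existing CarbonData RecordReader, and
  // return the collected rows in a single round trip.
  PointQueryResponse pointQuery(PointQueryRequest request);
}

The intent is one network round trip per point query: the client ships the PARTITION_COLUMN/SORT_COLUMN values and gets the projected rows back, with no DAG scheduling or task serialization involved.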