Hi community,

The partition feature was proposed by Cao Lu in the thread (http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Implement-Partition-Table-Feature-td10938.html#a11321), and the implementation effort is ongoing.
After partition is implemented, point queries using sort columns are expected to be faster than the current B-Tree index approach. To further boost performance and achieve higher concurrency, I want to discuss providing a service for CarbonData. Following is the proposal:

CarbonData Storage Service

At the moment, the CarbonData project mainly defines a columnar format with index support. CarbonData files are read and written inside a processing framework (such as a Spark executor). They are efficient for OLAP/data-warehouse workloads, but there is overhead for simple queries like point queries: in Spark, for example, DAG breakdown, task scheduling, and task serialization/deserialization are unavoidable. Furthermore, executor memory is meant to be controlled by Spark core, while CarbonData needs its own memory cache. To improve on this, I suggest adding a Storage Service to the CarbonData project. The main goal of this service is to serve point queries and manage CarbonData storage.

1. Deployment

This service can be embedded in the processing framework (Spark executor) as it is today, or deployed as a new self-managed process on the HDFS data nodes. For the latter approach, we can implement a YARN application to manage these processes.

2. Communication

A service client will communicate with the service. One simple approach is to reuse the netty RPC framework we already have for dictionary generation in single-pass loading. We need to add configuration for the RPC ports of this service.

3. Functionality

I can think of a few functionalities this service can provide; you can suggest more.

1) Serving point queries

The query filter consists of PARTITION_COLUMN and SORT_COLUMN. The client sends an RPC request to the service; the service opens the requested file, locates the offset by SORT_COLUMN, and starts scanning. Reading of CarbonData files is unchanged from the current CarbonData RecordReader. Once the result data is collected, it is returned to the client in the RPC response. By optimizing the client- and service-side handling and the RPC payload, this should be more efficient than a Spark task. (A rough sketch of this interface is attached after the signature.)

2) Cache management

Currently, CarbonData caches the file-level index in the Spark executor, which is not desirable, especially when dynamic allocation is enabled in Spark. With this Storage Service, CarbonData can manage the cache better inside its own memory space. Besides the index cache, we can also consider adding a cache for hot blocks/blocklets to further reduce IO and latency.

3) Compaction management

The SORT_COLUMN keyword is planned for CarbonData 1.2; users can use it to force NO SORT for a table to make loading faster, and there is also a BATCH_SORT option. With this service, we can implement a policy in the service to trigger compaction that performs larger-scope sorting than the initial load did.

We may identify and add more functionality to this service in the future.

What do you think about this idea?

Regards,
Jacky
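
P.S. To make functionality 1) a bit more concrete, here is a minimal Java sketch of the point-query RPC messages and the service interface. All class names, fields, and method signatures below are hypothetical assumptions for discussion, not an existing CarbonData API; the transport would be the netty RPC framework mentioned in section 2.

import java.io.Serializable;
import java.util.List;

// Hypothetical request: carries the filter values on PARTITION_COLUMN and
// SORT_COLUMN so the service can prune to the target file and seek to the
// matching offset before scanning.
class PointQueryRequest implements Serializable {
  final String tableName;
  final Object partitionKey;          // value of PARTITION_COLUMN
  final Object sortKey;               // value of SORT_COLUMN
  final List<String> projectColumns;  // columns to return

  PointQueryRequest(String tableName, Object partitionKey, Object sortKey,
      List<String> projectColumns) {
    this.tableName = tableName;
    this.partitionKey = partitionKey;
    this.sortKey = sortKey;
    this.projectColumns = projectColumns;
  }
}

// Hypothetical response: the matching rows, already projected.
class PointQueryResponse implements Serializable {
  final List<Object[]> rows;

  PointQueryResponse(List<Object[]> rows) {
    this.rows = rows;
  }
}

// Hypothetical service interface, implemented by the Storage Service process
// (embedded in the executor, or standalone on the HDFS data node).
interface StorageService {
  // Locate the target file by partitionKey, find the start offset within the
  // file by sortKey, scan with the existing CarbonData RecordReader, and
  // return the collected rows in a single round trip.
  PointQueryResponse pointQuery(PointQueryRequest request);
}

The intent is one network round trip per point query: the client ships the PARTITION_COLUMN/SORT_COLUMN values and gets the projected rows back, with no DAG scheduling or task serialization involved.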