Hi Jacky,
One question: Can you explain what information the proposed CarbonData Storage Service would store? And how should users pre-configure memory resources for the service — as much memory as possible?

Regards
Liang

2017-05-14 0:19 GMT-04:00 Jacky Li <jacky.li...@qq.com>:

> Hi community,
>
> The partition feature was proposed by Cao Lu in thread (
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Implement-Partition-Table-Feature-td10938.html#a11321
> ); the implementation effort is ongoing.
>
> After partitioning is implemented, point queries using sort columns are
> expected to be faster than the current B-Tree index approach. To further
> boost performance and achieve higher concurrency, I want to discuss
> providing a service for CarbonData.
>
> Following is the proposal:
>
> CarbonData Storage Service
> At the moment, the CarbonData project mainly defines a columnar format
> with index support. CarbonData files are read and written inside a
> processing framework (such as a Spark executor). They are efficient for
> OLAP/data-warehouse workloads; however, there is overhead for simple
> queries such as point queries. For example, in Spark, DAG breakdown, task
> scheduling, and task serialization/deserialization are inevitable.
> Furthermore, executor memory is meant to be controlled by Spark core,
> while CarbonData requires its own memory cache.
>
> So, to improve on this, I suggest adding a Storage Service to the
> CarbonData project. The main goal of this service is to serve point
> queries and manage carbon data storage.
>
> 1. Deployment
> This service can be embedded in the processing framework (Spark executor)
> as is done today, or deployed as a new self-managed process on each HDFS
> data node. For the latter approach, we can implement a YARN application to
> manage these processes.
>
> 2. Communication
> A service client will communicate with the service. One simple approach is
> to reuse the Netty RPC framework we currently have for dictionary
> generation in single-pass loading. We need to add configuration for the
> RPC ports of this service.
>
> 3. Functionality
> I can think of a few functionalities this service could provide; you can
> suggest more.
> 1) Serving point queries
> The query filter consists of PARTITION_COLUMN and SORT_COLUMN. The client
> sends an RPC request to the service; the service opens the requested file,
> locates the offset by SORT_COLUMN, and starts scanning. Reading of
> CarbonData files remains unchanged from the current CarbonData
> RecordReader. Once the result data is collected, it is returned to the
> client in the RPC response.
> By optimizing the client- and service-side handling and the RPC payload,
> this should be more efficient than a Spark task.
>
> 2) Cache management
> Currently, CarbonData caches file-level indexes in the Spark executor,
> which is not desirable, especially when dynamic allocation is enabled in
> Spark. By adding this Storage Service, CarbonData can better manage this
> cache inside its own memory space. Besides the index cache, we could also
> consider adding a cache for hot blocks/blocklets, further reducing IO and
> latency.
>
> 3) Compaction management
> The SORT_COLUMN keyword is planned for CarbonData 1.2, and users can use
> it to force NO SORT for a table to make loading faster. There is also a
> BATCH_SORT option. By adding this service, we can implement a policy in
> the service that triggers compaction to do larger-scope sorting than the
> initial load.
>
> We may identify and add more functionality to this service in the future.
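To make the point-query flow in 1) concrete, here is a minimal, self-contained sketch in Java. All names (`PointQuerySketch`, `locateOffset`, `pointQuery`) are hypothetical and are not part of any existing CarbonData API; the sketch only illustrates the described steps: locate a start offset by SORT_COLUMN with a binary search over sorted keys, scan forward collecting matches, and hand the collected result to the RPC response.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed point-query flow; not CarbonData code.
public class PointQuerySketch {

    // Simulated blocklet: SORT_COLUMN values stored in sorted order,
    // with the row payload at the same index.
    static final int[] SORT_KEYS = {3, 7, 7, 12, 20, 31};
    static final String[] ROWS = {"a", "b", "c", "d", "e", "f"};

    // "Locate the offset by SORT_COLUMN": binary search for the first
    // position whose key is >= the requested key.
    static int locateOffset(int key) {
        int lo = 0, hi = SORT_KEYS.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (SORT_KEYS[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // "Start scanning" from the located offset and collect matching rows;
    // the collected list is what the service would return in the RPC response.
    static List<String> pointQuery(int key) {
        List<String> result = new ArrayList<>();
        for (int i = locateOffset(key); i < SORT_KEYS.length && SORT_KEYS[i] == key; i++) {
            result.add(ROWS[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pointQuery(7));   // rows whose SORT_COLUMN == 7
        System.out.println(pointQuery(99));  // no match -> empty result
    }
}
```

The sketch deliberately leaves out the RPC layer itself; in the proposal that would be the Netty framework already used for dictionary generation, with `pointQuery` running on the service side.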
>
> What do you think about this idea?
>
> Regards,
> Jacky

--
Regards
Liang