Re: [DISCUSSION] Distributed Index Cache Server
Hi xuchuanyin,

I have uploaded version 2 of the design document with the desired changes. Please review and let me know if anything is missing or needs to be changed.

Thanks
Kunal Kapoor

On Mon, Feb 18, 2019 at 12:15 PM Kunal Kapoor wrote:
> Hi xuchuanyin,
> I will expose an interface and put the same in the design document soon.
>
> Thanks for the feedback
> Kunal Kapoor
>
> On Wed, Feb 13, 2019 at 8:04 PM ChuanYin Xu wrote:
>> Hi kunal, I think we can go further for 2.3 & 4.
>>
>> For 4, I think all functions of the IndexServer should be in an individual
>> module. We can think of the IndexServer as an enhancement component for
>> CarbonData, and inside that module we handle the actual pruning logic. On
>> the other side, if we do not have this component, there will be no pruning
>> at all.
>>
>> As a consequence, for 2.3, I think the IndexServer should provide
>> interfaces that offer pruning services: for example, it accepts filter
>> expressions and returns the pruning result.
>>
>> I think only in this way can the IndexServer be extensible enough to meet
>> higher requirements.
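The kind of pruning interface xuchuanyin suggests — "accepts expressions and returns pruning result" — could be sketched roughly as below. This is an illustrative Java sketch only; every name here (`IndexPruneService`, `FilterExpr`, `InMemoryIndexServer`) is hypothetical and is not the actual CarbonData API, and the min/max range check stands in for the real, much richer expression evaluation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical pruning interface: the IndexServer accepts a filter
// expression and returns the ids of blocks that may contain matches.
interface IndexPruneService {
    List<String> prune(String tableName, FilterExpr filter);
}

// Trivial stand-in for a filter expression (the real system would use
// a full expression tree).
class FilterExpr {
    final String column;
    final int equalsValue;
    FilterExpr(String column, int equalsValue) {
        this.column = column;
        this.equalsValue = equalsValue;
    }
}

// Dummy in-memory implementation: each "block" carries min/max metadata
// for a column, and pruning keeps only blocks whose range covers the value.
class InMemoryIndexServer implements IndexPruneService {
    static class BlockMeta {
        final String blockId; final int min; final int max;
        BlockMeta(String blockId, int min, int max) {
            this.blockId = blockId; this.min = min; this.max = max;
        }
    }
    private final List<BlockMeta> blocks;
    InMemoryIndexServer(List<BlockMeta> blocks) { this.blocks = blocks; }

    @Override
    public List<String> prune(String tableName, FilterExpr filter) {
        return blocks.stream()
            .filter(b -> filter.equalsValue >= b.min && filter.equalsValue <= b.max)
            .map(b -> b.blockId)
            .collect(Collectors.toList());
    }
}
```

Keeping the contract this narrow (expression in, pruned block list out) is what would let the IndexServer module be swapped or enhanced independently, as suggested in the thread.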
[Discussion] How to pass some options into Insert Into command
Hi all,

For data loading we can pass options into the LOAD DATA command using the OPTIONS clause, but the INSERT INTO command can't. How should we pass options into the INSERT INTO command? Some possibilities:

1. Implement an OPTIONS clause for the INSERT INTO command.
2. Use a hint.
3. Use SET key=value.
4. Use other methods that achieve the same result, for example "clustered by random(2)" to implement "GLOBAL_SORT_PARTITIONS"="2".
5. ??

Any suggestion?

-
Best Regards
David Cai

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
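The alternatives above might look like the following. Note that every syntax here is a hypothetical proposal from this thread, not existing CarbonData grammar:

```sql
-- 1. Hypothetical OPTIONS clause on INSERT INTO (mirroring LOAD DATA):
INSERT INTO target_table
  SELECT * FROM source_table
  OPTIONS ('GLOBAL_SORT_PARTITIONS' = '2');

-- 2. A hint carried inside the SELECT:
INSERT INTO target_table
  SELECT /*+ GLOBAL_SORT_PARTITIONS(2) */ * FROM source_table;

-- 3. A session-level property set before the insert:
SET carbon.load.global_sort_partitions = 2;
INSERT INTO target_table SELECT * FROM source_table;

-- 4. Reusing existing syntax to the same effect:
INSERT INTO target_table
  SELECT * FROM source_table CLUSTERED BY random(2);
```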
Re: [DISCUSSION] Support OBSFS
Hi Manish,

May I know which version of Hadoop will provide OBSFS support? Is this file system already supported in recent Hadoop releases?

Thanks
Sujith

On Tue, 19 Feb 2019 at 4:18 PM, manish nalla wrote:
> Hi all,
>
> OBS is an Object-based Storage Service developed and maintained by
> HuaweiCloud. It provides large storage capacity and is capable of storing
> any type of file. OBS supports both the S3 client and the OBS client to
> connect to the OBS server.
>
> Currently in CarbonData we support HDFS and S3 as file systems, and I
> am proposing to support OBSFS as another file system because of a few
> drawbacks of S3FileSystem.
>
> CarbonData needs OBSFS instead of S3 for two main reasons:
> 1. Append: while doing an append we first have to read the whole object
> and then append to it, which is quite slow.
> 2. Atomic rename: there is no atomic rename in S3, as also mentioned in
> Jira [CARBONDATA-2670].
>
> Both of these issues can be fixed if we use OBSFileSystem.
>
> Any suggestions from the community will be greatly appreciated. I will
> upload the design doc shortly.
>
> Thanks and regards
> Manish Nalla
> EI BigData Kernel,
> Huawei Technologies India Pvt. Ltd
Re: [DISCUSSION] Support OBSFS
Hi Manish Nalla,

Thanks for proposing this feature. Please clarify the points below.

1. What will be the grammar to store CarbonData files on OBSFS?
2. As S3 does not support concurrent data-manipulation operations or a file-leasing mechanism, will the behavior be the same for OBSFS?
3. What authentication properties have to be configured to store CarbonData files in an OBSFS location?

Hope all these points will be covered in the design document.

Regards,
Indhumathi M
Re: [Discussion] DDLs to operate on CarbonLRUCache
Hi Naman,

Thanks for proposing the feature. It looks really helpful from both the user and the developer perspective. Basically we need the design document, so that all the doubts can be cleared.

1. How are you going to handle sync issues, e.g. multiple queries running concurrently with DROP and SHOW CACHE? Are you going to introduce any locking mechanism?
2. What if the user clears the cache during a query? How is it going to behave? Is it allowed, or is the concurrent operation blocked?
3. How is it going to work with the distributed index server and its variants (embedded, Presto, and the local server)? Basically, what is the impact there?
4. You said you will launch a job to get the size of all the blocks present. Currently we create the block or blocklet datamap, calculate each datamap's size, and then add it to the cache based on the configured LRU cache size. So I wanted to know how you will calculate the size in your case.

Regards,
Akash
[DISCUSSION] Support OBSFS
Hi all,

OBS is an Object-based Storage Service developed and maintained by HuaweiCloud. It provides large storage capacity and is capable of storing any type of file. OBS supports both the S3 client and the OBS client to connect to the OBS server.

Currently in CarbonData we support HDFS and S3 as file systems, and I am proposing to support OBSFS as another file system because of a few drawbacks of S3FileSystem.

CarbonData needs OBSFS instead of S3 for two main reasons:
1. Append: while doing an append we first have to read the whole object and then append to it, which is quite slow.
2. Atomic rename: there is no atomic rename in S3, as also mentioned in Jira [CARBONDATA-2670].

Both of these issues can be fixed if we use OBSFileSystem.

Any suggestions from the community will be greatly appreciated. I will upload the design doc shortly.

Thanks and regards
Manish Nalla
EI BigData Kernel,
Huawei Technologies India Pvt. Ltd
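Reason 1 above (slow append) can be made concrete with a small simulation. The sketch below is illustrative only, not CarbonData or any real S3/OBS client code: it models an S3-like store as a map of immutable objects, so an "append" has to read the whole existing object, concatenate, and write the entire result back, and it counts how many bytes each append rewrites.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of why append is slow on an S3-like object store:
// objects are immutable, so an "append" is a read-modify-write of the
// whole object. A native-append file system avoids this rewrite.
class ObjectStoreAppendSketch {
    private final Map<String, byte[]> store = new HashMap<>();
    long bytesRewritten = 0; // tracks how much data each append re-writes

    void put(String key, byte[] data) {
        store.put(key, data);
    }

    byte[] get(String key) {
        return store.getOrDefault(key, new byte[0]);
    }

    // Emulated append: read the whole object, concatenate, write it all back.
    void append(String key, byte[] extra) {
        byte[] old = get(key);
        byte[] merged = new byte[old.length + extra.length];
        System.arraycopy(old, 0, merged, 0, old.length);
        System.arraycopy(extra, 0, merged, old.length, extra.length);
        bytesRewritten += merged.length; // the entire object is written again
        store.put(key, merged);
    }
}
```

The rewrite cost grows with the object size, which is why appending to large segment files on S3 is expensive, whereas a file system with native append only pays for the new bytes.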
Re: [Discussion] DDLs to operate on CarbonLRUCache
Hi Naman,

Thanks for proposing this feature, it seems to be a pretty interesting one. A few points I want to bring up here:

1) I think we will require a detailed design for this feature in which all the DDLs you are going to expose are clearly mentioned, since frequent updates to DDLs are not recommended later. It would be better to also cover the scenarios which can impact your DDL operations, such as cross-session DDL operations, e.g. one user trying to clear the cache/table while another user executes the SHOW CACHE command. Basically, you should also mention how you will handle all the synchronization scenarios.

2) Spark has already exposed DDLs for clearing caches, as below. Please refer to them and try to get more insight into these DDLs; it is better to follow a standard syntax.
"CLEAR CACHE"
"UNCACHE TABLE (IF EXISTS)? tableIdentifier"

3) How will you deal with the drop-table case? I think you should clear the respective cache as well. Mention these scenarios clearly in your design document.

4) 0 for point 5, as I think you need to explain more in your design document about the scenarios and the need for this feature. This DDL can bring more complexity to the system, e.g. by the time the system calculates the table size, a new segment can get added or an existing segment can get modified. So again you would need to take a lock so that these kinds of synchronization issues can be tackled in a better manner.

Overall I think the approach should be well documented before you start with the implementation. Please let me know of any clarifications or suggestions regarding the above points.

Regards,
Sujith

On Mon, Feb 18, 2019 at 3:35 PM Naman Rastogi wrote:
> Hi all,
>
> Currently carbon supports a caching mechanism for blocks/blocklets. Even
> though it allows the end user to set the cache size, it is still very
> limited in functionality, and the user arbitrarily chooses the carbon
> property *carbon.max.driver.lru.cache.size*: before launching the
> carbon session, he/she has no idea how much cache should be set for
> his/her requirement.
>
> For this problem, I propose the following improvements in the carbon
> caching mechanism.
>
> 1. Support DDL for showing the current cache used per table.
> 2. Support DDL for showing the current cache used for a particular table.
>    For these two points, QiangCai already has a PR:
>    https://github.com/apache/carbondata/pull/3078
>
> 3. Support DDL for clearing all the entries in the cache.
>    This will look like:
>    CLEAN CACHE
>
> 4. Support DDL for clearing the cache for a particular table.
>    This will clear all the entries in the cache which belong to a
>    particular table. This will look like:
>    CLEAN CACHE FOR TABLE tablename
>
> 5. Support DDL to estimate the required cache for a particular table.
>    As explained above, the user does not know beforehand how much cache
>    will be required for his/her current work. This DDL will let the
>    user estimate how much cache will be required for a particular
>    table. For this we will launch a job, estimate the memory
>    required for all the blocks, and sum it up.
>
> 6. Dynamic "max cache size" configuration.
>    Suppose the user now knows the required cache size, but the
>    current system requires the user to set
>    *carbon.max.driver.lru.cache.size* and restart the JDBC server for
>    it to take effect. For this I am suggesting making the carbon
>    property *carbon.max.driver.lru.cache.size* dynamically configurable,
>    which allows the user to change the max LRU cache size on the fly.
>
> Any suggestion from the community is greatly appreciated.
>
> Thanks
>
> Regards
>
> Naman Rastogi
> Technical Lead - BigData Kernel
> Huawei Technologies India Pvt. Ltd.
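Points 4 and 6 of the proposal — per-table cache clearing and a dynamically adjustable max size — can be sketched as below. This is a minimal illustrative sketch, not CarbonLRUCache itself; the `tableName/blockId` key convention and all class and method names are assumptions made purely for the illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal size-bounded LRU cache with a dynamically adjustable limit
// (point 6) and a per-table clean operation (point 4). Hypothetical
// sketch only, not the CarbonData implementation.
class LruCacheSketch {
    private long maxSizeBytes;
    private long currentSizeBytes = 0;
    // access-order LinkedHashMap gives us LRU iteration order for free
    private final LinkedHashMap<String, Long> entrySizes =
            new LinkedHashMap<>(16, 0.75f, true);

    LruCacheSketch(long maxSizeBytes) { this.maxSizeBytes = maxSizeBytes; }

    void put(String key, long sizeBytes) {
        Long old = entrySizes.remove(key);
        if (old != null) currentSizeBytes -= old;
        entrySizes.put(key, sizeBytes);
        currentSizeBytes += sizeBytes;
        evictToLimit();
    }

    // Point 6: change the max size on the fly, evicting if we now exceed it.
    void setMaxSize(long newMaxBytes) {
        this.maxSizeBytes = newMaxBytes;
        evictToLimit();
    }

    // Point 4: drop every entry belonging to one table. Keys are assumed
    // to be prefixed "tableName/blockId" purely for this illustration.
    void cleanCacheForTable(String tableName) {
        entrySizes.entrySet().removeIf(e -> {
            boolean match = e.getKey().startsWith(tableName + "/");
            if (match) currentSizeBytes -= e.getValue();
            return match;
        });
    }

    long size() { return currentSizeBytes; }

    // Evict least-recently-used entries until we are within the limit.
    private void evictToLimit() {
        Iterator<Map.Entry<String, Long>> it = entrySizes.entrySet().iterator();
        while (currentSizeBytes > maxSizeBytes && it.hasNext()) {
            currentSizeBytes -= it.next().getValue();
            it.remove();
        }
    }
}
```

Note that `setMaxSize` makes the dynamic-configuration idea concrete: shrinking the limit immediately evicts down to the new bound, so no driver restart is conceptually required.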
Re: [Discussion] DDLs to operate on CarbonLRUCache
+1 for 5 and 6: after point 5 estimates the cache size, point 6 can modify the configuration dynamically.

+1 for 3 and 4: maybe we need to add a lock to synchronize the concurrent operations. If a user wants to release the cache, they will no longer need to restart the driver.

Maybe we also need to check how to use these operations in "[DISCUSSION] Distributed Index Cache Server".

-
Best Regards
David Cai
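The lock-based synchronization suggested for the concurrent SHOW/CLEAN cache operations could be sketched as below. This is illustrative only (class and method names are hypothetical): SHOW CACHE takes a shared read lock so many readers can run in parallel, while CLEAN CACHE takes the exclusive write lock so it never races with an in-flight read.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of read/write locking around cache DDL operations.
class SynchronizedCacheOps {
    private final ConcurrentHashMap<String, Long> cache = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void put(String key, long size) {
        lock.writeLock().lock();
        try { cache.put(key, size); } finally { lock.writeLock().unlock(); }
    }

    // SHOW CACHE: total bytes currently cached (shared read lock).
    long showCacheSize() {
        lock.readLock().lock();
        try {
            return cache.values().stream().mapToLong(Long::longValue).sum();
        } finally { lock.readLock().unlock(); }
    }

    // CLEAN CACHE: exclusive write lock, so a concurrent SHOW CACHE sees
    // either the full cache or the empty one, never a half-cleared state.
    void cleanCache() {
        lock.writeLock().lock();
        try { cache.clear(); } finally { lock.writeLock().unlock(); }
    }
}
```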
RE: Re: [DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement
Hi Ravindra,

Got your point. As I replied to xuchuanyin, we can take these index datamap enhancements up separately.

Thank you
RE: Re: [DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement
+1 for Ravindra's advice. We currently only support lazy/incremental load/rebuild for OLAP datamaps (MV/preagg), not for index datamaps.
RE: Re: [DISCUSSION] Support Incremental load in datamap and other MV datamap enhancement
Hi Akash,

There is a difference between index datamaps (like bloom) and OLAP datamaps (like MV). Index datamaps are used only for pruning the data, while OLAP datamaps serve as pre-computed data which can be fetched directly to answer a query.

In the OLAP datamap case a lazy or deferred build makes sense, as the data always needs to be synchronized with the master data, otherwise we will get stale data. So any difference in synchronization will disable the datamap. On the other hand, an index datamap is used only for faster pruning, so synchronization with the master data is not mandatory, unless we have a mechanism to prune synchronized data using the index datamap and non-synchronized data using the default datamap. This is the same point @xuchuanyin mentioned.

I feel this design is about OLAP datamap incremental loading, so it is better not to change the behaviour of index datamaps. We can consider index datamap improvements in the future, but they should not be part of this. Please update the design if it mentions anything related to index datamaps.

Regards,
Ravindra.
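The core distinction above can be made concrete with a tiny sketch (illustrative only, not CarbonData code; all names are hypothetical): a query answered from master data is always correct, but a query answered directly from a pre-computed OLAP datamap is only correct while the datamap stays synchronized, which is exactly why incremental load matters for MV but not for pruning-only index datamaps.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why a stale OLAP datamap is dangerous while a stale index
// datamap is merely suboptimal: the OLAP datamap is *answered from*,
// whereas an index datamap only narrows the data to scan.
class DatamapStalenessSketch {
    // master data: one value per row
    final List<Integer> rows = new ArrayList<>();

    // OLAP-style datamap: a pre-computed sum served directly to queries
    private long precomputedSum = 0;

    void load(int value, boolean syncDatamap) {
        rows.add(value);
        if (syncDatamap) precomputedSum += value; // incremental rebuild
    }

    // Query answered from master data: always correct.
    long sumFromMasterData() {
        return rows.stream().mapToLong(Integer::longValue).sum();
    }

    // Query answered from the OLAP datamap: correct only when synchronized.
    long sumFromOlapDatamap() {
        return precomputedSum;
    }
}
```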