Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage
+1. This can reduce the memory footprint in the Spark driver; it is great for ultra-big data.

Regards,
Jacky

> On 14 Jan 2020, at 4:38 PM, Indhumathi wrote:
> [quoted message trimmed]
Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage
+1. Can you explain more about how you are encoding and storing min/max in the segment file? Since min/max values represent user data, we cannot store them as plain values, and storing encrypted min/max would add the overhead of encrypting and decrypting. I suggest we convert the segment file to a Thrift file to solve this. Other suggestions are welcome.

Thanks,
Ajantha

On Tue, 14 Jan 2020, 4:37 pm Indhumathi, wrote:
> [quoted message trimmed]
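For illustration, here is a minimal Python sketch of what "encoded" (as opposed to plain-text) min/max storage could look like: the values are packed as binary and Base64-encoded before being written into the segment file. This is a hypothetical example, not CarbonData's actual format, and note that such an encoding is not encryption and does not protect user data, which is why stronger options such as a binary Thrift segment file are being discussed.

```python
# Hypothetical sketch only: pack min/max as binary and Base64-encode them,
# so the segment file does not contain the user's values as plain text.
# This is reversible encoding, NOT encryption.
import base64
import struct

def encode_minmax(col_min: int, col_max: int) -> str:
    # Pack both values as big-endian signed 64-bit ints, then Base64
    # so the result is safe to embed in a text-based segment file.
    return base64.b64encode(struct.pack(">qq", col_min, col_max)).decode("ascii")

def decode_minmax(encoded: str) -> tuple:
    # Inverse of encode_minmax: Base64-decode, then unpack the two ints.
    return struct.unpack(">qq", base64.b64decode(encoded))

token = encode_minmax(1, 100)
print(token, decode_minmax(token))
```

The round trip is lossless, so the driver can recover exact min/max values for pruning without the segment file carrying them in readable form.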
[Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage
Hello all,

In cloud scenarios, the index can be too big to store in the Spark driver, since the VM may not have that much memory. Currently in Carbon, we load all indexes into the cache on the first query. Since the Carbon LRU cache does not support time-based expiration, indexes are removed from the cache on a least-recently-used basis when the cache is full.

In some scenarios, where a user's table has many segments but the user often queries only a few of them, we do not need to load all indexes into the cache. For filter queries, if we prune segments first and load only the matched segments into the cache, driver memory will be saved.

For this purpose, I am planning to add block min/max values to the segment metadata file, prune segments based on the segment files, and load the index only for matched segments. As part of this, I will add a configurable carbon property '*carbon.load.all.index.to.cache*' to allow the user to load all indexes into the cache if needed. By default, the value will be true.

Currently, for each load, we write a segment metadata file, which holds the information about the index file. During a query, we read each segment file to get the index file info, and then we load all datamaps for that segment. The min/max data will be encoded and stored in the segment file.

Any suggestions/inputs from the community are appreciated.

Thanks
Indhumathi

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
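To make the proposal concrete, below is a minimal sketch of segment-level min/max pruning. It is written in Python with illustrative names (`SegmentMeta`, `prune_segments` are not CarbonData APIs; CarbonData itself is Java): the driver keeps only per-segment min/max read from the segment metadata files, and loads the full block index into the LRU cache only for segments whose range can match the filter.

```python
# Hypothetical sketch of segment-level min/max pruning.
# SegmentMeta / prune_segments are illustrative names, not CarbonData APIs.
from dataclasses import dataclass
from typing import List

@dataclass
class SegmentMeta:
    segment_id: str
    col_min: int   # min for the filter column, read from the segment metadata file
    col_max: int   # max for the filter column, read from the segment metadata file

def prune_segments(segments: List[SegmentMeta], filter_value: int) -> List[str]:
    """Return ids of segments whose [min, max] range can contain filter_value.

    Only these segments need their block indexes loaded into the driver's
    LRU cache; all other segments are skipped without touching their index.
    """
    return [s.segment_id
            for s in segments
            if s.col_min <= filter_value <= s.col_max]

segments = [
    SegmentMeta("0", 1, 100),
    SegmentMeta("1", 101, 200),
    SegmentMeta("2", 50, 150),
]
# A filter like `col = 120` only needs the indexes of segments 1 and 2.
print(prune_segments(segments, 120))   # ['1', '2']
```

With `carbon.load.all.index.to.cache` set to true (the proposed default), this pruning step would be bypassed and all segment indexes loaded as today.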