Hi Venu,

a. *Presto carbondata supports reading the bloom index*, so I want to
correct your initial statement: "Presto engine do not make use of
indexes(SI, Bloom etc) in query processing".

b. The main difference between option1 and option2 is that *option1 is
multi-threaded and option2 is distributed.* Because option1 is limited to
the threads of a single process, its performance will be bad. Hence, even
though option2 needs a spark index server cluster (currently presto
carbondata always needs a spark cluster to write carbondata anyway), *I
want to go with option2.*

c. For option2, you cannot implement it the way bloom is implemented,
because we need to read the whole SI table with a filter. So I suggest
making a dataframe by querying the SI table (which calls CarbonScanRDD)
and, once you get the matched blocklets, making the splits for the main
table from them based on block-level or blocklet-level task distribution.

Thanks,
Ajantha

On Tue, Jan 5, 2021 at 5:31 PM VenuReddy <k.venureddy2...@gmail.com> wrote:

> Hi all!
>
> At present Carbon table queries with Presto engine do not make use of
> indexes(SI, Bloom etc) in query processing. Exploring feasible approaches
> without query plan rewrite to make use of secondary indexes (if any
> available), similar to that of the existing datamap.
>
> *Option 1:*
> Presto get splits for main table to find the suitable SI table, scan, get
> the position references from SI table and return the splits for main table
> accordingly.
>
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. Within the context of CarbondataSplitManager.getSplits() for the main
>    table, in CarbonInputFormat.getPrunedBlocklets(), we can prune with the
>    new CoarseGrainIndex implementation for SI (similar to that of bloom).
>    Inside prune(), identify the best suitable SI table, use the SDK
>    CarbonReader to scan the identified SI table, and get the position
>    references matching the predicate. Need to think of reading the table
>    in multiple threads.
> 3. Modify the filter expression to append a positionId filter with the
>    position references obtained from the SI table read.
> 4. In the context of CarbondataPageSource, create the QueryModel with the
>    modified filter expression.
>
> Rest of the processing remains the same as before.
>
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
> 2. Can leverage SI with any execution engine.
>
> *Disadvantages:*
> 1. Reading the SI table in the context of
>    CarbondataSplitManager.getSplits() of the main table may degrade the
>    query performance. Need to have enough resources to spawn multiple
>    threads for reading within it.
>
> *Option 2:*
> Use Index Server to prune (enable distributed pruning).
>
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. On the Index Server, during getSplits() for the main table, in the
>    context of DistributedPruneRDD.internalCompute() (i.e., on Index Server
>    executors), pruneIndexes() can prune with the new CoarseGrainIndex
>    implementation for SI (similar to that of bloom). Inside prune(),
>    identify the best suitable SI table, use CarbonReader to read the SI
>    table, and get the position references matching the predicate.
> 3. Return the extended blocklets for the main table.
> 4. Need to check how to return/transform the filter expression, with the
>    positionId filter built from the position references read from the SI
>    table appended, from the Index Server to the Driver along with the
>    pruned blocklets?
>
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
>
> *Disadvantages:*
> 1. Index Server executors' memory would be occupied for SI table reading.
> 2. Concurrent queries may be impacted, as the Index Server is used for SI
>    table reading.
> 3. Index Server must be running.
>
> We can introduce a new Carbon property to switch between the present and
> the new approach being proposed. We may consider the secondary index table
> storage file format change later.
>
> Please let me know your opinion/suggestion on whether we can go with
> Option-1 or Option-2 or both Option 1 + 2, or any other suggestion?
>
> Thanks,
> Venu Reddy
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
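
PS: To make the last part of point (c) a bit more concrete, here is a
minimal sketch of turning SI matches into main-table splits at block level
or blocklet level. It is plain Java with no CarbonData, Spark, or Presto
dependencies. SiSplitSketch, Match, Split, and Granularity are hypothetical
names used only for illustration, and the file names are made up; the real
code would work on the blocklets returned by the SI table query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch; none of these types are CarbonData or Presto classes. */
class SiSplitSketch {

    /** How main-table scan tasks are cut: one per block file, or one per blocklet. */
    public enum Granularity { BLOCK, BLOCKLET }

    /** One SI match, reduced to the main-table block and blocklet it points at. */
    public static final class Match {
        final String blockPath;
        final int blockletId;

        public Match(String blockPath, int blockletId) {
            this.blockPath = blockPath;
            this.blockletId = blockletId;
        }
    }

    /** One unit of work for the main-table scan. */
    public static final class Split {
        public final String blockPath;
        public final List<Integer> blockletIds;

        Split(String blockPath, List<Integer> blockletIds) {
            this.blockPath = blockPath;
            this.blockletIds = blockletIds;
        }
    }

    public static List<Split> toSplits(List<Match> matches, Granularity granularity) {
        // Collect the distinct blocklet ids per block, keeping first-seen block order.
        Map<String, TreeSet<Integer>> byBlock = new LinkedHashMap<>();
        for (Match m : matches) {
            byBlock.computeIfAbsent(m.blockPath, k -> new TreeSet<>()).add(m.blockletId);
        }

        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Integer>> e : byBlock.entrySet()) {
            if (granularity == Granularity.BLOCK) {
                // Block level: one split per block, carrying all matched blocklets.
                splits.add(new Split(e.getKey(), new ArrayList<>(e.getValue())));
            } else {
                // Blocklet level: one split per matched blocklet.
                for (Integer id : e.getValue()) {
                    splits.add(new Split(e.getKey(), Arrays.asList(id)));
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Match> matches = Arrays.asList(
            new Match("part-0-0.carbondata", 0),
            new Match("part-0-0.carbondata", 2),
            new Match("part-0-0.carbondata", 2),   // duplicate SI hit on the same blocklet
            new Match("part-0-1.carbondata", 1));

        System.out.println(toSplits(matches, Granularity.BLOCK).size());     // 2
        System.out.println(toSplits(matches, Granularity.BLOCKLET).size());  // 3
    }
}
```

Block level gives fewer, larger tasks; blocklet level gives more parallelism
when the matches are concentrated in a few blocks. Which one is better
depends on how many blocklets the SI filter leaves per block.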