[ https://issues.apache.org/jira/browse/CALCITE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846227#comment-16846227 ]
Julian Hyde commented on CALCITE-2979: -------------------------------------- By 'assigning keys to batches' I meant deciding when you have enough keys in a batch that it is worthwhile making a round-trip. [~zabetak] described a strategy using BLK_SIZE. Let's BLK_SIZE is 1000. Then the first batch will consist of key 0 through key 999. The second batch will consist of key 1000 through key 1999. But it's not the only possible strategy for assigning keys to batches. Let's suppose that the keys are date values, and you are probing into a table that is partitioned by month. You could construct batches that contain all keys that fall within a particular month, and therefore you would make at most one round-trip to each partition. Other strategies are possible, if you generalize away from using fixed size batches with BLK_SIZE entires. I will note that what you are doing is a form of work scheduling. (Fetching a row that corresponds to a particular key is an item of work, and of course grouping work items into batches is a well-known way to increase efficiency by trading off latency.) There are many strategies for work scheduling, such as [round-robin|https://en.wikipedia.org/wiki/Round-robin_scheduling]. Some of those strategies could be considered here. Schedulers often distribute work via queues, and I have a feeling that queues could be useful in this algorithm, too. > Add a block-based nested loop join algorithm > -------------------------------------------- > > Key: CALCITE-2979 > URL: https://issues.apache.org/jira/browse/CALCITE-2979 > Project: Calcite > Issue Type: Improvement > Components: core > Affects Versions: 1.19.0 > Reporter: Stamatis Zampetakis > Assignee: Khawla Mouhoubi > Priority: Major > Labels: performance > > Currently, Calcite provides a tuple-based nested loop join algorithm > implemented through EnumerableCorrelate and EnumerableDefaults.correlateJoin. > This means that for each tuple of the outer relation we probe (set variables) > in the inner relation. > The goal of this issue is to add new algorithm (or extend the correlateJoin > method) which first gathers blocks (batches) of tuples from the outer > relation and then probes the inner relation once per block. > There are cases (eg., indexes) where the inner relation can be accessed by > more than one value which can greatly improve the performance in particular > when the outer relation is big. -- This message was sent by Atlassian JIRA (v7.6.3#76005)