Translate join predicates into filters

Luca Marchi Wed, 17 May 2023 13:37:57 -0700

Morning everyone, 
in our company we are running a POC using Apache Calcite, and we would like to 
collect some feedbacks from you for the scenario mentioned below.


There is a service API that allows retrieving some `Book`s, and we would like 
to build a table adapter on top of this service; this API 
only accepts a set of IDs, and if no IDs are provided, no result is returned.

```
  interface BookService {
    /** Returns the books matching the given IDs.
     *
     * <p>If not IDs is provided, no result is returned.
     */
    List<Book> findBooksByIds(Set<String> ids);
  }

  record Book(String id, String title) {};
```

A requirement of this table is that it has to support join, and we would like 
to support joining by ID in an efficient way.

The goal is to define a rule that forces the query planner to always push down 
join predicates into a table scan.

Given the following `book` table:

```java
/** A table which represents books, queryable only by their ID. */
final class BookTable extends AbstractTable implements FilterableTable {
  private final BookService service;

  BookTable(BookService service) {
    this.service = service;
  }

  @Override
  public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return new RelDataTypeFactory.Builder(typeFactory)
        .add("id", SqlTypeName.VARCHAR)
        .add("title", SqlTypeName.VARCHAR)
        .build();
  }

  @Override
  public Enumerable<Object[]> scan(DataContext root, List<RexNode> filters) {
    Set<String> bookIds = getBooksId(filters);
    List<Object[]> result = service.findBooksByIds(bookIds)
        .stream()
        .map(b -> new Object[]{b.id, b.title})
        .toList();

    return Linq4j.asEnumerable(result);
  }

  private static Set<String> getBooksId(List<RexNode> filters) {
    if (filters.size() != 1) {
      throw new IllegalArgumentException("Expected one filter to the ID, found: 
%d".formatted(filters.size()));
    }

    RexNode filter = filters.get(0);
    RexNode leftCondition = ((RexCall) filter).getOperands().get(0);
    RexNode rightCondition = ((RexCall) filter).getOperands().get(1);

    if (leftCondition instanceof RexInputRef left
        && rightCondition instanceof RexLiteral right
        // The index of the ID column is 1.
        && left.getIndex() == 1) {
      if (filter.isA(SqlKind.EQUALS)) {
        String bookId = right.getValue2().toString();
        return ImmutableSet.of(bookId);
      }
      if (filter.isA(SqlKind.SEARCH)) {
        @SuppressWarnings("unchecked")
        Sarg<NlsString> searchArguments = right.getValueAs(Sarg.class);
        return searchArguments.rangeSet.asRanges().stream()
            .map(Range::lowerEndpoint)
            .map(NlsString::getValue)
            .collect(toSet());
      }
    }
    throw new IllegalArgumentException("Unexpected operator, found: 
%s".formatted(filter.getKind()));
  }
}
```

The API of the `BookService` always expects a set of IDs, and in case of query 
like (assuming an entry in `book` matching the ID `'a'`):
```sql
WITH config (id, val) AS (
    VALUES ('a', 3), ('b', 5)
)
SELECT b.* FROM books b
INNER JOIN config ON b.id = config.id
WHERE config.val > 4
```

Calcite produces the following plan:
```
EnumerableCalc(expr#0..2=[{inputs}], id=[$t1], title=[$t2])
  EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
    EnumerableSort(sort0=[$0], dir0=[ASC])
      EnumerableCalc(expr#0..1=[{inputs}], id=[$t0])
        EnumerableTableScan(table=[[books]])
    EnumerableSort(sort0=[$0], dir0=[ASC])
      EnumerableTableScan(table=[[books]])
```

This means Calcite performs a full table scan of the `book` table, and since 
the `RexNode` filters in the `scan` method are empty, no result is returned (in 
this example we are using a value statement scoped views, but ideally the 
solution we are looking for should be valid for other table).
Under some defined circustances, Postgres generates Nested-Loop query plan for 
join: it first selects the row of the table A matching a given condition, then 
iterates over the retrieved rows and performs a scan of table B looking for 
rows that match the join condition; and this seems something we would like to 
enforce it here.

In summary, we would like to implement a table which in case of JOIN is capable 
of loading the individual IDs matching the API of our service, rather than 
performing a full table scan.

Do you have any advises/feedback for us?

Thanks in advance.

Translate join predicates into filters

Reply via email to