Hi,

Stream tables do not play well with hash joins: if a hash join tries to build
its lookup table from a stream, it can simply run out of memory.

Is there metadata (or something similar) to identify stream-like inputs, so
that a hash join never tries to build a lookup table out of the stream side?

The case is org.apache.calcite.test.StreamTest#testStreamToRelationJoin,
which transforms to the plan below. The plan is wrong because it builds the
hash lookup out of the second input, which happens to be an (infinite?)
(STREAM).

As a temporary workaround, I will increase the estimated row count of the
ORDERS table to 100'000, but it would be nice to make those decisions
metadata-driven.
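To illustrate what I mean by metadata-driven: a sketch (not actual Calcite
API; `RelInput`, `isStream`, and `chooseBuildSide` are hypothetical names)
of a guard that consults a "stream" flag before picking the hash-join build
side, instead of relying on row-count estimates alone.

```java
import java.util.Objects;

public class HashJoinGuard {
    // Hypothetical per-input metadata: name, stream flag, row estimate.
    record RelInput(String name, boolean isStream, double estimatedRows) {}

    // Pick the build side for a hash join: never a stream; among finite
    // inputs, prefer the smaller estimated row count. Returns null when
    // neither side can safely back an in-memory hash table.
    static RelInput chooseBuildSide(RelInput left, RelInput right) {
        boolean leftOk = !left.isStream();
        boolean rightOk = !right.isStream();
        if (leftOk && rightOk) {
            return left.estimatedRows() <= right.estimatedRows() ? left : right;
        }
        if (leftOk) return left;
        if (rightOk) return right;
        return null; // stream-to-stream: hash join is not applicable
    }

    public static void main(String[] args) {
        RelInput products = new RelInput("PRODUCTS", false, 200.0);
        RelInput orders = new RelInput("ORDERS", true, 100.0);
        // ORDERS has the smaller estimate, but as a stream it must not be
        // the build side; the finite PRODUCTS table is chosen instead.
        RelInput build = Objects.requireNonNull(chooseBuildSide(products, orders));
        System.out.println(build.name());
    }
}
```

With such a flag available to the planner, inflating the ORDERS row count
would no longer be necessary: the cost model could keep its honest estimate
and the stream side would still be excluded from building the lookup table.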

EnumerableProject(ROWTIME=[$2], ORDERID=[$3], SUPPLIERID=[$1]): rowcount = 3000.0, cumulative cost = {6950.0 rows, 9650.0 cpu, 0.0 io}, id = 603
  EnumerableHashJoin(condition=[=($0, $6)], joinType=[inner]): rowcount = 3000.0, cumulative cost = {3950.0 rows, 650.0 cpu, 0.0 io}, id = 602
    EnumerableInterpreter: rowcount = 200.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io}, id = 599
      BindableTableScan(table=[[STREAM_JOINS, PRODUCTS]]): rowcount = 200.0, cumulative cost = {2.0 rows, 2.0100000000000002 cpu, 0.0 io}, id = 122
    EnumerableProject(ROWTIME=[$0], ID=[$1], PRODUCT=[$2], UNITS=[$3], PRODUCT0=[CAST($2):VARCHAR(32) NOT NULL]): rowcount = 100.0, cumulative cost = {150.0 rows, 550.0 cpu, 0.0 io}, id = 601
      EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 600
        BindableTableScan(table=[[STREAM_JOINS, ORDERS, (STREAM)]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 182

Vladimir
