Hello all!

I'm having a problem with TableSources' DataStream being duplicated when
pulled on from 2 sinks.

I understand that sometimes the best plan might just be to duplicate and
read both times a TableSource/SourceFunction; but in my case I can't quite
reproduce the data as say Kafka would. I just need the SourceFunction and
DataStream provided by the TableSource to not be duplicated.

As a workaround to this issue, I introduce some sort of materialization
barrier that makes the planner pull only on one instance of the
TableSource/SourceFunction:
Instead of:

tEnv.registerTableSource("foo_table", new FooTableSource());

I convert it to an Append Stream, and back again to a Table:

tEnv.registerTableSource("foo_table_source", new FooTableSource());
Table sourceTable = tEnv.sqlQuery("SELECT * FROM foo_table_source");
Table appendingSourceTable = tEnv.fromDataStream(
        tEnv.toAppendStream(sourceTable, Types.ROW(new
String[]{"field_1"}, new TypeInformation[]{Types.LONG()}))
);
tEnv.registerTable("foo_table", appendingSourceTable);

And the conversion to an Append Stream somewhat makes the planner behave
and there is only one DataSource in the execution plan.

But I'm feeling like I'm just missing a simple option (on the
SourceFunction, or on the TableSource?) to invoke and declare the Source as
being non duplicateable.

I have tried a lot of options (uid(), operation chaining restrictions,
twiddling the transformation, forceNonParallel(), etc.), but can't find
quite how to do that! My SourceFunction is a RichSourceFunction

At this point I'm wondering if this is a bug, or if it is a feature that
would have to be implemented.

Cheers,
Ben

Reply via email to