Hi Jark,

we had a long offline discussion yesterday where we considered all options again. The reasons why we decided for the updated design that Bowen suggested:

- Both Dawid and Xuefu argued that in the old design "kind" has binary meanings. I agree here. - Compared to other SQL concepts such as tables/functions/catalogs, a "name" is never part of the object itself but always specified when registering the object. We should have the same behavior for modules because of consistency reasons. It also makes the "name" explicit when it comes to unloading the module compared to the previous design. - Regarding, "How to solve the class conflict problem?" this is an orthogonal issue that will be fixed in future versions once we use Flink's new plugin architecture. If a module is parameterizable, it is the responsibility of the module to prevent class conflicts. If the Hive classes are hidden in a module classloader, it should be possible to load both hive121 and hive234. But in general this problem is unsolved for now, also Kafka tables could clash if you read from two Kafka clusters with different versions.

Regards,
Timo


On 10.10.19 08:01, Jark Wu wrote:
Hi Xuefu,

If there is only one instance per type, then what's the "name" used for?
Could we remove it and only keep "type" or "kind" to identify modules?

Best,
Jark

On Thu, 10 Oct 2019 at 11:21, Xuefu Z <usxu...@gmail.com> wrote:

Jark has a good point. However, I think validation logic can put in place
to restrict one instance per type. Maybe the doc needs to be specific on
this.

Thanks,
Xuefu

On Wed, Oct 9, 2019 at 7:41 PM Jark Wu <imj...@gmail.com> wrote:

Thanks Bowen for the updating.

I have some different opinions on the change.
IIUC, in the previous design, the "name" is also the "id" or "type" to
identify which module to load. Which means we can only load one instance
of
a module.
In the new design, the "name" is just an alias to the module instance,
the
"kind" is used to identify modules. Which means we can load different
instances of a module.
However, what's the "name" or alias used for? Do we need to support
loading
different instances of a module? From my point of view, it brings more
complexity and confusion.
For example, if we load a "hive121" which uses HiveModule with version
1.2.1 and load a "hive234" which uses HiveModule with version 2.3.4, then
how to solve the class conflict problem?

IMO, a module can only be load once in a session, so "name" maybe
useless.
So my proposal is similar to the previous one, but only change "name" to
"kind".

    SQL:
          LOAD MODULE "kind" [WITH (properties)];
          UNLOAD MODULE "kind";
     Table:
          tEnv.loadModule("kind" [, properties]);
          tEnv.unloadModule("kind");

What do you think?


Best,
Jark





On Wed, 9 Oct 2019 at 20:38, Bowen Li <bowenl...@gmail.com> wrote:

Thanks everyone for your review.

After discussing with Timo and Dawid offline, as well as incorporating
feedback from Xuefu and Jark on mailing list, I decided to make a few
critical changes to the proposal.

- renamed the keyword "type" to "kind". The community has plan to have
"type" keyword in yaml/descriptor refer to data types exclusively in
the
near future. We should cater to that change in our design
- allowed specifying names for modules to simplify and unify module
loading/unloading syntax between programming and SQL. Here're the
proposed
changes:
     SQL:
          LOAD MODULE "name" WITH ("kind"="xxx" [, (properties)])
          UNLOAD MODULE "name";
     Table:
          tEnv.loadModule("name", new Xxx(properties));
          tEnv.unloadModule("name");

I have completely updated the google doc [1]. Please take another look,
and
let me know if you have any other questions. Thanks!

[1]


https://docs.google.com/document/d/17CPMpMbPDjvM4selUVEfh_tqUK_oV0TODAUA9dfHakc/edit#

On Tue, Oct 8, 2019 at 6:26 AM Jark Wu <imj...@gmail.com> wrote:

Hi Bowen,

Thanks for the proposal. I have two thoughts:

1) Regarding to "loadModule", how about
tableEnv.loadModule("xxx" [, propertiesMap]);
tableEnv.unloadModule(“xxx”);

This makes the API similar to SQL. IMO, instance of Module is not
needed
and verbose as parameter.
And this makes it easier to load a simple module without any
additional
properties, e.g. tEnv.loadModule("GEO"), tEnv.unloadModule("GEO")

2) In current design, the module interface only defines function
metadata,
but no implementations.
I'm wondering how to call/map the implementations in runtime? Am I
missing
something?

Besides, I left some minor comments in the doc.

Best,
Jark


On Sat, 5 Oct 2019 at 08:42, Xuefu Z <usxu...@gmail.com> wrote:

I agree with Timo that the new table APIs need to be consistent.
I'd
go
further that an name (or id) is needed for module definition in
YAML
file.
In the current design, name is skipped and type has binary
meanings.
Thanks,
Xuefu

On Fri, Oct 4, 2019 at 5:24 AM Timo Walther <twal...@apache.org>
wrote:
Hi everyone,

first, I was also questioning my proposal. But Bowen's proposal
of
`tEnv.offloadToYaml(<file_path>)` would not work with the current
design
because we don't know how to serialize a catalog or module into
properties. Currently, there is no converter from instance to
properties. It is a one way conversion. We can add a
`toProperties`
method to both Catalog and Module class in the future to solve
this.
Solving the table environment serializability can be future work.

However, I find the current proposal for the TableEnvironment
methods
is
contradicting:

tableEnv.loadModule(new Yyy());
tableEnv.unloadModule(“Xxx”);

The loading is specified programmatically whereas the unloading
requires
a string that is not specified in the module itself. But is
defined
in
the factory according to the current design.

SQL does it more consistently. There, the name `xxx` is used when
loading and unloading the module:

LOAD MODULE 'xxx' [WITH ('prop'='myProp', ...)]
UNLOAD MODULE 'xxx’

How about:

tableEnv.loadModule("xxx", new Yyy());
tableEnv.unloadModule(“xxx”);

This would be similar to the catalog interfaces. The name is not
part
of
the instance itself.

What do you think?

Thanks,
Timo




On 01.10.19 21:17, Bowen Li wrote:
If something like the yaml file is the way to go and achieve
such
motivation, we would cover that with current design.

On Tue, Oct 1, 2019 at 12:05 Bowen Li <bowenl...@gmail.com>
wrote:
Hi Timo, Dawid,

I've added the suggested SQL and related changes to
TableEnvironment
API
and other classes to the google doc. Also removed "USE MODULE"
and
its
APIs. Will update FLIP wiki once we have a consensus.

W.r.t. descriptor approach, my gut feeling is similar to
Dawid's.
Besides,
I feel yaml file would be a better solution to persist
serializable
state
of an environment as the file itself is in serializable format
already.
Though yaml file only serves SQL CLI at this moment, we may be
able
to
extend its reach to Table API and allow users to load/offload
a
TableEnvironment from/to yaml files, as something like
"TableEnvironment
tEnv = TableEnvironment.loadFromYaml(<file_path>)" and
"tEnv.offloadToYaml(<file_path>)" to restore and persist
state,
and
try
to
make yaml file more expressive.


On Tue, Oct 1, 2019 at 6:47 AM Dawid Wysakowicz <
dwysakow...@apache.org
wrote:

Hi Timo, Bowen,

Unfortunately I did not have enough time to go through all
the
suggestions in details so I can not comment on the whole
FLIP.
I just wanted to give my opinion on the "descriptor approach
in
loadModule" part. I am not sure if we need it here. We might
be
overthinking this a bit. It definitely makes sense for
objects
like
TableSource/TableSink etc. as they are logical definitions
that
nearly
always have to be persisted in a Catalog. I'm not sure if we
really
need
the same for a whole session. If we need a resume session
feature,
the
way to go would probably be to keep the session in memory on
the
server
side. I fear we will never be able to serialize the whole
session
entirely (temporary objects, objects derived from DataStream
etc.)
I think it is ok to use instances for objects like Catalogs
or
Modules
and have an overlay on top of that that can create instances
from
properties.

Best,

Dawid

On 01/10/2019 11:28, Timo Walther wrote:
Hi Bowen,

thanks for your response.

Re 2) I also don't have a better approach for this issue. It
is
similar to changing the general TableConfig between two
statements.
It
would be good to add your explanation to the design
document.
Re 3) It would be interesting to know about which "core"
functions
we
are actually talking about. Also for the overriding built-in
functions
that we discussed in the other FLIP. But I'm fine with
leaving
it
to
the user for now. How about we just introduce loadModule(),
unloadModule() methods instead of useModules()? This would
ensure
that
users don't forget to add the core module when adding an
additional
module and they need to explicitly call
"unloadModule('core')".
Re 4) Every table environment feature should also be
designed
with
SQL
statements in mind to verify the concept. SQL is also more
popular
that Java/Scala API or YAML file. I would like to add it to
1.10
for
marking the feature as complete.

SHOW MODULES -> sounds good to me, we should add a
listModules():
List<String> method to table environment

LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] --> we
should
add a
loadModule() method to table environment

UNLOAD MODULE 'hive' --> we should add a unloadModule()
method
to
table environment

I would not introduce `USE MODULES 'x' 'y' 'z'` for
simplicity
and
concise API. Users need to load the module anyway with
properties.
They can also load them "in order" immediately. CREATE TABLE
can
also
not create multiple tables but only one at a time in that
order.
One thing that came to my mind, shall we use a descriptor
approach
for
loadModule()? The past has shown that passing instances
causes
problems when persisting objects. That's why we also want to
get
rid
of registerTableSource. I could image that users might want
to
persist
a table environment's state for later use in the future.
Even
though
this is future work, we should already keep such use cases
in
mind
when adding new API methods. What do you think?

Regards,
Timo


On 30.09.19 23:17, Bowen Li wrote:
Hi Timo,

Re 1) I agree. I renamed the title to "Extend Core Table
System
with
Pluggable Modules" and all internal references

Re 2) First, I'll rename the API to useModules(). The
design
doesn't
forbid
users to call useModules() multi times. Objects in modules
are
loaded
on
demand instead of eagerly, so there won't be inconsistency.
Users
have to
be fully aware of the consequences of resetting modules as
that
might
cause
that some objects can not be referenced anymore or
resolution
order
of some
objects changes.

Re 3) Yes, we'd leave that to users.

Another approach can be to have a non-optional "Core"
module
for
all
objects that cannot be overrode like "CAST" and "AS"
functions,
and
have an
optional "ExtendedCore" module for other replaceable
built-in
objects.
"Core" should be positioned at the 1st in module list all
the
time.
I'm fine with either solution.

Re 4) It may sound like a nice-to-have advanced feature for
1.10,
but
we
can surely fully discuss it for the sake of feature
completeness.
Unlike other configs, the order of modules would matter in
Flink,
and
it
implies the LOAD/UNLOAD commands would not be equal in
operation
positions.
IIUYC, LOAD MODULE 'x' would be interpreted as appending x
to
the
end
of
module list, and UNLOAD MODULE 'x' would be interpreted as
removing x
from
any position in the list?

I'm thinking of the following list of commands:

SHOW MODULES - list modules in order
LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] - load and
append
the
module to end of the module list
UNLOAD MODULE 'hive' - remove the module from module list,
and
other
modules remain the same relative positions
USE MODULES 'x' 'y' 'z' (wondering can parser take "'x' 'y'
'z'"?),
or USE
MODULES 'x,y,z' - to reorder module list completely


--
Xuefu Zhang

"In Honey We Trust!"


--
Xuefu Zhang

"In Honey We Trust!"


Reply via email to