Re: Dependency management for multiple IOs

Kenneth Knowles Tue, 19 Feb 2019 11:40:41 -0800

Interesting. Since Schema and Row are already in the core SDK, how much
more needs to move somewhere else? Something like this?


 - sdks/java/tables contains Table, TableProvider, etc., as needed to
   implement the service loader bits. Could also go in core, since Schema and
   Row are there, but it depends on how far out this stuff gets. Do these
   classes have any use in the core?
 - sdks/java/io/... each depend on sdks/java/tables if they have a SQL
   interface
 - sdks/java/extensions/sql depends on tables and loads from the classpath

It seems very clean if it is this simple and we don't leak anything so
SQL-specific, and obviously not anything Calcite-specific.
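
To make the "loads from the classpath" part concrete, here is a minimal
sketch using plain java.util.ServiceLoader. Everything in it is an
assumption for illustration: the TableProvider interface is a simplified,
hypothetical stand-in with a single getTableType() method (not the actual
Beam interface), and TableProviderDiscovery and KafkaTableProvider are
made-up names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;

public class TableProviderDiscovery {

  /** Hypothetical, simplified stand-in for what would live in sdks/java/tables. */
  public interface TableProvider {
    /** Table type this provider handles, e.g. "kafka". */
    String getTableType();
  }

  /** Hypothetical provider that an IO module (not the SQL module) would ship. */
  public static class KafkaTableProvider implements TableProvider {
    @Override
    public String getTableType() {
      return "kafka";
    }
  }

  /** Indexes providers by table type; the first provider seen for a type wins. */
  static Map<String, TableProvider> index(Iterable<TableProvider> providers) {
    Map<String, TableProvider> byType = new LinkedHashMap<>();
    for (TableProvider provider : providers) {
      byType.putIfAbsent(provider.getTableType(), provider);
    }
    return byType;
  }

  /** What the SQL module would call: pick up whatever providers are registered. */
  static Map<String, TableProvider> loadFromClasspath() {
    // ServiceLoader only finds implementations listed in a
    // META-INF/services/<interface name> file on the classpath.
    return index(ServiceLoader.load(TableProvider.class));
  }

  public static void main(String[] args) {
    Map<String, TableProvider> found = loadFromClasspath();
    System.out.println("Discovered table types: " + found.keySet());
  }
}
```

With this shape the SQL module never names a concrete provider. Each IO jar
would have to carry its own META-INF/services entry (which is what
@AutoService generates), so a provider only shows up when the user puts that
IO on the classpath.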

Kenn

On Tue, Feb 19, 2019 at 10:59 AM Andrew Pilloud <apill...@google.com> wrote:

> I've thought about this a bit as well. It would be nice if we can move the
> SQL-specific IO wrappers into the IOs themselves. Your first proposal and
> long term match my thoughts. This at least keeps SQL from making the
> dependency problem worse than it is right now.
>
> On the general Java dependency problem, I don't think there is an easy way
> to solve it.
>
> On Mon, Feb 18, 2019 at 7:27 PM Kenneth Knowles <k...@google.com> wrote:
>
>> Yea. Beam SQL basically just has a clone of the Beam IO problem. So the
>> question is whether to mirror the module structure or to put all the
>> SQL-specific adapters in one module (or a few) with lots of optional
>> dependencies, and probably a complex build.gradle to support running a
>> variety of integration tests with different stuff on the classpath.
>>
>> Kenn
>>
>> On Mon, Feb 18, 2019 at 11:33 AM Reuven Lax <re...@google.com> wrote:
>>
>>> I don't think this is a SQL-specific problem. Beam (especially when
>>> using a variety of different IOs in a single pipeline) aggregates many
>>> dependencies into one binary, which sometimes creates this sort of pain.
>>> When users have organizational reasons to pin specific dependencies, then
>>> things get even worse. I still don't know if there is a perfect solution to
>>> all of this.
>>>
>>> Reuven
>>>
>>> On Fri, Feb 15, 2019 at 7:42 PM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>>> I'm not totally convinced Beam's dep versions are the issue here. A
>>>> user may have an organizational requirement of a particular version of,
>>>> say, Kafka and Hive. So when they depend on Beam they probably pin those
>>>> versions of Kafka and Hive which they have determined work together, and
>>>> they hope that the Beam IOs work together.
>>>>
>>>> I see this as a choice between two scenarios for users:
>>>>
>>>> 1. SQL <------- KafkaTable (@AutoService) ------> KafkaIO ---provided-----> Kafka
>>>> 2. SQL (includes KafkaTable) ----optional----> KafkaIO -----provided-----> Kafka
>>>>
>>>> For users of 1, they depend on Beam Java, Beam SQL, SQL Kafka Table,
>>>> and pin a version of Kafka.
>>>> For users of 2, they depend on Beam Java, Beam SQL, KafkaIO, and pin a
>>>> version of Kafka.
>>>>
>>>> To be honest it is really hard to see which is preferable. I think
>>>> number 1 has fewer funky dependency edges, more simple "compile + runtime"
>>>> dependencies.
>>>>
>>>> Kenn
>>>>
>>>> On Fri, Feb 15, 2019 at 6:06 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>>
>>>>> I think the underlying problem is two modules of Beam transitively
>>>>> depending on conflicting dependencies (a.k.a. the diamond dependency
>>>>> problem)?
>>>>>
>>>>> I think the general solution for this is twofold (at least the way
>>>>> we have formulated it in https://beam.apache.org/contribute/dependencies/):
>>>>>
>>>>> (1) Keep Beam dependencies as up to date as possible, hoping that
>>>>> transitive dependencies stay compatible (we rely on semantic versioning
>>>>> here so that differences in minor/patch versions do not cause problems;
>>>>> this might not hold in practice for some dependencies).
>>>>> (2) For modules with outdated dependencies that we cannot upgrade for
>>>>> some reason, we'll vendor those modules.
>>>>>
>>>>> Not sure if your specific problem needs something more.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin <ke...@google.com> wrote:
>>>>>
>>>>>> Hi dev@,
>>>>>>
>>>>>> I have a problem: I don't know a good way to approach the dependency
>>>>>> management between Beam SQL and Beam IOs, and want to collect thoughts
>>>>>> about it.
>>>>>>
>>>>>> Beam SQL depends on specific IOs so that users can query them. The
>>>>>> IOs need their dependencies to work. Sometimes the IOs also leak their
>>>>>> transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if
>>>>>> in SQL we want to build abstractions on top of these IOs we risk having
>>>>>> to bundle the whole IOs or the leaked dependencies. Overall we can
>>>>>> probably avoid it by making the IOs `provided` dependencies, and by
>>>>>> refactoring the code that leaks. In this case things can be made to
>>>>>> build, simple tests will run, and we won't need to bundle the IOs
>>>>>> within SQL.
>>>>>>
>>>>>> But as soon as there's a need to actually work with multiple IOs at
>>>>>> the same time the conflicts appear. For example, for testing of
>>>>>> Hive/HCatalog IOs in SQL we need to create an embedded Hive Metastore
>>>>>> instance. It is a very Hive-specific thing that requires its own
>>>>>> dependencies that have to be loaded during testing as part of the SQL
>>>>>> project. And some other IOs (e.g. KafkaIO) can bring similar but
>>>>>> conflicting dependencies, which means that we cannot easily work with
>>>>>> or test both IOs at the same time within SQL. I think it will become
>>>>>> insane as the number of IOs supported in SQL grows.
>>>>>>
>>>>>> So the question is how to avoid conflicts between IOs within SQL?
>>>>>>
>>>>>> One approach is to create separate packages for each of the
>>>>>> SQL-specific IO wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
>>>>>> `beam-sdks-java-extensions-sql-kafka`, etc. These projects will
>>>>>> compile-depend on Beam SQL and on the specific IO. Beam SQL will load
>>>>>> these either from user-specified configuration or something like
>>>>>> @AutoService at runtime. This way Beam SQL doesn't know about the
>>>>>> details of the IOs and their dependencies, and they can be easily
>>>>>> tested in isolation without conflicting with each other. This should
>>>>>> also be relatively simple to manage if things change; the build logic
>>>>>> should be straightforward and easy to update. On the negative side,
>>>>>> each of the projects will require its own separate build logic, it
>>>>>> will not be easy to test multiple IOs together within SQL, and users
>>>>>> will have to manage the conflicting dependencies by themselves.
>>>>>>
>>>>>> Another approach is to keep things roughly as they are but create
>>>>>> separate configurations within the main `build.gradle` in the SQL
>>>>>> project, where configurations will correspond to separate IOs or use
>>>>>> cases (e.g. testing of Hive-related IOs). The benefit is that
>>>>>> everything related to SQL IOs stays roughly in one place (including
>>>>>> build logic) and can be built and tested together when possible. The
>>>>>> negative side is that it will probably involve some Groovy magic and
>>>>>> classpath manipulation within Gradle tasks to make the configurations
>>>>>> work, plus it may be brittle if we change our top-level Beam build
>>>>>> logic. And this approach also doesn't make it easier for the users to
>>>>>> manage the conflicts.
>>>>>>
>>>>>> Longer term we could probably also reduce the abstraction thickness
>>>>>> on top of the IOs, so that Beam SQL can work directly with IOs. For
>>>>>> this to work the supported IOs will need to expose things like
>>>>>> `readRows()` and get/set the schema on the PCollection. This is
>>>>>> probably aligned with the Schema work that's happening at the moment
>>>>>> but I don't know whether it makes sense to focus on this right now.
>>>>>> The problem of the dependencies is not solved here either, but I think
>>>>>> it will be at least the same problem as the users already have if they
>>>>>> see conflicts when using multiple IOs with Beam pipelines.
>>>>>>
>>>>>> Thoughts, ideas? Did anyone ever face a problem like this or am I
>>>>>> completely misunderstanding something in the Beam build logic?
>>>>>>
>>>>>> Regards,
>>>>>> Anton
>>>>>>
>>>>>
