Re: Dependency management for multiple IOs

Kenneth Knowles Fri, 15 Feb 2019 19:35:59 -0800

I'm not totally convinced Beam's dep versions are the issue here. A user
may have an organizational requirement of a particular version of, say,
Kafka and Hive. So when they depend on Beam they probably pin those
versions of Kafka and Hive which they have determined work together, and
they hope that the Beam IOs work together.


I see this as a choice between two scenarios for users:

1. SQL <------- KafkaTable (@AutoService) ------> KafkaIO ---provided-----> Kafka
2. SQL (includes KafkaTable) ----optional----> KafkaIO -----provided-----> Kafka

Users of 1 depend on Beam Java, Beam SQL, and SQL Kafka Table, and pin a
version of Kafka.
Users of 2 depend on Beam Java, Beam SQL, and KafkaIO, and pin a
version of Kafka.

To be honest, it is really hard to see which is preferable. I think number 1
has fewer funky dependency edges and simpler "compile + runtime"
dependencies.
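
For illustration, here is a minimal sketch of how scenario 1 could discover
table modules at runtime. It assumes a hypothetical `SqlTableProvider`
interface (not an actual Beam API) and uses the standard
`java.util.ServiceLoader`, which is the mechanism that @AutoService feeds by
generating the `META-INF/services` entry at compile time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

public class TableProviderDiscovery {

    /** Hypothetical SPI that each per-IO SQL module (e.g. a Kafka table module) would implement. */
    public interface SqlTableProvider {
        /** Table type name used in SQL, e.g. "kafka". */
        String tableType();
    }

    /** Indexes providers by table type and rejects duplicate registrations. */
    public static Map<String, SqlTableProvider> index(Iterable<SqlTableProvider> providers) {
        Map<String, SqlTableProvider> byType = new HashMap<>();
        for (SqlTableProvider provider : providers) {
            SqlTableProvider previous = byType.putIfAbsent(provider.tableType(), provider);
            if (previous != null) {
                throw new IllegalStateException(
                    "Two providers registered for table type: " + provider.tableType());
            }
        }
        return byType;
    }

    /**
     * Finds whatever providers are on the classpath. The SQL core never
     * references KafkaIO or Kafka directly; only the provider module does.
     */
    public static Map<String, SqlTableProvider> discover() {
        return index(ServiceLoader.load(SqlTableProvider.class));
    }

    public static void main(String[] args) {
        System.out.println("Discovered table types: " + discover().keySet());
    }
}
```

With this shape, adding the Kafka table module to the classpath is what makes
the "kafka" table type available, and the user still pins the Kafka version
themselves.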

Kenn

On Fri, Feb 15, 2019 at 6:06 PM Chamikara Jayalath <[email protected]>
wrote:

> I think the underlying problem is two modules of Beam transitively
> depending on conflicting dependencies (a.k.a. the diamond dependency
> problem)?
>
> I think the general solution for this is twofold (at least the way we
> have formulated it in https://beam.apache.org/contribute/dependencies/):
>
> (1) Keep Beam dependencies up to date as much as possible, hoping that
> transitive dependencies stay compatible (we rely on semantic versioning
> here so that differences in minor/patch versions do not cause problems,
> although that might not hold in practice for some dependencies).
> (2) For modules with outdated dependencies that we cannot upgrade for
> some reason, we'll vendor those modules.
>
> Not sure if your specific problem needs something more.
>
> Thanks,
> Cham
>
> On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin <[email protected]> wrote:
>
>> Hi dev@,
>>
>> I have a problem: I don't know a good way to approach the dependency
>> management between Beam SQL and Beam IOs, and I want to collect thoughts
>> about it.
>>
>> Beam SQL depends on specific IOs so that users can query them. The IOs
>> need their dependencies to work. Sometimes the IOs also leak their
>> transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in
>> SQL we want to build abstractions on top of these IOs we risk having to
>> bundle the whole IOs or the leaked dependencies. Overall we can probably
>> avoid it by making the IOs `provided` dependencies, and by refactoring the
>> code that leaks. In this case things can be made to build, simple tests
>> will run, and we won't need to bundle the IOs within SQL.
>>
>> But as soon as there's a need to actually work with multiple IOs at the
>> same time the conflicts appear. For example, for testing of Hive/HCatalog
>> IOs in SQL we need to create an embedded Hive Metastore instance. It is a
>> very Hive-specific thing that requires its own dependencies that have to be
>> loaded during testing as part of SQL project. And some other IOs (e.g.
>> KafkaIO) can bring similar but conflicting dependencies which means that we
>> cannot easily work with or test both IOs at the same time within SQL. I
>> think it will become insane as the number of IOs supported in SQL grows.
>>
>> So the question is how to avoid conflicts between IOs within SQL?
>>
>> One approach is to create separate packages for each of the SQL-specific
>> IO wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
>> `beam-sdks-java-extensions-sql-kafka`, etc. These projects will
>> compile-depend on Beam SQL and on the specific IO.
>> Beam SQL will load these either from user-specified configuration or
>> something like @AutoService at runtime. This way Beam SQL doesn't know
>> about the details of the IOs and their dependencies, and they can be easily
>> tested in isolation without conflicting with each other. This should also
>> be relatively simple to manage if things change, since the build logic
>> should be straightforward and easy to update. On the negative side, each
>> of the projects will require its own separate build logic, it will not be
>> easy to test multiple IOs together within SQL, and users will have to
>> manage the conflicting dependencies by themselves.
>>
>> Another approach is to keep things roughly as they are but create
>> separate configurations within the main `build.gradle` in SQL project,
>> where configurations will correspond to separate IOs or use cases (e.g.
>> testing of Hive-related IOs). The benefit is that everything related to SQL
>> IOs stays roughly in one place (including build logic) and can be built and
>> tested together when possible. The negative side is that it will probably
>> involve some Groovy magic and classpath manipulation within Gradle tasks to
>> make the configurations work, plus it may be brittle if we change our
>> top-level Beam build logic. And this approach also doesn't make it easier
>> for the users to manage the conflicts.
>>
>> Longer term we could probably also reduce the abstraction thickness on
>> top of the IOs, so that Beam SQL can work directly with IOs. For this to
>> work the supported IOs will need to expose things like `readRows()` and
>> get/set the schema on the PCollection. This is probably aligned with the
>> Schema work that's happening at the moment but I don't know whether it
>> makes sense to focus on this right now. The problem of the dependencies is
>> not solved here either, but I think it will at least be the same problem
>> that users already have if they see conflicts when using multiple IOs with
>> Beam pipelines.
>>
>> Thoughts, ideas? Did anyone ever face a problem like this, or am I
>> completely misunderstanding something in the Beam build logic?
>>
>> Regards,
>> Anton
>>
>
