Re: Dependency management for multiple IOs

Chamikara Jayalath Fri, 15 Feb 2019 18:06:58 -0800

I think the underlying problem is that two Beam modules transitively
depend on conflicting versions of the same dependency (a.k.a. the diamond
dependency problem)?


I think the general solution for this is twofold (at least the way we
have formulated it in https://beam.apache.org/contribute/dependencies/):

(1) Keep Beam dependencies up to date as much as possible, hoping that
transitive dependencies stay compatible (we rely on semantic versioning here
so that differences in minor/patch versions do not cause problems, though
that might not hold in practice for some dependencies).
(2) For modules with outdated dependencies that we cannot upgrade for
some reason, we'll vendor those modules.
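
When a diamond conflict does show up, a first step is usually to find out which jar a class was actually loaded from. The snippet below is a minimal sketch using only standard JDK calls. The class name `ClassOrigin` and the placeholder string are made up for illustration and are not part of Beam.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical helper (not part of Beam) that reports where a class was loaded from. */
final class ClassOrigin {
  private ClassOrigin() {}

  /** Returns the jar or directory URL of the class, or a placeholder if the JVM does not expose one. */
  static String locate(Class<?> type) {
    // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no CodeSource.
    CodeSource source = type.getProtectionDomain().getCodeSource();
    if (source == null) {
      return "<bootstrap or unknown>";
    }
    URL location = source.getLocation();
    return location == null ? "<bootstrap or unknown>" : location.toString();
  }

  public static void main(String[] args) {
    // Replace these with the class suspected of coming from the wrong version.
    System.out.println(locate(String.class));
    System.out.println(locate(ClassOrigin.class));
  }
}
```

Calling `ClassOrigin.locate(...)` on a class that two modules both pull in shows which version won on the classpath, which helps decide whether upgrading or vendoring is the right fix.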

Not sure if your specific problem needs something more.

Thanks,
Cham

On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin <[email protected]> wrote:

> Hi dev@,
>
> I have a problem: I don't know a good way to approach dependency
> management between Beam SQL and Beam IOs, and I want to collect thoughts
> about it.
>
> Beam SQL depends on specific IOs so that users can query them. The IOs
> need their dependencies to work. Sometimes the IOs also leak their
> transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in
> SQL we want to build abstractions on top of these IOs we risk having to
> bundle the whole IOs or the leaked dependencies. Overall we can probably
> avoid it by making the IOs `provided` dependencies, and by refactoring the
> code that leaks. In this case things can be made to build, simple tests
> will run, and we won't need to bundle the IOs within SQL.
>
> But as soon as there's a need to actually work with multiple IOs at the
> same time the conflicts appear. For example, for testing of Hive/HCatalog
> IOs in SQL we need to create an embedded Hive Metastore instance. It is a
> very Hive-specific thing that requires its own dependencies that have to be
> loaded during testing as part of the SQL project. And some other IOs (e.g.
> KafkaIO) can bring similar but conflicting dependencies, which means that we
> cannot easily work with or test both IOs at the same time within SQL. I
> think it will become insane as the number of IOs supported in SQL grows.
>
> So the question is how to avoid conflicts between IOs within SQL?
>
> One approach is to create separate packages for each of the SQL-specific
> IO wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
> `beam-sdks-java-extensions-sql-kafka`,
> etc. These projects will compile-depend on Beam SQL and on the specific IO.
> Beam SQL will load these either from user-specified configuration or
> something like @AutoService at runtime. This way Beam SQL doesn't know
> about the details of the IOs and their dependencies, and they can be easily
> tested in isolation without conflicting with each other. This should also
> be relatively simple to manage if things change; the build logic should be
> straightforward and easy to update. On the negative side, each of the
> projects will require its own separate build logic, it will not be easy to
> test multiple IOs together within SQL, and users will have to manage the
> conflicting dependencies by themselves.
>
> Another approach is to keep things roughly as they are but create separate
> configurations within the main `build.gradle` in the SQL project, where
> configurations will correspond to separate IOs or use cases (e.g. testing
> of Hive-related IOs). The benefit is that everything related to SQL IOs
> stays roughly in one place (including build logic) and can be built and
> tested together when possible. The downside is that it will probably
> involve some Groovy magic and classpath manipulation within Gradle tasks to
> make the configurations work, plus it may be brittle if we change our
> top-level Beam build logic. And this approach also doesn't make it easier
> for the users to manage the conflicts.
>
> Longer term we could probably also reduce the abstraction thickness on top
> of the IOs, so that Beam SQL can work directly with IOs. For this to work
> the supported IOs will need to expose things like `readRows()` and get/set
> the schema on the PCollection. This is probably aligned with the Schema
> work that's happening at the moment but I don't know whether it makes sense
> to focus on this right now. This does not solve the dependency problem
> either, but I think it will at least be the same problem that users
> already have if they see conflicts when using multiple IOs with Beam
> pipelines.
>
> Thoughts, ideas? Did anyone ever face a problem like this or am I
> completely misunderstanding something in the Beam build logic?
>
> Regards,
> Anton
>
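
The first approach in the quoted message (per-IO wrapper modules that Beam SQL discovers at runtime) could look roughly like the sketch below. It uses the JDK's `java.util.ServiceLoader`, which is the mechanism that `@AutoService` generates registration files for. The names `SqlIoProvider` and `SqlIoRegistry` are hypothetical and are not existing Beam SQL types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.Set;

/** Hypothetical extension point; each per-IO wrapper module would implement it. */
interface SqlIoProvider {
  /** Table type this provider handles, e.g. "kafka" or "hcatalog". */
  String tableType();
}

/** Core-side registry: it depends only on the interface, never on a concrete IO. */
final class SqlIoRegistry {
  private final Map<String, SqlIoProvider> providers = new LinkedHashMap<>();

  SqlIoRegistry(Iterable<SqlIoProvider> discovered) {
    for (SqlIoProvider provider : discovered) {
      // Fail fast if two wrapper modules claim the same table type.
      if (providers.putIfAbsent(provider.tableType(), provider) != null) {
        throw new IllegalStateException("Duplicate provider for " + provider.tableType());
      }
    }
  }

  /** Discovers whatever wrapper modules happen to be on the runtime classpath. */
  static SqlIoRegistry fromClasspath() {
    return new SqlIoRegistry(ServiceLoader.load(SqlIoProvider.class));
  }

  SqlIoProvider lookup(String tableType) {
    SqlIoProvider provider = providers.get(tableType);
    if (provider == null) {
      throw new IllegalArgumentException("No provider on the classpath for " + tableType);
    }
    return provider;
  }

  Set<String> tableTypes() {
    return Collections.unmodifiableSet(providers.keySet());
  }

  public static void main(String[] args) {
    // Stand-ins for providers that real wrapper modules would register.
    List<SqlIoProvider> fakes = Arrays.asList(() -> "kafka", () -> "hcatalog");
    System.out.println(new SqlIoRegistry(fakes).tableTypes());
  }
}
```

In a real setup each wrapper module would annotate its implementation with `@AutoService(SqlIoProvider.class)` and the core would call `fromClasspath()`. The core then compiles without any IO dependency, and each wrapper can be tested in isolation, but users who put several wrappers on one classpath still have to resolve conflicts themselves, as noted in the quoted message.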
