Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-15 Thread Cristian Constantinescu
Counter argument to the "in one box" thing.

I would like to point out that "having things in one box" is not a reason
to have the code residing in the same module/project.

What the user sees and how the code is structured are two very different
things in my opinion. Beam can certainly have modules developed at
different speeds and packaged "in one box" before the release. The Spring
Framework is a good example of that practice.

I would also like to show a very simple, current example where the Beam user
experience is lacking and unpredictable; in other words, where the
integration between Beam components is non-existent even though everything is
currently "in one box". Consider this code:

    var options = PipelineOptionsFactory.fromArgs(args).create();
    var p = Pipeline.create(options);
    p.getCoderRegistry().registerCoderForClass(
        FooAvroRecord.class, SerializableCoder.of(FooAvroRecord.class));
    var file =
        App.class.getClassLoader().getResource("avro1.avro").toURI().getPath();
    var read = p.apply(
        AvroIO.read(FooAvroRecord.class)
            .from(file));
    System.out.println("Using coder:" + read.getCoder());

Can you guess what coder this simple pipeline will print? If you guessed
SerializableCoder, you'd be wrong... it's "Using
coder:org.apache.beam.sdk.coders.AvroCoder@8b0f130", even though the user
explicitly registered the coder they want to be used.

Going by the argument that there is better integration because "everything
is in one box", there shouldn't be this disconnect between AvroIO and the
CoderRegistry, but here we are.
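
To make the disconnect concrete, here is a minimal, self-contained sketch of the
two lookup orders. It does not use Beam at all: SimpleCoderRegistry,
CoderLookupSketch, FooRecord and the string coder names are hypothetical
stand-ins, and it only models the behaviour observed above, not how AvroIO is
actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical element type, standing in for FooAvroRecord. */
final class FooRecord {}

/** Hypothetical stand-in for a coder registry; coders are just names here. */
final class SimpleCoderRegistry {
    private final Map<Class<?>, String> registered = new HashMap<>();

    void register(Class<?> type, String coderName) {
        registered.put(type, coderName);
    }

    Optional<String> lookup(Class<?> type) {
        return Optional.ofNullable(registered.get(type));
    }
}

final class CoderLookupSketch {
    /** Models the observed behaviour: the IO's default is used, the registry is never consulted. */
    static String ioDefaultWins(SimpleCoderRegistry registry, Class<?> type, String ioDefault) {
        return ioDefault;
    }

    /** Models the expected behaviour: an explicit registration wins, the IO default is a fallback. */
    static String registryWins(SimpleCoderRegistry registry, Class<?> type, String ioDefault) {
        return registry.lookup(type).orElse(ioDefault);
    }

    public static void main(String[] args) {
        SimpleCoderRegistry registry = new SimpleCoderRegistry();
        registry.register(FooRecord.class, "SerializableCoder");

        System.out.println("observed: " + ioDefaultWins(registry, FooRecord.class, "AvroCoder"));
        System.out.println("expected: " + registryWins(registry, FooRecord.class, "AvroCoder"));
    }
}
```

The second method is the order a user would probably assume after calling
registerCoderForClass: the registration is honoured when present, and the IO's
own choice only applies when nothing was registered.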

There are countless examples of these user experience issues that I can
provide.

Even more frustrating, not only is everything in one box, but it's
mostly a **closed** box. A simple example: I want to extend the *Utils
classes (AvroUtils, POJOUtils, etc.) so that the methods that return a
Beam Schema or SchemaCoder use the NanosInstant logical type for all
properties of type java.time.Instant, because I don't use joda.time.Instant
anywhere in my code. It would be nice to override a given method, inject an
implementation that Beam internals will use, or at least have some
configuration-based solution to achieve this. Yet, to my knowledge, that simply
is not possible right now, so things like the below are broken and very hard to
work around.

    var options = PipelineOptionsFactory.fromArgs(args).create();
    var p = Pipeline.create(options);
    var file =
        App.class.getClassLoader().getResource("avro1.avro").toURI().getPath();
    var read = p.apply(
        AvroIO.read(FooAvroRecord.class)
            .withBeamSchemas(true)
            .from(file));
    System.out.println("Using coder:" + read.getCoder());

This will crash and burn with the following:

Using coder:SchemaCoder
wrote:

> I agree with Sachin. Keeping components that users will have to bring
> together anyway leads to a better user experience. Counter example to that
> is GCP libraries in my opinion. It was a frequent struggle for users to
> find a working set of libraries until there was a BOM. And even after the
> BOM it is still somewhat of a struggle for users and the developers of
> those various libraries need to take on some of the toil of testing those
> various libraries together anyway.
>
> re: Take it with a grain of salt since I'm not even a committer - All
> inputs are welcome here. I do not think my comments should carry more
> weight just because I am a committer.
>
> On Wed, Dec 14, 2022 at 9:36 AM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
>
>> I strongly believe that we should continue to have Beam optimize for the
>> user - and while having separate components would allow those of us who are
>> contributors and committers move faster, the downsides of not having
>> everything "in one box" for a new user where the components are all
>> relatively guaranteed to work together at that version level are very high.
>>
>> Beam having everything included is absolutely a competitive advantage for
>> Beam and I would not want to lose that.
>>
>> On Wed, Dec 14, 2022 at 9:31 AM Byron Ellis via dev 
>> wrote:
>>
>>> Take it with a grain of salt since I'm not even a committer, but is
>>> perhaps the reorganization of Beam into smaller components the real work of
>>> a 3.0 effort? Splitting of Beam into smaller more independently managed
>>> components would be a pretty huge breaking change from a dependency
>>> management perspective which would potentially be largely separate from any
>>> code changes.
>>>
>>> Best,
>>> B
>>>
>>> On Wed, Dec 14, 2022 at 9:23 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
>>>> On 12 Dec 2022, at 22:23, Robert Bradshaw via dev
>>>> wrote:
>>>>
>>>>
>>>> Saving up all the breaking changes until a major release definitely
>>>> has its downsides (look at Python 3). The migration path is often as
>>>> important (if not more so) than the final destination.
>>>>
>>>>
>>>> Actually, it proves that the major releases *should

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-14 Thread Ahmet Altay via dev
I agree with Sachin. Keeping components that users will have to bring
together anyway leads to a better user experience. Counter example to that
is GCP libraries in my opinion. It was a frequent struggle for users to
find a working set of libraries until there was a BOM. And even after the
BOM it is still somewhat of a struggle for users and the developers of
those various libraries need to take on some of the toil of testing those
various libraries together anyway.

re: Take it with a grain of salt since I'm not even a committer - All
inputs are welcome here. I do not think my comments should carry more
weight just because I am a committer.

On Wed, Dec 14, 2022 at 9:36 AM Sachin Agarwal via dev 
wrote:

> I strongly believe that we should continue to have Beam optimize for the
> user - and while having separate components would allow those of us who are
> contributors and committers move faster, the downsides of not having
> everything "in one box" for a new user where the components are all
> relatively guaranteed to work together at that version level are very high.
>
> Beam having everything included is absolutely a competitive advantage for
> Beam and I would not want to lose that.
>
> On Wed, Dec 14, 2022 at 9:31 AM Byron Ellis via dev 
> wrote:
>
>> Take it with a grain of salt since I'm not even a committer, but is
>> perhaps the reorganization of Beam into smaller components the real work of
>> a 3.0 effort? Splitting of Beam into smaller more independently managed
>> components would be a pretty huge breaking change from a dependency
>> management perspective which would potentially be largely separate from any
>> code changes.
>>
>> Best,
>> B
>>
>> On Wed, Dec 14, 2022 at 9:23 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> On 12 Dec 2022, at 22:23, Robert Bradshaw via dev 
>>> wrote:
>>>
>>>
>>> Saving up all the breaking changes until a major release definitely
>>> has its downsides (look at Python 3). The migration path is often as
>>> important (if not more so) than the final destination.
>>>
>>>
>>> Actually, it proves that the major releases *should not* be delayed for
>>> a long period of time and *should* be issued more often to reduce the
>>> number of breaking changes (that, of course, likely may happen). That will
>>> help users to do much more smooth and less risky upgrades, and developers
>>> to not keep burden forever. Beam 2.0.0 was released back in May 2017 and
>>> we've almost never talked about Beam 3.0 and what are the criteria for it.
>>> I understand that it’s a completely different discussion but seems that
>>> this time has come =)
>>>
>>> As for this particular change, I would question how the benefit (it's
>>> unclear what the exact benefit is--better internal organization?)
>>> exceeds the pain of making every user refactor their code. I think a
>>> stronger case can be made for things like the Avro dependency that
>>> cause real pain.
>>>
>>>
>>> Agree. I think that if it doesn’t bring any pain with additional
>>> external dependencies and this code is used in almost every other SDK
>>> module, then there are no reasons for such breaking changes. On the other
>>> hand, Avro case, that you mentioned above, is a good example why sometimes
>>> it would be better to keep such code outside of “core”.
>>>
>>> As for the pipeline update feature, we've long discussed having
>>> "pick-your-implementation" transforms that specify alternative,
>>> equivalent implementations. Upgrades can choose the old one whereas
>>> new pipelines can get the latest and greatest. It won't solve all
>>> issues, and requires keeping old codepaths around, but could be an
>>> important step forward.
>>>
>>> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles 
>>> wrote:
>>>
>>>
>>> I agree with Moritz. To answer a few specifics in my own words:
>>>
>>> - It is a perfectly sensible refactor, but as a counterpoint without
>>> file-based IO the SDK isn't functional so it is also a reasonable design
>>> point to have this included. There are other things in the core SDK that
>>> are far less "core" and could be moved out with greater benefit. The main
>>> goal for any separation of modules would be lighter weight transitive
>>> dependencies, IMO.
>>>
>>> - No, Beam has not made any deliberate breaking changes of this nature.
>>> Hence we are still on major version 2. We have made some bugfixes for data
>>> loss risks that could be called "breaking changes" but since the feature
>>> was unsafe to use in the first place we did not bump the major version.
>>>
>>> - It is sometimes possible to do such a refactor and have the deprecated
>>> location proxy to the new location. In this case that seems hard to achieve.
>>>
>>> - It is not actually necessary to maintain both locations, as we can
>>> declare the old location will be unmaintained (but left alone) and all new
>>> development goes to the new location. That isn't a great choice for users
>>> who may simply upgrade their SDK version 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-14 Thread Sachin Agarwal via dev
I strongly believe that we should continue to have Beam optimize for the
user - and while having separate components would allow those of us who are
contributors and committers move faster, the downsides of not having
everything "in one box" for a new user where the components are all
relatively guaranteed to work together at that version level are very high.

Beam having everything included is absolutely a competitive advantage for
Beam and I would not want to lose that.

On Wed, Dec 14, 2022 at 9:31 AM Byron Ellis via dev 
wrote:

> Take it with a grain of salt since I'm not even a committer, but is
> perhaps the reorganization of Beam into smaller components the real work of
> a 3.0 effort? Splitting of Beam into smaller more independently managed
> components would be a pretty huge breaking change from a dependency
> management perspective which would potentially be largely separate from any
> code changes.
>
> Best,
> B
>
> On Wed, Dec 14, 2022 at 9:23 AM Alexey Romanenko 
> wrote:
>
>> On 12 Dec 2022, at 22:23, Robert Bradshaw via dev 
>> wrote:
>>
>>
>> Saving up all the breaking changes until a major release definitely
>> has its downsides (look at Python 3). The migration path is often as
>> important (if not more so) than the final destination.
>>
>>
>> Actually, it proves that the major releases *should not* be delayed for
>> a long period of time and *should* be issued more often to reduce the
>> number of breaking changes (that, of course, likely may happen). That will
>> help users to do much more smooth and less risky upgrades, and developers
>> to not keep burden forever. Beam 2.0.0 was released back in May 2017 and
>> we've almost never talked about Beam 3.0 and what are the criteria for it.
>> I understand that it’s a completely different discussion but seems that
>> this time has come =)
>>
>> As for this particular change, I would question how the benefit (it's
>> unclear what the exact benefit is--better internal organization?)
>> exceeds the pain of making every user refactor their code. I think a
>> stronger case can be made for things like the Avro dependency that
>> cause real pain.
>>
>>
>> Agree. I think that if it doesn’t bring any pain with additional external
>> dependencies and this code is used in almost every other SDK module, then
>> there are no reasons for such breaking changes. On the other hand, Avro
>> case, that you mentioned above, is a good example why sometimes it would be
>> better to keep such code outside of “core”.
>>
>> As for the pipeline update feature, we've long discussed having
>> "pick-your-implementation" transforms that specify alternative,
>> equivalent implementations. Upgrades can choose the old one whereas
>> new pipelines can get the latest and greatest. It won't solve all
>> issues, and requires keeping old codepaths around, but could be an
>> important step forward.
>>
>> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
>>
>>
>> I agree with Moritz. To answer a few specifics in my own words:
>>
>> - It is a perfectly sensible refactor, but as a counterpoint without
>> file-based IO the SDK isn't functional so it is also a reasonable design
>> point to have this included. There are other things in the core SDK that
>> are far less "core" and could be moved out with greater benefit. The main
>> goal for any separation of modules would be lighter weight transitive
>> dependencies, IMO.
>>
>> - No, Beam has not made any deliberate breaking changes of this nature.
>> Hence we are still on major version 2. We have made some bugfixes for data
>> loss risks that could be called "breaking changes" but since the feature
>> was unsafe to use in the first place we did not bump the major version.
>>
>> - It is sometimes possible to do such a refactor and have the deprecated
>> location proxy to the new location. In this case that seems hard to achieve.
>>
>> - It is not actually necessary to maintain both locations, as we can
>> declare the old location will be unmaintained (but left alone) and all new
>> development goes to the new location. That isn't a great choice for users
>> who may simply upgrade their SDK version and not notice that their old code
>> is now pointing at a version that will not receive e.g. security updates.
>>
>> - I like the style where if/when we transition from Beam 2 to Beam 3 we
>> should have the exact functionality of Beam 3 available as an opt-in flag
>> first. So if a user passes --beam-3 they get exactly what will be the
>> default functionality when we bump the major version. It really is a
>> problem to do a whole bunch of stuff feverishly before a major version
>> bump. The other style that I think works well is the linux kernel style
>> where major versions alternate between stable and unstable (in other words,
>> returning to the 0.x style with every alternating version).
>>
>> - I do think Beam suffers from fear and inability to do significant code
>> gardening. I don't think backwards compatibility in the 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-14 Thread Byron Ellis via dev
Take it with a grain of salt since I'm not even a committer, but is perhaps
the reorganization of Beam into smaller components the real work of a 3.0
effort? Splitting of Beam into smaller more independently managed
components would be a pretty huge breaking change from a dependency
management perspective which would potentially be largely separate from any
code changes.

Best,
B

On Wed, Dec 14, 2022 at 9:23 AM Alexey Romanenko 
wrote:

> On 12 Dec 2022, at 22:23, Robert Bradshaw via dev 
> wrote:
>
>
> Saving up all the breaking changes until a major release definitely
> has its downsides (look at Python 3). The migration path is often as
> important (if not more so) than the final destination.
>
>
> Actually, it proves that the major releases *should not* be delayed for a
> long period of time and *should* be issued more often to reduce the
> number of breaking changes (that, of course, likely may happen). That will
> help users to do much more smooth and less risky upgrades, and developers
> to not keep burden forever. Beam 2.0.0 was released back in May 2017 and
> we've almost never talked about Beam 3.0 and what are the criteria for it.
> I understand that it’s a completely different discussion but seems that
> this time has come =)
>
> As for this particular change, I would question how the benefit (it's
> unclear what the exact benefit is--better internal organization?)
> exceeds the pain of making every user refactor their code. I think a
> stronger case can be made for things like the Avro dependency that
> cause real pain.
>
>
> Agree. I think that if it doesn’t bring any pain with additional external
> dependencies and this code is used in almost every other SDK module, then
> there are no reasons for such breaking changes. On the other hand, Avro
> case, that you mentioned above, is a good example why sometimes it would be
> better to keep such code outside of “core”.
>
> As for the pipeline update feature, we've long discussed having
> "pick-your-implementation" transforms that specify alternative,
> equivalent implementations. Upgrades can choose the old one whereas
> new pipelines can get the latest and greatest. It won't solve all
> issues, and requires keeping old codepaths around, but could be an
> important step forward.
>
> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
>
>
> I agree with Moritz. To answer a few specifics in my own words:
>
> - It is a perfectly sensible refactor, but as a counterpoint without
> file-based IO the SDK isn't functional so it is also a reasonable design
> point to have this included. There are other things in the core SDK that
> are far less "core" and could be moved out with greater benefit. The main
> goal for any separation of modules would be lighter weight transitive
> dependencies, IMO.
>
> - No, Beam has not made any deliberate breaking changes of this nature.
> Hence we are still on major version 2. We have made some bugfixes for data
> loss risks that could be called "breaking changes" but since the feature
> was unsafe to use in the first place we did not bump the major version.
>
> - It is sometimes possible to do such a refactor and have the deprecated
> location proxy to the new location. In this case that seems hard to achieve.
>
> - It is not actually necessary to maintain both locations, as we can
> declare the old location will be unmaintained (but left alone) and all new
> development goes to the new location. That isn't a great choice for users
> who may simply upgrade their SDK version and not notice that their old code
> is now pointing at a version that will not receive e.g. security updates.
>
> - I like the style where if/when we transition from Beam 2 to Beam 3 we
> should have the exact functionality of Beam 3 available as an opt-in flag
> first. So if a user passes --beam-3 they get exactly what will be the
> default functionality when we bump the major version. It really is a
> problem to do a whole bunch of stuff feverishly before a major version
> bump. The other style that I think works well is the linux kernel style
> where major versions alternate between stable and unstable (in other words,
> returning to the 0.x style with every alternating version).
>
> - I do think Beam suffers from fear and inability to do significant code
> gardening. I don't think backwards compatibility in the code sense is the
> biggest blocker. I think the "pipeline update" feature is perhaps the thing
> most holding Beam back from making radical rapid forward progress.
>
> Kenn
>
> On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:
>
>
> Hi Damon,
>
>
>
> I fear the current release / versioning strategy of Beam doesn’t lend
> itself well for such breaking changes. Alexey and I have spent quite some
> time discussing how to proceed with the problematic Avro dependency in core
> (and respectively AvroIO, of course).
>
> Such changes essentially always require duplicating code to continue
> supporting a deprecated legacy code path to not 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-14 Thread Alexey Romanenko
On 12 Dec 2022, at 22:23, Robert Bradshaw via dev  wrote:
> 
> Saving up all the breaking changes until a major release definitely
> has its downsides (look at Python 3). The migration path is often as
> important (if not more so) than the final destination.

Actually, it proves that major releases should not be delayed for a long
period of time and should be issued more often, to reduce the number of breaking
changes (which, of course, are likely to happen). That will help users make much
smoother and less risky upgrades, and spare developers from carrying the burden
forever. Beam 2.0.0 was released back in May 2017, and we've almost never talked
about Beam 3.0 and what the criteria for it are. I understand that it’s a
completely different discussion, but it seems that the time has come =)

> As for this particular change, I would question how the benefit (it's
> unclear what the exact benefit is--better internal organization?)
> exceeds the pain of making every user refactor their code. I think a
> stronger case can be made for things like the Avro dependency that
> cause real pain.

Agree. I think that if it doesn’t bring any pain from additional external
dependencies and this code is used in almost every other SDK module, then there
is no reason for such breaking changes. On the other hand, the Avro case that
you mentioned above is a good example of why it is sometimes better to keep
such code outside of “core”.

> As for the pipeline update feature, we've long discussed having
> "pick-your-implementation" transforms that specify alternative,
> equivalent implementations. Upgrades can choose the old one whereas
> new pipelines can get the latest and greatest. It won't solve all
> issues, and requires keeping old codepaths around, but could be an
> important step forward.
> 
> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
>> 
>> I agree with Moritz. To answer a few specifics in my own words:
>> 
>> - It is a perfectly sensible refactor, but as a counterpoint without 
>> file-based IO the SDK isn't functional so it is also a reasonable design 
>> point to have this included. There are other things in the core SDK that are 
>> far less "core" and could be moved out with greater benefit. The main goal 
>> for any separation of modules would be lighter weight transitive 
>> dependencies, IMO.
>> 
>> - No, Beam has not made any deliberate breaking changes of this nature. 
>> Hence we are still on major version 2. We have made some bugfixes for data 
>> loss risks that could be called "breaking changes" but since the feature was 
>> unsafe to use in the first place we did not bump the major version.
>> 
>> - It is sometimes possible to do such a refactor and have the deprecated 
>> location proxy to the new location. In this case that seems hard to achieve.
>> 
>> - It is not actually necessary to maintain both locations, as we can declare 
>> the old location will be unmaintained (but left alone) and all new 
>> development goes to the new location. That isn't a great choice for users 
>> who may simply upgrade their SDK version and not notice that their old code 
>> is now pointing at a version that will not receive e.g. security updates.
>> 
>> - I like the style where if/when we transition from Beam 2 to Beam 3 we 
>> should have the exact functionality of Beam 3 available as an opt-in flag 
>> first. So if a user passes --beam-3 they get exactly what will be the 
>> default functionality when we bump the major version. It really is a problem 
>> to do a whole bunch of stuff feverishly before a major version bump. The 
>> other style that I think works well is the linux kernel style where major 
>> versions alternate between stable and unstable (in other words, returning to 
>> the 0.x style with every alternating version).
>> 
>> - I do think Beam suffers from fear and inability to do significant code 
>> gardening. I don't think backwards compatibility in the code sense is the 
>> biggest blocker. I think the "pipeline update" feature is perhaps the thing 
>> most holding Beam back from making radical rapid forward progress.
>> 
>> Kenn
>> 
>> On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:
>>> 
>>> Hi Damon,
>>> 
>>> 
>>> 
>>> I fear the current release / versioning strategy of Beam doesn’t lend 
>>> itself well for such breaking changes. Alexey and I have spent quite some 
>>> time discussing how to proceed with the problematic Avro dependency in core 
>>> (and respectively AvroIO, of course).
>>> 
>>> Such changes essentially always require duplicating code to continue 
>>> supporting a deprecated legacy code path to not break users’ code. But this 
>>> comes at a very high price. Until the deprecated code path can be finally 
>>> removed again, it must be maintained in two places.
>>> 
>>> Unfortunately, the removal of deprecated code is rather problematic without 
>>> a major version release as it would break semantic versioning and people’s 
>>> expectations. With that deprecations 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Cristian Constantinescu
Hi,

"As for the pipeline update feature, we've long discussed
having "pick-your-implementation" transforms that specify
alternative, equivalent implementations."

Could someone point me to where this was discussed please? I seem to have
missed that whole topic. Is it like a dependency injection type of thing?
If so, it's one thing I would love to see in Beam.

Thanks,
Cristian

On Mon, Dec 12, 2022 at 4:23 PM Robert Bradshaw via dev 
wrote:

> Saving up all the breaking changes until a major release definitely
> has its downsides (look at Python 3). The migration path is often as
> important (if not more so) than the final destination.
>
> As for this particular change, I would question how the benefit (it's
> unclear what the exact benefit is--better internal organization?)
> exceeds the pain of making every user refactor their code. I think a
> stronger case can be made for things like the Avro dependency that
> cause real pain.
>
> As for the pipeline update feature, we've long discussed having
> "pick-your-implementation" transforms that specify alternative,
> equivalent implementations. Upgrades can choose the old one whereas
> new pipelines can get the latest and greatest. It won't solve all
> issues, and requires keeping old codepaths around, but could be an
> important step forward.
>
> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
> >
> > I agree with Moritz. To answer a few specifics in my own words:
> >
> >  - It is a perfectly sensible refactor, but as a counterpoint without
> file-based IO the SDK isn't functional so it is also a reasonable design
> point to have this included. There are other things in the core SDK that
> are far less "core" and could be moved out with greater benefit. The main
> goal for any separation of modules would be lighter weight transitive
> dependencies, IMO.
> >
> >  - No, Beam has not made any deliberate breaking changes of this nature.
> Hence we are still on major version 2. We have made some bugfixes for data
> loss risks that could be called "breaking changes" but since the feature
> was unsafe to use in the first place we did not bump the major version.
> >
> >  - It is sometimes possible to do such a refactor and have the
> deprecated location proxy to the new location. In this case that seems hard
> to achieve.
> >
> >  - It is not actually necessary to maintain both locations, as we can
> declare the old location will be unmaintained (but left alone) and all new
> development goes to the new location. That isn't a great choice for users
> who may simply upgrade their SDK version and not notice that their old code
> is now pointing at a version that will not receive e.g. security updates.
> >
> >  - I like the style where if/when we transition from Beam 2 to Beam 3 we
> should have the exact functionality of Beam 3 available as an opt-in flag
> first. So if a user passes --beam-3 they get exactly what will be the
> default functionality when we bump the major version. It really is a
> problem to do a whole bunch of stuff feverishly before a major version
> bump. The other style that I think works well is the linux kernel style
> where major versions alternate between stable and unstable (in other words,
> returning to the 0.x style with every alternating version).
> >
> >  - I do think Beam suffers from fear and inability to do significant
> code gardening. I don't think backwards compatibility in the code sense is
> the biggest blocker. I think the "pipeline update" feature is perhaps the
> thing most holding Beam back from making radical rapid forward progress.
> >
> > Kenn
> >
> > On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:
> >>
> >> Hi Damon,
> >>
> >>
> >>
> >> I fear the current release / versioning strategy of Beam doesn’t lend
> itself well for such breaking changes. Alexey and I have spent quite some
> time discussing how to proceed with the problematic Avro dependency in core
> (and respectively AvroIO, of course).
> >>
> >> Such changes essentially always require duplicating code to continue
> supporting a deprecated legacy code path to not break users’ code. But this
> comes at a very high price. Until the deprecated code path can be finally
> removed again, it must be maintained in two places.
> >>
> >> Unfortunately, the removal of deprecated code is rather problematic
> without a major version release as it would break semantic versioning and
> people’s expectations. With that deprecations bear the inherent risk to
> unintentionally deplete quality rather than improving it.
> >>
> >> I’d therefore recommend against such efforts unless there’s very strong
> reasons to do so.
> >>
> >>
> >>
> >> Best, Moritz
> >>
> >>
> >>
> >> On 07.12.22, 18:05, "Damon Douglas via dev" 
> wrote:
> >>
> >>
> >>
> >> Hello Everyone, If you identify yourself on the Beam learning journey,
> even if this is your first day, please see yourself as a welcome
> participant in this conversation and consider reviewing the bottom portion
> of this 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Robert Bradshaw via dev
Saving up all the breaking changes until a major release definitely
has its downsides (look at Python 3). The migration path is often as
important (if not more so) than the final destination.

As for this particular change, I would question how the benefit (it's
unclear what the exact benefit is--better internal organization?)
exceeds the pain of making every user refactor their code. I think a
stronger case can be made for things like the Avro dependency that
cause real pain.

As for the pipeline update feature, we've long discussed having
"pick-your-implementation" transforms that specify alternative,
equivalent implementations. Upgrades can choose the old one whereas
new pipelines can get the latest and greatest. It won't solve all
issues, and requires keeping old codepaths around, but could be an
important step forward.
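
For illustration only, a "pick-your-implementation" transform could be modelled
roughly as below. This is a plain-Java sketch under stated assumptions, not an
existing Beam API: VersionedTransform, PickYourImplementationSketch, pinned and
latest are hypothetical names, and the integer version key is just one possible
way to identify an implementation.

```java
import java.util.List;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical holder for several equivalent implementations of one transform. */
final class VersionedTransform<I, O> {
    // Implementations keyed by the version that introduced them, in ascending order.
    private final TreeMap<Integer, Function<I, O>> implementations = new TreeMap<>();

    VersionedTransform<I, O> add(int version, Function<I, O> implementation) {
        implementations.put(version, implementation);
        return this;
    }

    /** New pipelines get the newest registered implementation. */
    Function<I, O> latest() {
        if (implementations.isEmpty()) {
            throw new IllegalStateException("no implementations registered");
        }
        return implementations.lastEntry().getValue();
    }

    /** Updated pipelines keep the implementation they were originally built with. */
    Function<I, O> pinned(int version) {
        Function<I, O> implementation = implementations.get(version);
        if (implementation == null) {
            throw new IllegalArgumentException("unknown version: " + version);
        }
        return implementation;
    }
}

final class PickYourImplementationSketch {
    // Two equivalent ways to sum a list; the old code path is kept around.
    static final Function<List<Integer>, Integer> SUM_V1 = values -> {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    };
    static final Function<List<Integer>, Integer> SUM_V2 =
        values -> values.stream().mapToInt(Integer::intValue).sum();

    static VersionedTransform<List<Integer>, Integer> sumTransform() {
        return new VersionedTransform<List<Integer>, Integer>().add(1, SUM_V1).add(2, SUM_V2);
    }

    public static void main(String[] args) {
        VersionedTransform<List<Integer>, Integer> sum = sumTransform();
        List<Integer> input = List.of(1, 2, 3);
        System.out.println("pinned v1: " + sum.pinned(1).apply(input));
        System.out.println("latest:    " + sum.latest().apply(input));
    }
}
```

The cost mentioned above is visible even in this toy: SUM_V1 has to stay in the
codebase for as long as any running pipeline may still be pinned to it.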

On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
>
> I agree with Moritz. To answer a few specifics in my own words:
>
>  - It is a perfectly sensible refactor. As a counterpoint, though, the SDK
> isn't functional without file-based IO, so including it is also a reasonable
> design point. There are other things in the core SDK that are far less "core"
> and could be moved out with greater benefit. The main goal for any separation
> of modules would be lighter-weight transitive dependencies, IMO.
>
>  - No, Beam has not made any deliberate breaking changes of this nature. 
> Hence we are still on major version 2. We have made some bugfixes for data 
> loss risks that could be called "breaking changes" but since the feature was 
> unsafe to use in the first place we did not bump the major version.
>
>  - It is sometimes possible to do such a refactor and have the deprecated 
> location proxy to the new location. In this case that seems hard to achieve.
>
>  - It is not actually necessary to maintain both locations, as we can declare 
> the old location will be unmaintained (but left alone) and all new 
> development goes to the new location. That isn't a great choice for users who 
> may simply upgrade their SDK version and not notice that their old code is 
> now pointing at a version that will not receive e.g. security updates.
>
>  - I like the style where if/when we transition from Beam 2 to Beam 3 we 
> should have the exact functionality of Beam 3 available as an opt-in flag 
> first. So if a user passes --beam-3 they get exactly what will be the default 
> functionality when we bump the major version. It really is a problem to do a 
> whole bunch of stuff feverishly before a major version bump. The other style 
> that I think works well is the Linux kernel style where major versions
> alternate between stable and unstable (in other words, returning to the 0.x 
> style with every alternating version).
>
>  - I do think Beam suffers from fear and inability to do significant code 
> gardening. I don't think backwards compatibility in the code sense is the 
> biggest blocker. I think the "pipeline update" feature is perhaps the thing 
> most holding Beam back from making radical rapid forward progress.
>
> Kenn
>
> On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:
>>
>> Hi Damon,
>>
>>
>>
>> I fear the current release / versioning strategy of Beam doesn’t lend itself
>> well to such breaking changes. Alexey and I have spent quite some time
>> discussing how to proceed with the problematic Avro dependency in core (and,
>> by extension, AvroIO).
>>
>> Such changes essentially always require duplicating code so that a
>> deprecated legacy code path keeps working and users’ code does not break.
>> But this comes at a very high price: until the deprecated code path can
>> finally be removed, the code must be maintained in two places.
>>
>> Unfortunately, removing deprecated code is rather problematic without a
>> major version release, as it would break semantic versioning and people’s
>> expectations. As a result, deprecations carry an inherent risk of
>> unintentionally degrading quality rather than improving it.
>>
>> I’d therefore recommend against such efforts unless there are very strong
>> reasons to do so.
>>
>>
>>
>> Best, Moritz
>>
>>
>>
>> On 07.12.22, 18:05, "Damon Douglas via dev"  wrote:
>>
>>
>>
>> Hello Everyone,
>>
>>
>>
>> If you identify yourself on the Beam learning journey, even if this is your 
>> first day, please see yourself as a welcome participant in this conversation 
>> and consider reviewing the bottom portion of this email for guidance.
>>
>>
>>
>> The Short Version (For those with Java Beam SDK knowledge):
>>
>>
>>
>> Should we migrate FileIO / TextIO and related classes from :sdks:java:core 
>> to :sdks:java:io:file?  If so, should we target such a migration to a 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Kenneth Knowles
I agree with Moritz. To answer a few specifics in my own words:

 - It is a perfectly sensible refactor. As a counterpoint, though, the SDK
isn't functional without file-based IO, so including it is also a reasonable
design point. There are other things in the core SDK that are far less "core"
and could be moved out with greater benefit. The main goal for any separation
of modules would be lighter-weight transitive dependencies, IMO.

 - No, Beam has not made any deliberate breaking changes of this nature.
Hence we are still on major version 2. We have made some bugfixes for data
loss risks that could be called "breaking changes" but since the feature
was unsafe to use in the first place we did not bump the major version.

 - It is sometimes possible to do such a refactor and have the deprecated
location proxy to the new location. In this case that seems hard to achieve.

 - It is not actually necessary to maintain both locations, as we can
declare the old location will be unmaintained (but left alone) and all new
development goes to the new location. That isn't a great choice for users
who may simply upgrade their SDK version and not notice that their old code
is now pointing at a version that will not receive e.g. security updates.

 - I like the style where if/when we transition from Beam 2 to Beam 3 we
should have the exact functionality of Beam 3 available as an opt-in flag
first. So if a user passes --beam-3 they get exactly what will be the
default functionality when we bump the major version. It really is a
problem to do a whole bunch of stuff feverishly before a major version
bump. The other style that I think works well is the Linux kernel style
where major versions alternate between stable and unstable (in other words,
returning to the 0.x style with every alternating version).

 - I do think Beam suffers from fear and inability to do significant code
gardening. I don't think backwards compatibility in the code sense is the
biggest blocker. I think the "pipeline update" feature is perhaps the thing
most holding Beam back from making radical rapid forward progress.
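
A minimal sketch of the "deprecated location proxies to the new location"
pattern from the third point above. The class names are hypothetical
stand-ins, not Beam classes, and both live in one file only to keep the
example self-contained.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical new location; in a real move this would live in the new module and package. */
final class NewLocationLineReader {
  private NewLocationLineReader() {}

  static List<String> readLines(String text) {
    return Arrays.asList(text.split("\n"));
  }
}

/** Hypothetical old location, kept only as a thin forwarding shim. */
@Deprecated
final class OldLocationLineReader {
  private OldLocationLineReader() {}

  /** @deprecated use {@link NewLocationLineReader#readLines(String)} instead. */
  @Deprecated
  static List<String> readLines(String text) {
    // No logic of its own: all behavior comes from the new location.
    return NewLocationLineReader.readLines(text);
  }
}

final class DeprecatedProxyDemo {
  @SuppressWarnings("deprecation")
  public static void main(String[] args) {
    // Existing callers keep compiling against the old name...
    System.out.println(OldLocationLineReader.readLines("first\nsecond"));
    // ...while new code calls the new location directly.
    System.out.println(NewLocationLineReader.readLines("first\nsecond"));
  }
}
```

For the shim to compile across modules, the old module would need a
dependency on the new one. That may be part of why the approach seems hard
for FileIO and TextIO, though this is an assumption and not something
stated in the thread.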

Kenn

On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:

> Hi Damon,
>
>
>
> I fear the current release / versioning strategy of Beam doesn’t lend
> itself well to such breaking changes. Alexey and I have spent quite some
> time discussing how to proceed with the problematic Avro dependency in core
> (and, by extension, AvroIO).
>
> Such changes essentially always require duplicating code so that a
> deprecated legacy code path keeps working and users’ code does not break.
> But this comes at a very high price: until the deprecated code path can
> finally be removed, the code must be maintained in two places.
>
> Unfortunately, removing deprecated code is rather problematic without a
> major version release, as it would break semantic versioning and people’s
> expectations. As a result, deprecations carry an inherent risk of
> unintentionally degrading quality rather than improving it.
>
> I’d therefore recommend against such efforts unless there are very strong
> reasons to do so.
>
>
>
> Best, Moritz
>
>
>
> On 07.12.22, 18:05, "Damon Douglas via dev"  wrote:
>
>
>
> Hello Everyone,
>
>
>
> *If you identify yourself on the Beam learning journey, even if this is
> your first day, please see yourself as a welcome participant in this
> conversation and consider reviewing the bottom portion of this email for
> guidance.*
>
>
>
> *The Short Version (For those with Java Beam SDK knowledge)*:
>
>
>
> Should we migrate FileIO / TextIO and related classes from :sdks:java:core
> to :sdks:java:io:file?  If so, should we target such a migration to a
> future Beam version with repeated announcements?  Does the Beam repository
> have any example of a similar change in the past?  What learnings from said
> past change could be potentially applied to this one?
>
>
>
> *The Long Version (For those on the learning path)*:
>
>
>
> This email is more about our repository organization than about Beam itself.
> The proposal is to move two highly used classes (and anything related) in
> our Java SDK called FileIO [1] and TextIO [2].  The Beam GitHub repository
> uses a build tool called Gradle [3] to automate routine code tasks such as
> build and test.  Gradle projects, such as Beam, organize code in what are
> called modules [4].  The three main ingredients that make a module are 1) a
> unique directory path, 2) a file called build.gradle (or build.gradle.kts)
> in this directory, and 3) a reference to the module in a settings.gradle
> (or settings.gradle.kts) file at the root of the repository.
>
>
>
> The gradle documentation discusses why such organization might matter and
> how to achieve this 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Moritz Mack
Hi Damon,

I fear the current release / versioning strategy of Beam doesn’t lend itself
well to such breaking changes. Alexey and I have spent quite some time
discussing how to proceed with the problematic Avro dependency in core (and,
by extension, AvroIO).

Such changes essentially always require duplicating code so that a deprecated
legacy code path keeps working and users’ code does not break. But this comes
at a very high price: until the deprecated code path can finally be removed,
the code must be maintained in two places.

Unfortunately, removing deprecated code is rather problematic without a major
version release, as it would break semantic versioning and people’s
expectations. As a result, deprecations carry an inherent risk of
unintentionally degrading quality rather than improving it.

I’d therefore recommend against such efforts unless there are very strong
reasons to do so.

Best, Moritz

On 07.12.22, 18:05, "Damon Douglas via dev"  wrote:

Hello Everyone,

If you identify yourself on the Beam learning journey, even if this is your 
first day, please see yourself as a welcome participant in this conversation 
and consider reviewing the bottom portion of this email for guidance.

The Short Version (For those with Java Beam SDK knowledge):

Should we migrate FileIO / TextIO and related classes from :sdks:java:core to 
:sdks:java:io:file?  If so, should we target such a migration to a future Beam 
version with repeated announcements?  Does the Beam repository have any example 
of a similar change in the past?  What learnings from said past change could be 
potentially applied to this one?

The Long Version (For those on the learning path):

This email is more about our repository organization than about Beam itself.
The proposal is to move two highly used classes (and anything related) in our
Java SDK called FileIO [1] and TextIO [2].  The Beam GitHub repository uses a
build tool called Gradle [3] to automate routine code tasks such as build and
test.  Gradle projects, such as Beam, organize code in what are called modules
[4].  The three main ingredients that make a module are 1) a unique directory
path, 2) a file called build.gradle (or build.gradle.kts) in this directory,
and 3) a reference to the module in a settings.gradle (or settings.gradle.kts)
file at the root of the repository.

The Gradle documentation discusses why such organization might matter and how
to achieve it in large projects [5].  Essentially, modules allow us to have
mini-projects inside our large project and to scope related automation to that
one portion of the larger repository.  In Beam, we have the module
:sdks:java:core [6] with all things related to the core of Beam, whereas we
have separate modules related to reading from and writing to various resources
within :sdks:java:io [7].

The proposal suggests moving the aforementioned file reading and writing
classes, FileIO and TextIO, and anything related, to their own
:sdks:java:io:file module.  This would correspond to creating a new
sdks/java/io/file directory and moving these classes into
sdks/java/io/file/src/main/java/org/apache/beam/sdk/io/file.

Definitions / References:

1. FileIO - general-purpose transforms for working with files: listing files
(matching), reading, and writing.  See
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.html

2. TextIO - Similar to FileIO but focused on text files.  See 
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.html

3. Gradle - a build automation tool used by the Apache Beam repository to 
automate code-related tasks.  See 
https://docs.gradle.org/current/userguide/what_is_gradle.html

4. Gradle Module - a subsection of your larger repository.  See 
https://docs.gradle.org/current/userguide/dependency_management_terminology.html#sub:terminology_module

5. Structuring Large Projects with Gradle - 

[Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-07 Thread Damon Douglas via dev
Hello Everyone,

*If you identify yourself on the Beam learning journey, even if this is
your first day, please see yourself as a welcome participant in this
conversation and consider reviewing the bottom portion of this email for
guidance.*

*The Short Version (For those with Java Beam SDK knowledge)*:

Should we migrate FileIO / TextIO and related classes from :sdks:java:core
to :sdks:java:io:file?  If so, should we target such a migration to a
future Beam version with repeated announcements?  Does the Beam repository
have any example of a similar change in the past?  What learnings from said
past change could be potentially applied to this one?

*The Long Version (For those on the learning path)*:

This email is more about our repository organization than about Beam itself.
The proposal is to move two highly used classes (and anything related) in our
Java SDK called FileIO [1] and TextIO [2].  The Beam GitHub repository uses
a build tool called Gradle [3] to automate routine code tasks such as build
and test.  Gradle projects, such as Beam, organize code in what are called
modules [4].  The three main ingredients that make a module are 1) a unique
directory path, 2) a file called build.gradle (or build.gradle.kts) in this
directory, and 3) a reference to the module in a settings.gradle (or
settings.gradle.kts) file at the root of the repository.

The Gradle documentation discusses why such organization might matter and
how to achieve it in large projects [5].  Essentially, modules allow us to
have mini-projects inside our large project and to scope related automation
to that one portion of the larger repository.  In Beam, we have the module
:sdks:java:core [6] with all things related to the core of Beam, whereas we
have separate modules related to reading from and writing to various
resources within :sdks:java:io [7].

The proposal suggests moving the aforementioned file reading and writing
classes, FileIO and TextIO, and anything related, to their own
:sdks:java:io:file module.  This would correspond to creating a new
sdks/java/io/file directory and moving these classes into
sdks/java/io/file/src/main/java/org/apache/beam/sdk/io/file.

*Definitions / References*:

1. FileIO - general-purpose transforms for working with files: listing
files (matching), reading, and writing.  See
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.html

2. TextIO - Similar to FileIO but focused on text files.  See
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.html

3. Gradle - a build automation tool used by the Apache Beam repository to
automate code-related tasks.  See
https://docs.gradle.org/current/userguide/what_is_gradle.html

4. Gradle Module - a subsection of your larger repository.  See
https://docs.gradle.org/current/userguide/dependency_management_terminology.html#sub:terminology_module

5. Structuring Large Projects with Gradle -
https://docs.gradle.org/current/userguide/structuring_software_products.html

6. sdks:java:core - Corresponds to the sdks/java/core repository directory.
See https://github.com/apache/beam/tree/master/sdks/java/core

7. sdks:java:io - Corresponds to the sdks/java/io repository directory.
See https://github.com/apache/beam/tree/master/sdks/java/io

Best,

Damon