Re: Intra-project dependencies

Benedict Mon, 16 Jan 2023 10:54:56 -0800

We have a build script that is invoked by ant to grab a specific SHA (or HEAD of a branch). We were previously just grabbing HEAD, but this has the problems mentioned elsewhere in the thread, amongst others. I don’t think it matters much whether we use a build script or submodules.

I am driven in part by wanting to maintain the library status, and not wanting to discard the work done to maintain this, but equally by my expectation that tying Accord to the C* version would entail additional maintenance burden (which in the near term might fall predominantly on me).

I could be wrong in this prediction of course, but it seems to be a one-sided trade. I don’t think there’s much extra work with separate repositories even in the worst case of a 1:1 mapping, and we can more easily reverse this decision if there’s no external interest and we really are just 1:1 for several releases.

That said, clearly we don’t want to pursue this approach for every subsystem. So perhaps one of the decisive reasons is indeed the broader utility, but the fact the library is fully decoupled is by itself a strong reason IMO.

I guess an interesting thought exercise to validate this is what other idealised subsystems I might want to apply this approach to. I’ll ponder that.

On 16 Jan 2023, at 18:32, Henrik Ingo <henrik.i...@datastax.com> wrote:

Hi Benedict

At least for my part, again, I'm not (yet) trying to argue for or against a particular alternative. So I think you'll find that if you allow a few more iterations of discussion, we can gravitate to some good consensus. Or failing that, we can at least gravitate around a small number of alternatives and then argue about those :-D

It seems also in your email, the strongest argument for keeping a separate library, is your desire or expectation that Accord would attract significant 3rd party interest. And - this is btw also some advice Magnus Carlsen would give - your main argument therefore is, if we expect we need to make a specific move in the future, it's usually best to just do it immediately.

I didn't write in my previous email, but I did have in mind that one drawback with the proposal of later extracting Accord out of Cassandra into its own repository would be to lose the history of commits. (At least without significant effort to keep/recreate the history.) For example, there could be commits in the Accord history that also edit files in Cassandra. So yes, I agree that if this is a major goal, then keeping Accord development in its own repository is the right choice.

This then leads to the question of whether the link from Cassandra to Accord should be via git sub-modules or via some bash code in the build system. I now remember something that was a major problem for years in the MongoDB CI system, and I believe this is also a problem with our dtests: the nightly CI system would just check out HEAD of each module, then compile them and run tests. This had the problem that it was impossible to return to a specific failure, say, a week later, and expect to rebuild and retest the same combination, because the system would just check out and build whatever the HEAD was at that date. (The only way to test the actual SHA you had been bisecting or patching was to submit it as a patch to the CI system. So if a test setup had 5 sub modules, and you were fixing a bug in one of them, you had to "patch" the 4 other ones too, simply because otherwise the CI system wouldn't check out the right position in their history.)

So, whatever method we choose, it's important that our CI system and other tools can know and track the correct and current SHA for each sub-module. Presumably git sub-modules actually are the best answer to this need. How have you dealt with this in Accord so far?
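One way to picture the tracking requirement above is a small manifest that a CI run stores next to its results, recording the SHA of every module it built. The following is only a sketch under assumptions: the `Manifest` type, the function names, and the module names in the usage note are hypothetical and are not taken from the Cassandra or MongoDB CI systems.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical manifest shape: module name -> commit SHA built in that run.
Manifest = Dict[str, str]


def dump_manifest(manifest: Manifest) -> str:
    """Serialise the manifest deterministically so it can be stored with a CI run."""
    return json.dumps(manifest, indent=2, sort_keys=True)


def diff_manifests(
    old: Manifest, new: Manifest
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return the modules whose SHA differs between two runs.

    A module missing from one side is reported with None on that side.
    """
    changed = {}
    for module in sorted(set(old) | set(new)):
        before, after = old.get(module), new.get(module)
        if before != after:
            changed[module] = (before, after)
    return changed
```

With something like this, rebuilding last week's failure means checking out exactly the SHAs in the stored manifest, and `diff_manifests({"cassandra": sha_a, "accord": sha_b}, tonight)` would show which modules moved between two nightly runs.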

One point: I wouldn't directly compare dtest and Accord though. For a test framework, it's the dtest framework that is consuming a Cassandra version, while for Accord it's Cassandra that depends on a specific Accord version. Because of this, the same solution may or may not be right for both of them.

henrik

On Mon, Jan 16, 2023 at 6:44 PM Benedict <bened...@apache.org> wrote:
How often have we modified Paxos?

There are currently no proposals to develop Accord further after the initial release. So I think it is very likely that Accord development will decouple from Cassandra version, unless there is significant external interest that drives it.

Furthermore, the idea of revisiting this later is problematic. We can’t easily decouple Accord if it becomes tightly coupled with Cassandra, which becomes quite likely when the builds are co-dependent. We have spent great effort developing them separately to avoid this.

You can’t go back later and recover lost interest. How many projects have adopted ZAB, versus Raft?

Nor does any of this address the wider need for reform of our approach here, for both the dtest-api and the simulator.

I’m still not clear on the concrete downsides of maintaining a separate tree here? Could somebody explain what they expect to go wrong? I respond to Mick’s points below, as I do not recognise them from our experience. We’ve been doing this for a year without incident.

I will note we explicitly voted to develop Accord as a standalone library as part of the original CEP, and this was debated quite extensively, so to change that will require a new dedicated DISCUSS thread and vote.

- you can no longer just `git clone …` (and we clone automatically in a number of places)
Yes you can, if your build script updates the sub modules like we have been doing.

- same with `git pull …` (easy to be left with out-of-sync submodules)
Yes you can, again for the same reason. This is no different to ensuring your libraries are in sync, which must be done on every pull or checkout.

- permanence from a git SHA no longer exists
It is intact, if you link to a SHA.

- our releases get more complicated (our source tarballs are the asf releases)
How?

- handling patches that cover submodules
How is this different to patches affecting multiple versions in C*?

- switching branches, and using git worktrees, during dev
Elaborate? I don’t see any problem, but I might be missing something.

On 16 Jan 2023, at 16:11, Henrik Ingo <henrik.i...@datastax.com> wrote:

Hi all

I was invited to share my thoughts just as an additional and somewhat fresh point of view...

On a high level: We talked through this with Mick and a few other colleagues, and I/we came to the conclusion that fundamentally all of the mentioned options 1-5 are just variations of the same problem being moved into different places. That is to say there's complexity here that isn't going away. This is good to recognize just so that you realize when you are feeling that you don't quite like any of the available options, this is why. At least for me it's somehow calming when you understand this is the reality and you just have to face it.

It seems to me the fundamental question is, will the link from Cassandra to Accord be a 1-1 or n-1 mapping? Superficially we would think that Accord is a separate library and all future Cassandra versions will use the same version of Accord. But is that really the case? Isn't it rather expected that Cassandra 5.1, 5.2 will probably come with more and improved functionality than what will be in 5.0? Fundamental additional functionality like less-than-strict consistency, mvcc, and maybe one day interactive transactions. What I'd expect to see here is then that the separate Accord library in fact is rather closely tied to its parent Cassandra release, and as soon as we have a 5.0 GA, we will also need a stable Accord branch to match, while significant new development will happen in tandem with Cassandra trunk/5.1?

If the latter scenario is more likely, then having Accord in tree seems to be the easiest choice, because it's actually not the case that you are maintaining three copies of the same codebase. (Any more than that's the case for all Cassandra code.)

FWIW MongoDB does in fact use option 5: At build time there's a bash script that copies your separate WiredTiger repository into the source tree, then compiles. A major reason they did it this way was to support the possibility that some modules would be closed source. Git modules would not work - or at least be very annoying - for a case where the parent directory is open source but the sub-module is not available to everyone.

But having used the MongoDB system - which apparently is also Accord's system today - I'd say in the end it's just git submodules in a different form: You get to choose whether to manage the library dependency with git or a bash script.

Finally, and I know this was stated before as well, the Accord developers seem hopeful that Accord will gain interest and contributors from outside of Cassandra, and as such warrants its own repository. For argument's sake, let's assume this is possible/likely...

I didn't write this email to support any particular alternative or opinion. But combining the above thoughts, I feel like there is a conclusion sticking out of this email... And the conclusion is of the form "we can always change this later"...

It seems to me that especially now, and probably also after 5.0 is released, we will in any case only have a single version of Cassandra using a single version of Accord. So at least to begin with, it's the least effort to keep it in-tree, to avoid the overhead of git submodules, or having to make releases, etc. The separate constituency of Accord-only developers can be satisfied by keeping Accord in its own directory, could even be a top-level directory, and a small build system that can build a separate Accord jar file. You could even maintain a separate github repo just for advertising purposes. (Just like github.com/apache/cassandra isn't the official git repo for Cassandra either.)

If both of my assumptions above are true, then from a Cassandra point of view there's not much benefit having Accord separately, but if 3rd party interest in Accord grows, then it could indeed be split out into its own repository at that point. The main motivation then would be to service those 3rd party developers who aren't so interested in Cassandra. But this split would only be done once it is known that such a community will form.

Thoughts?

henrik

On Mon, Jan 16, 2023 at 2:30 PM Josh McKenzie <jmcken...@apache.org> wrote:
- permanence from a git SHA no longer exists
With the caveat that I haven't worked w/submodules before and only know about them from a cursory search, it looks like `git submodule status` would show us the SHA for each submodule, and we could have parent projects reference specific SHAs to pull for submodules to build? https://git-scm.com/docs/git-submodule/#Documentation/git-submodule.txt-status--cached--recursive--ltpathgt82308203

It seems like our use case is one of the primary ones git submodules are designed to address.
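As a rough illustration of how tooling could read those SHAs, the sketch below parses the text printed by `git submodule status`. Per the git documentation, each line carries the SHA-1 and the submodule path, with the SHA prefixed by `-` if the submodule is not initialised, `+` if the checked-out commit differs from the one recorded in the parent's index, and `U` if there are merge conflicts. The class and function names are hypothetical, and the sketch assumes submodule paths contain no whitespace.

```python
from typing import List, NamedTuple


class SubmoduleStatus(NamedTuple):
    state: str  # "ok", "uninitialized", "out-of-sync" or "conflict"
    sha: str
    path: str


# Prefix characters documented for `git submodule status`.
_STATES = {" ": "ok", "-": "uninitialized", "+": "out-of-sync", "U": "conflict"}


def parse_submodule_status(output: str) -> List[SubmoduleStatus]:
    """Parse `git submodule status` output into (state, sha, path) entries.

    Assumes paths contain no whitespace; the trailing "(describe)" field,
    when present, is ignored.
    """
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        if line[0] in _STATES:
            state, rest = _STATES[line[0]], line[1:]
        else:
            # The leading space was stripped by the caller: treat as "ok".
            state, rest = "ok", line
        fields = rest.split()
        entries.append(SubmoduleStatus(state, fields[0], fields[1]))
    return entries
```

A CI job could fail fast if any entry has a state other than "ok", which would catch the out-of-sync checkouts raised earlier in the thread.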

On Mon, Jan 16, 2023, at 6:40 AM, Benedict wrote:

I guess option 5 is what we have today in cep-15: have the build file grab the relevant SHA for the library. This way you maintain a precise SHA for builds and scripts don’t have to be modified.

I believe this is also possible with git submodules, but I’m happy to bake this into our build file instead with a script.
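To make the "build file grabs the relevant SHA" idea concrete, here is a minimal sketch of the git invocations such a step might issue. It is not the actual cep-15 build script; the function names and the rule of rejecting anything other than a full 40-character SHA are assumptions made for illustration.

```python
import re
from typing import List

# A full SHA-1 object name is 40 lowercase hex characters. Anything else
# (a branch name, "HEAD", an abbreviated hash) counts as unpinned here.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned(ref: str) -> bool:
    """Return True only if ref is a full 40-character SHA-1."""
    return FULL_SHA.fullmatch(ref) is not None


def checkout_commands(repo_url: str, ref: str, dest: str) -> List[List[str]]:
    """Build the git commands that fetch repo_url into dest at commit ref.

    Raises ValueError for anything that is not a full SHA, so the build
    cannot silently follow a moving branch head.
    """
    if not is_pinned(ref):
        raise ValueError(f"not a pinned commit: {ref!r}")
    return [
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", "--detach", ref],
    ]
```

Each returned command could be run with `subprocess.run(cmd, check=True)`. Keeping the pinned SHA in a file that is versioned with the parent project gives the same per-commit permanence that a submodule entry would.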

> As the library itself no longer has an explicit version, what I presume you meant by logical version.

I mean that we don’t want to duplicate work and risk diverging functionality maintaining what is logically (meant to be) the same code. As a developer, managing all of the branches is already a pain. Libraries naturally have a different development cadence to the main project, and tying the development to C* versions is just an unnecessary ongoing burden (and risk) that we can avoid.

There’s also an additional penalty: we reduce the likelihood of outside contributions to the libraries only. Accord in particular I hope will attract outside interest if it is maintained as a separate library, as it has broad applicability, and is likely of academic interest. Tying it to C* version and more tightly coupling with C* codebase makes that less likely. We might also see folk interested in our utilities, or our simulator framework, if they were to be maintained separately, which could be valuable.

On 16 Jan 2023, at 10:49, Mick Semb Wever <m...@apache.org> wrote:

I think (4) is the only sensible option. It permits different development branches to easily reference different versions of a library and also to easily co-develop them - from within the same IDE project, even.

I've only heard horror stories about submodules. The challenges they bring should be listed and checked.

Some examples
- you can no longer just `git clone …` (and we clone automatically in a number of places)
- same with `git pull …` (easy to be left with out-of-sync submodules)
- permanence from a git SHA no longer exists
- our releases get more complicated (our source tarballs are the asf releases)
- handling patches that cover submodules
- switching branches, and using git worktrees, during dev

I see (4) as a valid option, but am concerned with the amount of work required to adapt to it, and whether it will only make it more complicated for the new contributor to the project. For example the first two points are addressed by remembering to do `git clone --recurse-submodules …`. And who would be fixing our build/test/release scripts to accommodate?
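For an existing checkout, where `git clone --recurse-submodules …` no longer applies, a build script can bring the submodules to the commits recorded by the parent repository. The sketch below only assembles the commands; the function name and the idea of calling it from the build are assumptions, not a description of the current Cassandra build.

```python
from typing import List


def submodule_sync_commands(checkout_dir: str) -> List[List[str]]:
    """Commands that make submodules match the SHAs recorded by the parent commit."""
    return [
        # Copy any changed submodule URLs from .gitmodules into local config.
        ["git", "-C", checkout_dir, "submodule", "sync", "--recursive"],
        # Initialise missing submodules and check out the recorded commits.
        ["git", "-C", checkout_dir, "submodule", "update", "--init", "--recursive"],
    ]
```

Running these after every pull or branch switch (for example via `subprocess.run(cmd, check=True)`) would spare contributors from having to remember the extra flag.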

Not blockers, just concerns we need to raise and address.

We might even be able to avoid additional release votes as a matter of course, by compiling the library source as part of the C* release, so that they adopt the C* release vote (or else we may periodically release the library as we do other releases)

Yes. Today we do a combination of first (3) and then (1). Having to make a release of these libraries every time a patch (/feature branch) is completing is a horror story in itself.

I might be missing something, does anyone have any other bright ideas for approaching this problem? I’m sure there are plenty of opinions out there.

Looking at the problem with these libraries,
- we don't need releases
- we don't have a clean version/branch parity to in-tree
- codebase parity between branches is important for upgrade tests (shared classloaders)

For (2) you mention drift of the "same" version, isn't this only a problem for dtest-api in the way it requires the "same version" of a codebase for compatibility when running upgrade tests? As the library itself no longer has an explicit version, what I presume you meant by logical version.

To begin with, I'm leaning towards (2) because it is a cognitive re-use of our release branches, and the problems around classpath compatibility can be solved with tests. I'm sure I'm not seeing the whole picture though…

--
Henrik Ingo
+358 40 569 7354
