On Mon, Feb 12, 2018 at 10:35:06AM +0000, Russel Winder via Digitalmars-d wrote:
> In all the discussion of Dub to date, it hasn't been pointed out that
> JVM building merged dependency management and build a long time ago.
> Historically:
>
>     Make → Ant → Maven → Gradle
>
> and Gradle can handle C++ as well as JVM language builds.
>
> So the integration of package management and build as seen in Go,
> Cargo, and Dub is not a group of outliers. Could it be then that it is
> the right thing to do. After all package management is a dependency
> management activity and build is a dependency management activity, so
> why separate them, just have a single ADG to describe the whole thing.
I have no problem with using a single ADG/DAG to describe the whole
thing.  However, a naïve implementation of this raises a few issues.

If a dependent node requires network access, it forces network access
every time the DAG is updated.  This is slow, and also unreliable: the
shape of the DAG could, in theory, change arbitrarily at any time,
outside the control of the user.  If I'm debugging a program, the very
last thing I want is for the act of building the software to also pull
in new library versions that make the location of the bug shift,
ruining any progress I may have made on narrowing down its locus.  It
would be nice to cache such network-dependent nodes locally, so that
they are refreshed only on demand.

Furthermore, a malicious external entity could introduce arbitrary
changes into the DAG: e.g., hijack an intermediate DNS server so that
network lookups get redirected to a malicious server that adds
dependencies on malware to your DAG.  The next time you update: boom,
your software now contains a trojan horse.  (Even better if you have
integrated package dependencies with builds all the way to deployment:
now all your customers have a copy of the trojan deployed on their
machines, too.)  To mitigate this, some kind of security model would
be required, e.g., verifiable server certificates and
cryptographically signed package payloads.  That adds to the cost of
refreshing network nodes, and hence is another big reason why this
should be done on demand, NOT automatically every time you ask for a
new build.

Also, if the machine I'm working on happens to be offline, it would
totally suck to be unable to build my project just because of that.
The whole point of having a DAG is reliable builds, and having the
graph depend on remote resources over an inherently unreliable network
defeats the purpose.  That is why caching is basically mandatory, as
is control over when the network is accessed.
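To make the caching point concrete, here is a minimal sketch of an
on-demand node cache (all names here are hypothetical illustrations,
not Dub's or any real tool's API): the network is touched only on a
cache miss or an explicit refresh, so offline builds keep working and
the DAG cannot change shape behind your back.

```python
import time

class NodeCache:
    """Local cache for network-dependent DAG nodes.

    Illustrative sketch only: class and method names are hypothetical,
    not the API of any existing package manager.
    """

    def __init__(self):
        self._store = {}  # node id -> (metadata, fetched_at)

    def resolve(self, node_id, fetch, refresh=False):
        # Touch the network only on a cache miss or an explicit
        # refresh: ordinary builds work offline, and the DAG cannot
        # shift shape mid-debugging-session.
        if refresh or node_id not in self._store:
            self._store[node_id] = (fetch(node_id), time.time())
        return self._store[node_id][0]

calls = []
def fake_fetch(node_id):
    # Stand-in for a registry query; a real fetch would also verify a
    # cryptographic signature on the payload before trusting it.
    calls.append(node_id)
    return {"name": node_id, "version": "1.0"}

cache = NodeCache()
cache.resolve("libfoo", fake_fetch)
cache.resolve("libfoo", fake_fetch)                # cache hit: no network
assert len(calls) == 1
cache.resolve("libfoo", fake_fetch, refresh=True)  # explicit, on-demand
assert len(calls) == 2
```

A real implementation would additionally persist the cache to disk and
pin exact versions (e.g., in a revision-controlled lockfile) so that
old builds remain reproducible.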
Furthermore, one always has to be mindful of the occasional need to
roll back.  Generally, source code control covers the local source
component: if you need to revert a change, just check out an earlier
revision from your repo.  But if a network resource that used to
provide library X v1.0 has since moved on to X v2.0 and dropped all
support for v1.0, so that it is no longer downloadable from the
server, then rollback is no longer possible.  You are now unable to
reproduce a build you made 2 years ago.  (Which you might need to do,
if a customer environment is still running the old version and you
need to debug it.)  IOW, the network is inherently unreliable; some
form of local caching / cache revision control is required.

[...]
> Then, in a DevOps world, there is deployment, which is usually a
> dependency management task. Is a totally new tool doing ADG
> manipulation really needed for this?

My answer is: the ADG/DAG manipulation should be a *library*, a
reusable component that can be integrated into the diverse systems
that require it.  That multiple systems implement functionality X is
not necessarily a valid reason to merge those systems into a single
monolithic monster.  Rather, what it *does* suggest is factoring out
functionality X so that it can be reused across those systems.

[...]
> Merging ideas from Dub, Gradle, and Reggae, into a project management
> tool for D (with C) projects is relatively straightforward of plan
> albeit really quite a complicated project. Creating the core ADG
> processing is the first requirement. It has to deal with external
> dependencies, project build dependencies, and deployment dependencies.

Your last sentence already shows why such a project is ill-advised:
while all of these tasks, in an abstract sense, reduce to nothing but
DAG manipulation, that is not an argument for integrating every system
that happens to use a DAG as its core algorithm into a single
monolithic system.
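To make the library argument concrete, such a reusable DAG core might
look something like the following sketch (illustrative names only; a
real component would also need cycle detection and incremental change
propagation):

```python
from collections import defaultdict

class Dag:
    """Tiny reusable DAG core: edges plus a dependency-first ordering.

    A hypothetical sketch, not any existing library's API.
    """

    def __init__(self):
        self.deps = defaultdict(set)  # node -> direct dependencies

    def add_edge(self, node, dep):
        self.deps[node].add(dep)
        self.deps.setdefault(dep, set())

    def topo_order(self):
        # Depth-first traversal: dependencies come before dependents.
        order, seen = [], set()
        def visit(n):
            if n in seen:
                return
            seen.add(n)
            for d in self.deps[n]:
                visit(d)
            order.append(n)
        for n in list(self.deps):
            visit(n)
        return order

# The same core serves a build graph...
build = Dag()
build.add_edge("app", "lib.o")
build.add_edge("lib.o", "lib.c")
order = build.topo_order()
assert order.index("lib.c") < order.index("lib.o") < order.index("app")

# ...and, unchanged, a package-dependency graph.
pkgs = Dag()
pkgs.add_edge("myproject", "vibe-d")
pkgs.add_edge("vibe-d", "openssl")
assert pkgs.topo_order()[0] == "openssl"
```

Each tool -- package manager, build system, deployment manager --
would own its node types and fetch logic, and share only the graph
algorithms.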
Rather, it's an indication that the DAG-manipulation code ought to be
a common library, reused across the systems that require such
functionality: external dependencies, build dependencies, and
deployment dependencies.

It's really very simple.  If your code has functions X and Y, and X
and Y have a lot of code in common, that does not mean you should
write a function Z that performs the role of both X and Y.  Rather,
it means you should factor the common parts out into a function W,
and reuse W from both X and Y.  (Alas, the former is seen all too
often in large "enterprise" software, where functions start out
straightforward with a clean API, and end up as monstrous chimeras
with 50 non-orthogonal, sometimes mutually contradictory parameters,
that can nevertheless do everything you want -- if only you can
figure out what exactly each parameter means and which subset of
parameters is actually relevant to your goal.)

Similarly, if you have systems P, Q, and R that all have DAG
manipulation as common functionality, that is an argument for
factoring out said DAG manipulation as a reusable component.  It is
not an argument for making a new system S that includes everything P,
Q, and R can do.  (Unless S can also provide new functionality that
P, Q, and R could not have achieved without such integration.)

[...]
> (*) The O(N) vs. O(n), SCons vs. Tup thing that T raised in another
> thread is important, but actually it is an implementation thing of how
> do you detect change, it isn't an algorithmic issue at a system design
> level. But it is important.

The O(N) vs. O(n) issue is actually very important once you generalize
beyond the specifics of build dependencies, especially if you start
talking about network-dependent DAGs.  If a task has a DAG that
depends on, say, 100 network nodes, then I absolutely do NOT want the
dependency-resolution tool to query all 100 nodes every time I ask
for a refresh.
That's just ridiculously inefficient.  Rather, the tool should
subscribe for updates from the network servers, so that they inform it
when their part of the DAG changes.  IOW, the amount of network
traffic should be proportional to the number of *changes* in the
remote nodes, NOT to the *total* number of nodes.

Similarly for deployment management: if my project has 100
installation targets (remote customer machines), each of which holds
1000 entities (say, files: data files and executables), then I really
do NOT want to scan all 1000 entities on all 100 installation targets
just to decide that only 50 files on 2 targets have changed.  I should
be able to push out only the files that have changed, and nothing
else.  IOW, the size of the update should be proportional to the size
of the change, NOT to the total size of the deployment.  Otherwise it
is simply not scalable, and will quickly become impractical as project
sizes grow.

If such considerations are not built into the system design at the top
level, you can be sure there will be inherent design flaws that
preclude an efficient implementation later on.  IOW, the cost of a DAG
update must be proportional to the size of the DAG change; nowhere may
there be an algorithm that scans the entire DAG (unless the changeset
itself covers the entire DAG).

T

-- 
Guns don't kill people. Bullets do.