On Mon, Sep 4, 2023 at 4:38 AM William Kenworthy <bi...@iinet.net.au> wrote:
>
> On 4/9/23 16:04, Nuno Silva wrote:
> >
> > (But note that Rich was suggesting using the *search* feature of the
> > gitweb interface, which, in this case, also finds the same topmost
> > commit if I search for "reedsolomon".)
> >
> tkx, missed that!

Note that in terms of indexing, git and CVS have their pros and cons,
because they use different data structures. I've heard the saying that
Git is a data structure masquerading as an SCM, and certainly the
inconsistencies in the command line operations bear that out. Git tends
to be much more useful in general, but for things like finding deleted
files CVS was definitely more time-efficient.

The reason for this is that everything in git is reachable via commits,
and these are reachable from a head via a linked list. The most recent
commit gives access to the current version of the repository, and a
pointer to the immediately previous commit(s). To find a deleted file,
git must go to the most recent commit in whatever branch you are
searching, then descend its tree to look for the file. If it is not
found, it then goes to the previous commit and descends that tree.

There are 745k commits in the active Gentoo repository. I think there
are something like 2M of them in the historical one. Each commit is a
random seek, and then each step down the directory tree to find a file
is another random seek.

In CVS everything is organized first by file, and then each file has
its own commit history. So finding a file, deleted or otherwise, just
requires a seek for each level in the directory tree. Then you can
directly read its history.

So finding an old deleted file in the Gentoo git repo can require
millions of reads, while doing so in CVS only required about 3. It is
no surprise that the web interfaces were designed to make that
operation much easier - if you do sufficiently complex searches in the
git web interface it will time you out to avoid bogging down the
server, which is why some searches may require you to clone the repo
and do it locally.

Now, if you want to find out what changed in a particular commit the
situation is reversed. If you identify a commit in git and want to see
what changed, it can directly read the commit from disk using its hash.
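
As a rough illustration of the two lookups so far (walking history to
find a deleted file, and reading a commit directly by its hash), here
is a toy Python sketch. It is only a model of the idea, not how git or
CVS actually store data on disk. Every name in it (put, commit,
git_find_last_version, the cvs dict, the file paths) is made up for
the example.

```python
import hashlib

# Toy git-like store: every object is filed under a hash of its content.
objects = {}

def put(obj):
    """Store obj under the SHA-1 of its repr and return that key."""
    key = hashlib.sha1(repr(obj).encode()).hexdigest()
    objects[key] = obj
    return key

def commit(files, parent=None):
    """files maps path -> content. A commit holds a tree hash and a parent hash."""
    tree = put(("tree", tuple(sorted(files.items()))))
    return put(("commit", tree, parent))

def git_find_last_version(head, path):
    """Walk back from head, one commit at a time, until path shows up.

    Returns (commits_visited, content); content is None if never found.
    """
    visited = 0
    key = head
    while key is not None:
        _, tree, parent = objects[key]
        visited += 1
        files = dict(objects[tree][1])
        if path in files:
            return visited, files[path]
        key = parent
    return visited, None

# Toy CVS-like store: one history per file, kept even after deletion.
cvs = {}

def cvs_commit(path, content):
    """Append a revision to the file's own history (None marks a removal)."""
    cvs.setdefault(path, []).append(content)

def cvs_find_last_version(path):
    """Go straight to the file's history and return its last live revision."""
    for content in reversed(cvs.get(path, [])):
        if content is not None:
            return content
    return None

# Hypothetical history: a file is added, deleted in the next commit,
# and then 998 unrelated commits follow (1000 commits in total).
head = commit({"keep.txt": "v0", "old/reedsolomon.ebuild": "old contents"})
cvs_commit("keep.txt", "v0")
cvs_commit("old/reedsolomon.ebuild", "old contents")

head = commit({"keep.txt": "v1"}, head)
cvs_commit("keep.txt", "v1")
cvs_commit("old/reedsolomon.ebuild", None)

for i in range(2, 1000):
    head = commit({"keep.txt": f"v{i}"}, head)
    cvs_commit("keep.txt", f"v{i}")

visited, content = git_find_last_version(head, "old/reedsolomon.ebuild")
cvs_content = cvs_find_last_version("old/reedsolomon.ebuild")
kind = objects[head][0]  # direct lookup by hash, no walking needed
```

In this model the git-style search has to visit every commit back to
the one that still contained the file, while the CVS-style search only
touches that one file's history. The last line shows the flip side:
given a hash, the commit is a single dictionary lookup away.
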
Git then looks at the parent commit, then descends both trees doing a
diff at each level. Since everything is content-hashed, only directory
trees that contain differences need to be read. If a commit had changes
to 50 files, it might only take 10 reads to figure out which files
changed, and then another 100 to compare the contents of each file and
generate diffs.

If you wanted to do that in CVS you'd have to read every single file in
the repository and read the sequential history of each file to find any
commits that have the same time/author. CVS commits also aren't atomic
so ordering across files might not be the same.

Git is a thing of beauty when you think about what it was designed to
do and how well-suited its architecture is to that purpose. The same
can be said of several data-driven FOSS applications. The right
algorithm can make a huge difference...

--
Rich
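
P.S. The tree comparison described above can be sketched roughly like
this in Python. Again this is a toy model under my own assumptions,
not git's real implementation or on-disk format: the object store is
a plain dict, and the names (store, entries_of, diff_trees, the
cat*/pkg* layout) are invented for the example. The point is only to
show how identical hashes let whole subtrees be skipped unread.

```python
import hashlib

objects = {}  # hash -> object, standing in for an object database
reads = 0     # how many objects the diff had to fetch

def put(obj):
    """Store obj under the SHA-1 of its repr and return that key."""
    key = hashlib.sha1(repr(obj).encode()).hexdigest()
    objects[key] = obj
    return key

def get(key):
    """Fetch an object by hash, counting the read."""
    global reads
    reads += 1
    return objects[key]

def store(node):
    """Store a nested dict (directory) or str (file); return (kind, key)."""
    if isinstance(node, str):
        return "blob", put(("blob", node))
    entries = tuple(sorted((name, *store(child)) for name, child in node.items()))
    return "tree", put(("tree", entries))

def entries_of(key):
    """Read a tree and return {name: (kind, key)}; None means an empty tree."""
    if key is None:
        return {}
    return {name: (kind, child) for name, kind, child in get(key)[1]}

def diff_trees(old, new, path=""):
    """List the file paths that differ between two tree hashes."""
    if old == new:
        return []  # same hash means same content: nothing below is read
    old_entries, new_entries = entries_of(old), entries_of(new)
    changed = []
    for name in sorted(old_entries.keys() | new_entries.keys()):
        child = f"{path}/{name}" if path else name
        a = old_entries.get(name, (None, None))
        b = new_entries.get(name, (None, None))
        if a == b:
            continue  # unchanged entry, skipped without reading it
        if "blob" in (a[0], b[0]):
            changed.append(child)  # file added, removed or modified
        changed += diff_trees(a[1] if a[0] == "tree" else None,
                              b[1] if b[0] == "tree" else None, child)
    return changed

def make_repo():
    """Hypothetical layout: 10 categories, each with 10 packages."""
    return {
        f"cat{c}": {
            f"pkg{p}": {"Manifest": "m", "pkg.ebuild": f"{c}-{p}"}
            for p in range(10)
        }
        for c in range(10)
    }

old_repo = make_repo()
new_repo = make_repo()
new_repo["cat3"]["pkg7"]["pkg.ebuild"] = "bumped"

_, old_root = store(old_repo)
_, new_root = store(new_repo)

changed = diff_trees(old_root, new_root)
```

With one file changed three levels down, only the two root trees, the
two cat3 trees and the two pkg7 trees get read; the other 99 package
directories are never touched because their hashes match.
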