Bug#908678: Some more thoughts and some tests on the security-tracker git repo
On 2018-11-09 16:05:06, Antoine Beaupré wrote:
> 2. do a crazy filter-branch to send commits to the right
>    files. considering how long an initial clone takes, i can't even
>    begin to imagine how long *that* would take. but it would be the
>    most accurate simulation.
>
> Short of that, I think it's somewhat dishonest to compare a clean
> repository with split files against a repository with history over 14
> years and thousands of commits. Intuitively, I think you're right and
> that "sharding" the data in yearly packets would help a lot with git's
> performance. But we won't know until we simulate it, and if we hit that
> problem again 5 years from now, all that work will have been for
> nothing. (Although it *would* give us 5 years...)

So I've done that crazy filter-branch, on a shallow clone (1000
commits). The original clone is about 30MB, but the split repo is only
4MB.

Cloning the original repo takes a solid 30+ seconds:

[1221]anarcat@curie:src130$ time git clone file://$PWD/security-tracker-1000.orig security-tracker-1000.orig-test
Cloning into 'security-tracker-1000.orig-test'...
remote: Enumerating objects: 5291, done.
remote: Counting objects: 100% (5291/5291), done.
remote: Compressing objects: 100% (1264/1264), done.
remote: Total 5291 (delta 3157), reused 5291 (delta 3157)
Receiving objects: 100% (5291/5291), 8.80 MiB | 19.47 MiB/s, done.
Resolving deltas: 100% (3157/3157), done.
64.35user 0.44system 0:34.32elapsed 188%CPU (0avgtext+0avgdata 200056maxresident)k
0inputs+58968outputs (0major+48449minor)pagefaults 0swaps

Cloning the split repo takes less than a second:

[1223]anarcat@curie:src$ time git clone file://$PWD/security-tracker-1000-filtered security-tracker-1000-filtered-test
Cloning into 'security-tracker-1000-filtered-test'...
remote: Enumerating objects: 2214, done.
remote: Counting objects: 100% (2214/2214), done.
remote: Compressing objects: 100% (1190/1190), done.
remote: Total 2214 (delta 936), reused 2214 (delta 936)
Receiving objects: 100% (2214/2214), 1.25 MiB | 22.78 MiB/s, done.
Resolving deltas: 100% (936/936), done.
0.25user 0.04system 0:00.38elapsed 79%CPU (0avgtext+0avgdata 8200maxresident)k
0inputs+8664outputs (0major+3678minor)pagefaults 0swaps

So this is clearly a win, and I think it would be possible to rewrite
the history using the filter-branch command. Commit IDs would change,
but we would keep all commits, so annotate and all that good stuff
would still work.

The split-by-year bash script was too slow for my purposes: it was
taking a solid 15 seconds for each run, which meant it would have taken
9 *days* to process the entire repository. So I tried to see if this
could be optimized, so we could split the file while keeping history
without having to shut down the whole system for days.

I first rewrote it in Python, which processed the 1000 commits in 801
seconds. This gives an estimate of 15 hours for the 68278 commits I had
locally. Concerned about the Python startup time, I then tried golang,
which processed the tree in 262 seconds, giving a final estimate of 4.8
hours.

Attached are both implementations, for those who want to reproduce my
results. Note that they differ from the original implementation in that
they have to (naturally) remove the data/CVE/list file itself,
otherwise it's kept in history.
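The Python implementation isn't reproduced below (only the golang one
is); as a rough idea, a minimal hypothetical sketch of its per-year
splitting logic could look like this. It is an assumption, not the
actual attached script:

#!/usr/bin/python3
# hypothetical sketch of the split-by-year logic: append each record
# of data/CVE/list to a per-year file, then remove the original so it
# drops out of the rewritten history
import os
import sys

src = sys.argv[1] if len(sys.argv) > 1 else "data/CVE/list"
fds = {}   # one open file per year, to avoid reopening on every line
year = None
with open(src) as f:
    for line in f:
        if line.startswith("CVE-"):
            # header line: switch to the year encoded in the CVE id
            year = line.split("-")[1]
        if year not in fds:
            fds[year] = open("%s.%s" % (src, year), "a")
        fds[year].write(line)
for fd in fds.values():
    fd.close()
os.unlink(src)  # keeping the old file would defeat the filter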
Here's how to call it:

git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year.py data/CVE/list' HEAD

Also observe how all gpg commit signatures are (obviously) lost. I have
explicitly disabled signing because those signatures actually take a
long time to compute...

I haven't tested if a graft would improve performance, but I suspect it
would not, given the sheer size of the repository that would
effectively need to be carried over anyway.

A.

-- 
Man really attains the state of complete humanity when he produces,
without being forced by physical need to sell himself as a commodity.
                        - Ernesto "Che" Guevara

The golang implementation:

package main

import (
	"bufio"
	"bytes"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	// split data/CVE/list into per-year files (data/CVE/list.YYYY),
	// keeping one file descriptor open per year to avoid reopening
	// the target file on every line
	file, err := os.Open("data/CVE/list")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	var (
		line    []byte
		cve     []byte
		year    uint64
		yearStr string
		header  bool
	)
	fds := make(map[uint64]*os.File, 20)
	scanner := bufio.NewReader(file)
	for {
		line, err = scanner.ReadBytes('\n')
		if bytes.HasPrefix(line, []byte("CVE-")) {
			// start of a new record: remember the header line
			// and the year it belongs to
			cve = line
			yearStr = strings.Split(string(line), "-")[1]
			year, _ = strconv.ParseUint(yearStr, 10, 0)
			header = true
		} else {
			// body line: write it (and any pending header)
			// to the file for the current year
			target, ok := fds[year]
			if !ok {
				target, err = os.OpenFile("data/CVE/list."+yearStr, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
				if err != nil {
					log.Fatal(err)
				}
				fds[year] = target
			}
			if header {
				target.Write(cve)
				header = false
			}
			target.Write(line)
		}
		if err != nil {
			break
		}
	}
}
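Presumably the golang variant would be compiled once and invoked the
same way, with the script path swapped for the binary; the file name
and install path below are placeholders, not from the bug report:

$ go build -o /home/anarcat/bin/split-by-year split-by-year.go
$ git -c commit.gpgSign=false filter-branch \
    --tree-filter '/home/anarcat/bin/split-by-year' HEAD

(No argument is needed here, since the golang version hardcodes the
data/CVE/list path.)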
Bug#908678: Some more thoughts and some tests on the security-tracker git repo
On 2018-09-26 14:56:16, Daniel Lange wrote:

[...]

> In any case, a repo with just the split files but no maintained history clones
> in ~12s in the above test setup. It also brings the (bare) repo down from 3.3GB
> to 189MB. So the issue is really the data/CVE/list file.

So I've looked into that problem as well, four months ago:

https://salsa.debian.org/security-tracker-team/security-tracker/issues/2

In there I proposed splitting the data/CVE/list file into "one file per
CVE". In retrospect, that was a rather naive approach and yielded all
sorts of problems: there were so many files that it created problems
even for the shell (argument list too long).

I hadn't thought of splitting things into "one *file* per year". That
could really help! Unfortunately, it's hard to simulate what it would
look like *14 years* from now (yes, that's how old that repo is
already). I can think of two ways to simulate that:

 1. generate commits to recreate all files from scratch: parse
    data/CVE/list, split it up into chunks, and add each CVE in one
    separate commit. it's not *exactly* how things are done now, but it
    should be a close enough approximation (see the sketch after this
    message)

 2. do a crazy filter-branch to send commits to the right
    files. considering how long an initial clone takes, i can't even
    begin to imagine how long *that* would take. but it would be the
    most accurate simulation.

Short of that, I think it's somewhat dishonest to compare a clean
repository with split files against a repository with history over 14
years and thousands of commits. Intuitively, I think you're right and
that "sharding" the data in yearly packets would help a lot with git's
performance. But we won't know until we simulate it, and if we hit that
problem again 5 years from now, all that work will have been for
nothing. (Although it *would* give us 5 years...)

> That said, data/DSA/list is 14575 lines. That seems to not bother git too much
> yet. Still, if things get re-structured, this file may be worth a look, too.

Yeah, I haven't had trouble with that one yet either.

> To me the most reasonable path forward unfortunately looks like starting a new
> repo for 2019+ and "just" importing the split files or single-record files as
> mentioned by pabs, but not the git/svn/cvs history. The old repo would - of
> course - stay around but frozen at a deadline.

In any case, I personally don't think history over those files is that
critical. We rarely dig into that history because it's so expensive...
Any "git annotate" takes forever in this repo, and running it over
data/CVE/list takes tens of minutes.

That said, once we pick a solution, we *could* craft a magic
filter-branch that *would* keep history. It might be worth eating that
performance cost then. I'll run some tests to see if I can make sense
of such a filter.

> Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
> That would reduce the pressure for some time.
> But cgit and other git frontends (as well as backends) we tested also struggle
> with the repo (which is why my company, Faster IT GmbH, used the
> security-tracker repo as a very welcome test case in the first place).
> So that would buy time but not be a solution long(er) term.

Agreed. I think the benefits of hosting on gitlab outweigh the trouble
of re-architecting our datastore. As I said, it's not just gitlab
that's struggling with a 17MB text file: git itself has trouble dealing
with it, and I am often frustrated by that in my work...

A.
-- 
You are absolutely deluded, if not stupid, if you think that a
worldwide collection of software engineers who can't write operating
systems or applications without security holes, can then turn around
and suddenly write virtualization layers without security holes.
                        - Theo de Raadt
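To make simulation approach 1 above concrete, here is a hypothetical
sketch, not code from the bug report. It assumes data/CVE/list is
ordered newest-first, so records are replayed in reverse to mimic how
the history would have grown:

#!/usr/bin/python3
# hypothetical sketch of simulation approach 1: replay data/CVE/list
# as one commit per CVE record against per-year files
import subprocess

records, record = [], []
with open("data/CVE/list") as f:
    for line in f:
        if line.startswith("CVE-") and record:
            # a new header line closes the previous record
            records.append(record)
            record = []
        record.append(line)
if record:
    records.append(record)

# replay oldest-first, one commit per CVE record
for record in reversed(records):
    cve_id = record[0].split(None, 1)[0]
    year = cve_id.split("-")[1]
    with open("data/CVE/list.%s" % year, "a") as out:
        out.writelines(record)
    subprocess.run(["git", "add", "data/CVE/list." + year], check=True)
    subprocess.run(["git", "commit", "-q", "-m", "add " + cve_id],
                   check=True)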
Re: Bug#908678: Some more thoughts and some tests on the security-tracker git repo
Hi,

[not contributing ideas right now, just adding one datapoint that is
important to me to the discussion]

On Wed, Sep 26, 2018 at 03:15:14PM +0200, Guido Günther wrote:
> Not necessarily. Maybe a graft would do:
>
>     https://developer.atlassian.com/blog/2015/08/grafting-earlier-history-with-git/
>
> This is IMHO preferable over history rewrites. I've used this to tie
> histories in the past. I've not used "git replace" though, but
> .git/info/grafts.

FWIW on this point: for the security team members' workflows, it is a
quite important aspect (even if admittedly it can be slow) to have
access to the history of commits while working on their own checkouts.
So that is a feature that any split-up work should take into account,
whether in a history-rewrite scenario, via a graft as mentioned above,
or through other possibilities which will arise.

Thank you!

Regards,
Salvatore
Bug#908678: Some more thoughts and some tests on the security-tracker git repo
Hi,

On Wed, Sep 26, 2018 at 01:56:16PM +0200, Daniel Lange wrote:
> The main issue is that we need to get clone and diff+render operations
> back into normal time frames. The salsa workers (e.g. to render a
> diff) time out after 60s. Similar time constraints are put onto other

I wonder why that is, since "git diff" is pretty fast on a local
checkout. Did we ask the gitlab folks about it?

[..snip..]

> Just splitting the file will not do. We need to (unfortunately)
> somehow "get rid" of the history (delta-resolution) walks in git:

Not necessarily. Maybe a graft would do:

    https://developer.atlassian.com/blog/2015/08/grafting-earlier-history-with-git/

This is IMHO preferable over history rewrites. I've used this to tie
histories in the past. I've not used "git replace" though, but
.git/info/grafts.

Cheers,
 -- Guido
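To sketch what this graft could look like in practice for the tracker
(the repository URL, ref name, and the $NEW_ROOT/$OLD_TIP commit ids
below are placeholders, not actual values):

# fetch the frozen pre-split history into the new repo as a spare ref
$ git fetch https://.../security-tracker-old.git master:refs/old/master

# newer mechanism: declare the old tip as parent of the new root commit
$ git replace --graft $NEW_ROOT $OLD_TIP

# or the legacy mechanism mentioned above: one "<commit> <parent>..."
# line per grafted commit in .git/info/grafts
$ echo "$NEW_ROOT $OLD_TIP" >> .git/info/grafts

Either way the stitched history stays local: replace refs and grafts
are not cloned by default, which is what keeps the published repo
small.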
Bug#908678: Some more thoughts and some tests on the security-tracker git repo
The main issue is that we need to get clone and diff+render operations
back into normal time frames. The salsa workers (e.g. to render a
diff) time out after 60s. Similar time constraints are put onto other
rendering front-ends. Actually, you can easily get Apache to segfault
if you do not time-constrain cgi/fcgi type processes. But that's out of
scope here. Back on topic:

Just splitting the file will not do. We need to (unfortunately)
somehow "get rid" of the history (delta-resolution) walks in git:

# test setup limits: Network bw: 200 MBit, client system: 4 cores
$ time git clone https://.../debian_security_security-tracker
Cloning into 'debian_security_security-tracker'...
remote: Counting objects: 334274, done.
remote: Compressing objects: 100% (67288/67288), done.
remote: Total 334274 (delta 211939), reused 329399 (delta 208905)
Receiving objects: 100% (334274/334274), 165.46 MiB | 21.93 MiB/s, done.
Resolving deltas: 100% (211939/211939), done.

real    14m13.159s
user    27m23.980s
sys     0m17.068s

# Run the tool already available to split the main CVE/list
# file into annual files. Thanks Raphael Geissert!
$ bin/split-by-year

# remove the old big CVE/list file
$ git rm data/CVE/list

# get the new files into git
$ git add data/CVE/list.*

$ git commit --all
[master a06d3446ca] Remove list and commit bin/split-by-year results
 21 files changed, 342414 insertions(+), 342414 deletions(-)
 delete mode 100644 data/CVE/list
 create mode 100644 data/CVE/list.1999
 create mode 100644 data/CVE/list.2000
 create mode 100644 data/CVE/list.2001
 create mode 100644 data/CVE/list.2002
 create mode 100644 data/CVE/list.2003
 create mode 100644 data/CVE/list.2004
 create mode 100644 data/CVE/list.2005
 create mode 100644 data/CVE/list.2006
 create mode 100644 data/CVE/list.2007
 create mode 100644 data/CVE/list.2008
 create mode 100644 data/CVE/list.2009
 create mode 100644 data/CVE/list.2010
 create mode 100644 data/CVE/list.2011
 create mode 100644 data/CVE/list.2012
 create mode 100644 data/CVE/list.2013
 create mode 100644 data/CVE/list.2014
 create mode 100644 data/CVE/list.2015
 create mode 100644 data/CVE/list.2016
 create mode 100644 data/CVE/list.2017
 create mode 100644 data/CVE/list.2018

# this one is fast:
$ git push

# create a new clone
$ time git clone https://.../debian_security_security-tracker_split_files test-clone
Cloning into 'test-clone'...
remote: Counting objects: 334298, done.
remote: Compressing objects: 100% (67312/67312), done.
remote: Total 334298 (delta 211943), reused 329399 (delta 208905)
Receiving objects: 100% (334298/334298), 168.91 MiB | 21.28 MiB/s, done.
Resolving deltas: 100% (211943/211943), done.

real    14m35.444s
user    27m45.500s
sys     0m21.100s

--> so splitting alone doesn't help. Git is not clever enough to avoid
running through the deltas of files that are not going to be checked
out. git 2.18's git2 wire protocol could be used with server-side
filtering, but that's an awful hack. Telling people to git clone
--depth 1 (shallow) like Guido advises is easier and more reliable for
the clone use-case. For the original repo that will take ~1.5s, for a
split-by-year repo ~0.2s.

There are tools to split git files and keep the history, e.g.
https://github.com/potherca-bash/git-split-file
but we'd need (to create) one that also zaps the old deltas. So really
"rewrite history", as the git folks tend to call this. git
filter-branch can do this.
But it would get somewhat complex and murky with commits that span
CVE/list-year and list-year+1; there are at least 21 of those for
2018+2017, 19 for 2017+2016 and ~10 for previous year combos. So I
wouldn't put too much effort into that path.

In any case, a repo with just the split files but no maintained history
clones in ~12s in the above test setup. It also brings the (bare) repo
down from 3.3GB to 189MB. So the issue is really the data/CVE/list
file.

That said, data/DSA/list is 14575 lines. That seems to not bother git
too much yet. Still, if things get re-structured, this file may be
worth a look, too.

To me the most reasonable path forward unfortunately looks like
starting a new repo for 2019+ and "just" importing the split files or
single-record files as mentioned by pabs, but not the git/svn/cvs
history. The old repo would - of course - stay around but frozen at a
deadline.

Corsac also mentioned on IRC that the repo could be hosted outside of
Gitlab. That would reduce the pressure for some time. But cgit and
other git frontends (as well as backends) we tested also struggle with
the repo (which is why my company, Faster IT GmbH, used the
security-tracker repo as a very welcome test case in the first place).
So that would buy time but not be a solution long(er) term.
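For reference, the shallow-clone workflow recommended above would look
roughly like this (the URL is a placeholder):

# fetch only the latest commit (~1.5s on the original repo,
# ~0.2s on a split-by-year one, per the numbers above)
$ git clone --depth 1 https://.../security-tracker.git

# deepen later if some history turns out to be needed
$ git fetch --deepen 1000

# or fetch the full history after all
$ git fetch --unshallow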