Bug#908678: Some more thoughts and some tests on the security-tracker git repo

2018-11-09 Thread Antoine Beaupré
On 2018-11-09 16:05:06, Antoine Beaupré wrote:
>  2. do a crazy filter-branch to send commits to the right
> files. considering how long an initial clone takes, i can't even
> begin to imagine how long *that* would take. but it would be the
> most accurate simulation.
>
> Short of that, I think it's somewhat dishonest to compare a clean
> repository with split files against a repository with 14 years of
> history and thousands of commits. Intuitively, I think you're right
> that "sharding" the data in yearly packets would help git's
> performance a lot. But we won't know until we simulate it, and if we
> hit that problem again 5 years from now, all that work will have been
> for nothing. (Although it *would* give us 5 years...)

So I've done that crazy filter-branch, on a shallow clone (1000
commits). The original clone is about 30MB, but the split repo is only
4MB.

Cloning the original repo takes a solid 30+ seconds:

[1221]anarcat@curie:src130$ time git clone file://$PWD/security-tracker-1000.orig security-tracker-1000.orig-test
Cloning into 'security-tracker-1000.orig-test'...
remote: Enumerating objects: 5291, done.
remote: Counting objects: 100% (5291/5291), done.
remote: Compressing objects: 100% (1264/1264), done.
remote: Total 5291 (delta 3157), reused 5291 (delta 3157)
Receiving objects: 100% (5291/5291), 8.80 MiB | 19.47 MiB/s, done.
Resolving deltas: 100% (3157/3157), done.
64.35user 0.44system 0:34.32elapsed 188%CPU (0avgtext+0avgdata 200056maxresident)k
0inputs+58968outputs (0major+48449minor)pagefaults 0swaps

Cloning the split repo takes less than a second:

[1223]anarcat@curie:src$ time git clone file://$PWD/security-tracker-1000-filtered security-tracker-1000-filtered-test
Cloning into 'security-tracker-1000-filtered-test'...
remote: Enumerating objects: 2214, done.
remote: Counting objects: 100% (2214/2214), done.
remote: Compressing objects: 100% (1190/1190), done.
remote: Total 2214 (delta 936), reused 2214 (delta 936)
Receiving objects: 100% (2214/2214), 1.25 MiB | 22.78 MiB/s, done.
Resolving deltas: 100% (936/936), done.
0.25user 0.04system 0:00.38elapsed 79%CPU (0avgtext+0avgdata 8200maxresident)k
0inputs+8664outputs (0major+3678minor)pagefaults 0swaps

So this is clearly a win, and I think it would be possible to rewrite
the history using the filter-branch command. Commit IDs would change,
but we would keep all commits, so annotate and all that good stuff
would still work.

The split-by-year bash script was too slow for my purposes: it was
taking a solid 15 seconds for each run, which meant it would have taken
9 *days* to process the entire repository.

So I tried to see if this could be optimized, so that we could split
the file while keeping history without having to shut down the whole
system for days. I first rewrote it in Python, which processed the 1000
commits in 801 seconds. This gives an estimate of 15 hours for the
68278 commits I had locally. Concerned about the Python startup time, I
then tried golang, which processed the tree in 262 seconds, giving a
final estimate of 4.8 hours.

Attached are both implementations, for those who want to reproduce my
results. Note that they differ from the original implementation in that
they (naturally) have to remove the data/CVE/list file itself,
otherwise it is kept in history.

Here's how to call it:

git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year.py data/CVE/list' HEAD

Also observe how all gpg commit signatures are (obviously) lost. I have
explicitly disabled signing because the signatures actually take a long
time to compute...

I haven't tested if a graft would improve performance, but I suspect it
would not, given the sheer size of the repository that would effectively
need to be carried over anyway.
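
(If someone does want to experiment with that, a graft can be emulated
with a replace ref these days, along the lines of:

git replace --graft <some-recent-commit>

where the cut-off commit is just a placeholder; the objects behind the
cut-off still have to be carried around, which is why I doubt it would
help.)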

A.

-- 
Man really attains the state of complete humanity when he produces,
without being forced by physical need to sell himself as a commodity.
- Ernesto "Che" Guevara
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	file, err := os.Open("data/CVE/list")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	var (
		line    []byte
		cve     []byte
		year    uint64
		yearStr string
		header  bool
	)
	// one output file per year, e.g. data/CVE/list.2018
	fds := make(map[uint64]*os.File, 20)
	scanner := bufio.NewReader(file)
	for {
		line, err = scanner.ReadBytes('\n')

		if bytes.HasPrefix(line, []byte("CVE-")) {
			// remember the CVE header line; it is written out
			// lazily, once the first continuation line tells us
			// which year file it belongs to
			cve = line
			yearStr = strings.Split(string(line), "-")[1]
			year, _ = strconv.ParseUint(yearStr, 10, 0)
			header = true
		} else {
			// open (or reuse) the per-year output file
			target, ok := fds[year]
			if !ok {
				var oerr error
				target, oerr = os.OpenFile("data/CVE/list."+yearStr, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
				if oerr != nil {
					log.Fatal(oerr)
				}
				fds[year] = target
			}
			if header {
				target.Write(cve)
				header = false
			}
			target.Write(line)
		}
		if err != nil {
			if err != io.EOF {
				log.Fatal(err)
			}
			break
		}
	}

	// close the per-year files and drop the original file so that it
	// does not survive in the rewritten history
	for _, fd := range fds {
		fd.Close()
	}
	os.Remove("data/CVE/list")
}
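
For reference, the Go version has to be compiled before it can be used
as a tree filter. Assuming the source above is saved as split-by-year.go
and built to a path like bin/split-by-year (both names are only
examples), the call mirrors the Python one:

go build -o /home/anarcat/src/security-tracker/bin/split-by-year split-by-year.go
git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year data/CVE/list' HEAD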

Bug#908678: Some more thoughts and some tests on the security-tracker git repo

2018-11-09 Thread Antoine Beaupré
On 2018-09-26 14:56:16, Daniel Lange wrote:

[...]

> In any case, a repo with just the split files but no maintained history clones
> in ~12s in the above test setup. It also brings the (bare) repo down from
> 3,3GB to 189MB. So the issue is really the data/CVE/list file.

So I've looked into that problem as well, four months ago:

https://salsa.debian.org/security-tracker-team/security-tracker/issues/2

In there I proposed splitting the data/CVE/list file into "one file per
CVE". In retrospect, that was a rather naive approach that yielded all
sorts of problems: there were so many files that it created trouble even
for the shell (argument list too long).

I hadn't thought of splitting things into "one *file* per year". That
could really help! Unfortunately, it's hard to simulate what it would
look like *14 years* from now (yes, that's how old that repo is
already).

I can think of two ways to simulate that:

 1. generate commits to recreate all files from scratch: parse
data/CVE/list, split it up into chunks, and add each CVE in one
separate commit. it's not *exactly* how things are done now, but it
should be a close enough approximation

 2. do a crazy filter-branch to send commits to the right
files. considering how long an initial clone takes, i can't even
begin to imagine how long *that* would take. but it would be the
most accurate simulation.

Short of that, I think it's somewhat dishonest to compare a clean
repository with split files against a repository with 14 years of
history and thousands of commits. Intuitively, I think you're right
that "sharding" the data in yearly packets would help git's performance
a lot. But we won't know until we simulate it, and if we hit that
problem again 5 years from now, all that work will have been for
nothing. (Although it *would* give us 5 years...)

> That said, data/DSA/list is 14575 lines. That seems to not bother git too much
> yet. Still if things get re-structured, this file may be worth a look, too.

Yeah, I haven't had trouble with that one yet either.

> To me the most reasonable path forward unfortunately looks like start a new
> repo for 2019+ and "just" import the split files or single-record files as
> mentioned by pabs but not the git/svn/cvs history. The old repo would - of
> course - stay around but frozen at a deadline.

In any case, I personally don't think the history of those files is
that critical. We rarely dig into that history because it's so
expensive... Any "git annotate" takes forever in this repo, and running
it over data/CVE/list takes tens of minutes.

That said, once we pick a solution, we *could* craft a magic
filter-branch that *would* keep history. It might be worth eating that
performance cost then. I'll run some tests to see if I can make sense of
such a filter.
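
For the record, what I have in mind is roughly something along these
lines, where split-by-year would be a (hypothetical, still to be
written) script that rewrites data/CVE/list into per-year files:

git filter-branch --tree-filter 'split-by-year data/CVE/list' HEAD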

> Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
> That would reduce the pressure for some time.
> But cgit and other git frontends (as well as backends) we tested also struggle
> with the repo (which is why my company, Faster IT GmbH, used the
> security-tracker repo as a very welcome test case in the first place).
> So that would buy time but not be a solution long(er) term.

Agreed. I think the benefits of hosting on gitlab outweigh the trouble
of rearchitecting our datastore. As I said, it's not just gitlab that's
struggling with a 17MB text file: git itself has trouble dealing with it
as well, and I am often frustrated by that in my work...

A.

-- 
You are absolutely deluded, if not stupid, if you think that a
worldwide collection of software engineers who can't write operating
systems or applications without security holes, can then turn around
and suddenly write virtualization layers without security holes.
- Theo de Raadt