Bug#908678: Some more thoughts and some tests on the security-tracker git repo

2018-11-09 Thread Antoine Beaupré
On 2018-11-09 16:05:06, Antoine Beaupré wrote:
>  2. do a crazy filter-branch to send commits to the right
> files. Considering how long an initial clone takes, I can't even
> begin to imagine how long *that* would take, but it would be the
> most accurate simulation.
>
> Short of that, I think it's somewhat dishonest to compare a clean
> repository with split files against a repository with history over 14
> years and thousands of commits. Intuitively, I think you're right and
> that "sharding" the data in yearly packets would help a lot with git's
> performance. But we won't know until we simulate it, and if we hit that
> problem again 5 years from now, all that work will have been for
> nothing. (Although it *would* give us 5 years...)

So I've done that crazy filter-branch, on a shallow clone (1000
commits). The original clone is about 30MB, but the split repo is only
4MB.
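
For reference, such a truncated test repository can be recreated with a
shallow clone, something along these lines (the directory name is the one
I used locally):

git clone --depth 1000 https://salsa.debian.org/security-tracker-team/security-tracker.git security-tracker-1000.orig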

Cloning the original repo takes a solid 30+ seconds:

[1221]anarcat@curie:src130$ time git clone file://$PWD/security-tracker-1000.orig security-tracker-1000.orig-test
Cloning into 'security-tracker-1000.orig-test'...
remote: Enumerating objects: 5291, done.
remote: Counting objects: 100% (5291/5291), done.
remote: Compressing objects: 100% (1264/1264), done.
remote: Total 5291 (delta 3157), reused 5291 (delta 3157)
Receiving objects: 100% (5291/5291), 8.80 MiB | 19.47 MiB/s, done.
Resolving deltas: 100% (3157/3157), done.
64.35user 0.44system 0:34.32elapsed 188%CPU (0avgtext+0avgdata 200056maxresident)k
0inputs+58968outputs (0major+48449minor)pagefaults 0swaps

Cloning the split repo takes less than a second:

[1223]anarcat@curie:src$ time git clone file://$PWD/security-tracker-1000-filtered security-tracker-1000-filtered-test
Cloning into 'security-tracker-1000-filtered-test'...
remote: Enumerating objects: 2214, done.
remote: Counting objects: 100% (2214/2214), done.
remote: Compressing objects: 100% (1190/1190), done.
remote: Total 2214 (delta 936), reused 2214 (delta 936)
Receiving objects: 100% (2214/2214), 1.25 MiB | 22.78 MiB/s, done.
Resolving deltas: 100% (936/936), done.
0.25user 0.04system 0:00.38elapsed 79%CPU (0avgtext+0avgdata 8200maxresident)k
0inputs+8664outputs (0major+3678minor)pagefaults 0swaps

So this is clearly a win, and I think it would be possible to rewrite
the history using the filter-branch command. Commit IDs would change,
but we would keep all commits, so annotate and all that good stuff
would still work.
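
For instance, after the rewrite, something like:

git annotate data/CVE/list.2018

should still show per-line history in the test repo, just against the
new commit IDs.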

The split-by-year bash script was too slow for my purposes: it was
taking a solid 15 seconds for each run, which meant it would have taken
9 *days* to process the entire repository.

So I tried to see whether this could be optimized, so that we could
split the file while keeping history without having to shut down the
whole system for days. I first rewrote the script in Python, which
processed the 1000 commits in 801 seconds. This gives an estimate of 15
hours for the 68278 commits I had locally. Concerned about the Python
startup time, I then tried Go, which processed the tree in 262 seconds,
for a final estimate of 4.8 hours.

Attached are both implementations, for those who want to reproduce my
results. Note that they differ from the original implementation in that
they (naturally) have to remove the data/CVE/list file itself, otherwise
it is kept in history.

Here's how to call it:

git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year.py data/CVE/list' HEAD

Also observe how all gpg commit signatures are (obviously) lost. I have
explicitly disabled signing here because those signatures actually take
a long time to compute...
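
For the Go variant, the invocation would presumably look the same, with
the tree-filter pointing at a compiled binary instead (the file name and
paths below are just an example):

go build -o /home/anarcat/bin/split-by-year-go split-by-year.go
git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/bin/split-by-year-go' HEAD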

I haven't tested whether a graft would improve performance, but I
suspect it would not, given the sheer size of the repository that would
effectively need to be carried over anyway.

A.

-- 
Man really attains the state of complete humanity when he produces,
without being forced by physical need to sell himself as a commodity.
- Ernesto "Che" Guevara
// Split data/CVE/list into per-year files (data/CVE/list.YYYY) and remove
// the original file, so this can be used as a git filter-branch tree-filter.
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	file, err := os.Open("data/CVE/list")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	var (
		line     []byte
		cve      []byte
		year     uint64
		year_str string
		target   *os.File
		header   bool
		ok       bool
	)
	// one output file per year, kept open across the whole run
	fds := make(map[uint64]*os.File, 20)
	scanner := bufio.NewReader(file)
	for {
		line, err = scanner.ReadBytes('\n')

		if bytes.HasPrefix(line, []byte("CVE-")) {
			// remember the CVE header line and extract its year
			cve = line
			year_str = strings.Split(string(line), "-")[1]
			year, _ = strconv.ParseUint(year_str, 10, 64)
			header = true
		} else {
			// open (or reuse) the per-year output file
			if target, ok = fds[year]; !ok {
				target, err = os.OpenFile("data/CVE/list."+year_str, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
				if err != nil {
					log.Fatal(err)
				}
				fds[year] = target
			}
			// write the pending CVE header before its first detail line
			if header {
				target.Write(cve)
				header = false
			}
			target.Write(line)
		}
		if err != nil {
			// io.EOF terminates the loop; anything else is fatal
			if err != io.EOF {
				log.Fatal(err)
			}
			break
		}
	}

	// close the per-year files and drop the original file so it is not
	// kept in the rewritten history
	for _, fd := range fds {
		fd.Close()
	}
	if err := os.Remove("data/CVE/list"); err != nil {
		log.Fatal(err)
	}
}

Bug#908678: Some more thoughts and some tests on the security-tracker git repo

2018-11-09 Thread Antoine Beaupré
On 2018-09-26 14:56:16, Daniel Lange wrote:

[...]

> In any case, a repo with just the split files but no maintained history clones
> in ~12s in the above test setup. It also brings the (bare) repo down from
> 3.3GB to 189MB. So the issue is really the data/CVE/list file.

So I've looked into that problem as well, four months ago:

https://salsa.debian.org/security-tracker-team/security-tracker/issues/2

In there I proposed splitting the data/CVE/list file into "one file per
CVE". In retrospect, that was a rather naive approach and yielded all
sorts of problems: there were so many files that it caused trouble even
for the shell (argument list too long).
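
To give an idea: with that many data/CVE/CVE-* files, even a plain shell
glob such as

ls data/CVE/CVE-*

can blow past the kernel's argument-length limit and fail with "Argument
list too long".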

I hadn't thought of splitting things into "one *file* per year". That
could really help! Unfortunately, it's hard to simulate what it would
look like *14 years* from now (yes, that's how old that repo is
already).

I can think of two ways to simulate that:

 1. generate commits to recreate all files from scratch: parse
data/CVE/list, split it up into chunks, and add each CVE in one
separate commit. It's not *exactly* how things are done now, but it
should be a close enough approximation

 2. do a crazy filter-branch to send commits to the right
files. Considering how long an initial clone takes, I can't even
begin to imagine how long *that* would take, but it would be the
most accurate simulation.

Short of that, I think it's somewhat dishonest to compare a clean
repository with split files against a repository with history over 14
years and thousands of commits. Intuitively, I think you're right and
that "sharding" the data in yearly packets would help a lot with git's
performance. But we won't know until we simulate it, and if we hit that
problem again 5 years from now, all that work will have been for
nothing. (Although it *would* give us 5 years...)

> That said, data/DSA/list is 14575 lines. That does not seem to bother git too much
> yet. Still, if things get re-structured, this file may be worth a look, too.

Yeah, I haven't had trouble with that one yet either.

> To me the most reasonable path forward unfortunately looks like starting a new
> repo for 2019+ and "just" importing the split files or single-record files as
> mentioned by pabs, but not the git/svn/cvs history. The old repo would - of
> course - stay around but frozen at a deadline.

In any case, I personally don't think history over those files is that
critical. We rarely dig into that history because it's so
expensive... Any "git annotate" takes forever in this repo, and running
it over data/CVE/list takes tens of minutes.
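
That is easy to measure on a full clone, e.g.:

time git annotate data/CVE/list > /dev/null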

That said, once we pick a solution, we *could* craft a magic
filter-branch that *would* keep history. It might be worth eating that
performance cost then. I'll run some tests to see if I can make sense of
such a filter.

> Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
> That would reduce the pressure for some time.
> But cgit and other git frontends (as well as backends) we tested also struggle
> with the repo (which is why my company, Faster IT GmbH, used the
> security-tracker repo as a very welcome test case in the first place).
> So that would buy time but not be a solution long(er) term.

Agreed. I think the benefits of hosting on gitlab outweigh the trouble
of rearchitecting our datastore. As I said, it's not just gitlab
that's struggling with a 17MB text file: git itself has trouble dealing
with it as well, and I am often frustrated by that in my work...

A.

-- 
You are absolutely deluded, if not stupid, if you think that a
worldwide collection of software engineers who can't write operating
systems or applications without security holes, can then turn around
and suddenly write virtualization layers without security holes.
- Theo de Raadt



Re: Bug#908678: Some more thoughts and some tests on the security-tracker git repo

2018-09-27 Thread Salvatore Bonaccorso
Hi,

[not contributing right now with ideas, just adding one datapoint to
the discussion that is important to me]

On Wed, Sep 26, 2018 at 03:15:14PM +0200, Guido Günther wrote:
> Not necessarily. Maybe a graft would do:
> 
> 
> https://developer.atlassian.com/blog/2015/08/grafting-earlier-history-with-git/
> 
> This is IMHO preferable over history rewrites. I've used this to tie
> histories in the past. I've not used "git replace" though but
> .git/info/grafts.

FWIW on this point: for the security team members' workflows, it is
quite an important aspect (even if it can admittedly be slow) to have
access to the history of commits while working on their own
checkouts. So that is a feature that should be considered in any
split-up work, either in a rewrite-history situation, or as mentioned
above, or in other possibilities which will arise.

Thank you!

Regards,
Salvatore



Bug#908678: Some more thoughts and some tests on the security-tracker git repo

2018-09-26 Thread Guido Günther
Hi,
On Wed, Sep 26, 2018 at 01:56:16PM +0200, Daniel Lange wrote:
> The main issue is that we need to get clone and diff+render operations
> back into normal time frames. The salsa workers (e.g. to render a
> diff) time out after 60s. Similar time constraints are put onto other

I wonder why that is since "git diff" is pretty fast on a local
checkout. Did we ask the gitlab folks about it?

[..snip..]
> Just splitting the file will not do. We need to (unfortunately)
> somehow "get rid" of the history (delta-resolution) walks in git:

Not necessarily. Maybe a graft would do:


https://developer.atlassian.com/blog/2015/08/grafting-earlier-history-with-git/

This is IMHO preferable over history rewrites. I've used this to tie
histories in the past. I've not used "git replace" though but
.git/info/grafts.
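Roughly, the idea would be to keep the current repo around read-only,
start the new (split) repo without the old history, and let people who
want the history tie the two together locally, something like this
(URLs and commit IDs are placeholders):

# fetch the frozen history into the new clone under a separate ref
git fetch <url-of-old-repo> master:refs/heads/old-history
# new-style graft: pretend the new repo's root commit has the old tip as parent
git replace --graft <root-commit-of-new-repo> old-history
# or, old-style, with full SHA-1s in .git/info/grafts:
echo "<root-commit-sha1> <old-history-tip-sha1>" >> .git/info/grafts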

Cheers,
 -- Guido



Bug#908678: Some more thoughts and some tests on the security-tracker git repo

2018-09-26 Thread Daniel Lange
The main issue is that we need to get clone and diff+render operations
back into normal time frames. The salsa workers (e.g. to render a
diff) time out after 60s. Similar time constraints are put onto other
rendering front-ends. Actually you can easily get Apache to segfault
if you do not time-constrain cgi/fcgi type processes.
But that's out of scope here.

Back on topic:

Just splitting the file will not do. We need to (unfortunately)
somehow "get rid" of the history (delta-resolution) walks in git:

# test setup limits: Network bw: 200 MBit, client system: 4 core

$ time git clone https://.../debian_security_security-tracker
Cloning into 'debian_security_security-tracker'...
remote: Counting objects: 334274, done.
remote: Compressing objects: 100% (67288/67288), done.
remote: Total 334274 (delta 211939), reused 329399 (delta 208905)
Receiving objects: 100% (334274/334274), 165.46 MiB | 21.93 MiB/s, done.
Resolving deltas: 100% (211939/211939), done.

real    14m13.159s
user    27m23.980s
sys     0m17.068s

# Run the tool already available to split the main CVE/list
# file into annual files. Thanks Raphael Geissert!
$ bin/split-by-year

# remove the old big CVE/list file
$ git rm data/CVE/list

# get the new files into git
$ git add data/CVE/list.*
$ git commit --all
[master a06d3446ca] Remove list and commit bin/split-by-year results
 21 files changed, 342414 insertions(+), 342414 deletions(-)
 delete mode 100644 data/CVE/list
 create mode 100644 data/CVE/list.1999
 create mode 100644 data/CVE/list.2000
 create mode 100644 data/CVE/list.2001
 create mode 100644 data/CVE/list.2002
 create mode 100644 data/CVE/list.2003
 create mode 100644 data/CVE/list.2004
 create mode 100644 data/CVE/list.2005
 create mode 100644 data/CVE/list.2006
 create mode 100644 data/CVE/list.2007
 create mode 100644 data/CVE/list.2008
 create mode 100644 data/CVE/list.2009
 create mode 100644 data/CVE/list.2010
 create mode 100644 data/CVE/list.2011
 create mode 100644 data/CVE/list.2012
 create mode 100644 data/CVE/list.2013
 create mode 100644 data/CVE/list.2014
 create mode 100644 data/CVE/list.2015
 create mode 100644 data/CVE/list.2016
 create mode 100644 data/CVE/list.2017
 create mode 100644 data/CVE/list.2018

# this one is fast:
$ git push

# create a new clone
$ time git clone https://.../debian_security_security-tracker_split_files test-clone
Cloning into 'test-clone'...
remote: Counting objects: 334298, done.
remote: Compressing objects: 100% (67312/67312), done.
remote: Total 334298 (delta 211943), reused 329399 (delta 208905)
Receiving objects: 100% (334298/334298), 168.91 MiB | 21.28 MiB/s, done.
Resolving deltas: 100% (211943/211943), done.

real    14m35.444s
user    27m45.500s
sys     0m21.100s

--> so splitting alone doesn't help. Git is not clever enough to skip
the deltas of files that will not be checked out.

Git 2.18's wire protocol v2 could be used with server-side filtering
but that's an awful hack. Telling people to

git clone --depth 1 #(shallow)

like Guido advises is easier and more reliable for the clone use-case.
For the original repo that will take ~1.5s, for a split-by-year repo ~0.2s.
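
That is, with the same test setup and the same (elided) URLs as above,
those numbers presumably come from something like:

$ time git clone --depth 1 https://.../debian_security_security-tracker shallow-test
$ time git clone --depth 1 https://.../debian_security_security-tracker_split_files shallow-test-split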

There are tools to split git files and keep the history,
e.g. https://github.com/potherca-bash/git-split-file
but we'd need (to create) one that also zaps the old deltas.
So really "rewrite history", as the git folks tend to call this.
git filter-branch can do this. But it would get somewhat complex and murky
with commits that span CVE/list.<year> and list.<year+1>, of which there are
at least 21 for 2018+2017, 19 for 2017+2016 and ~10 for previous year combos.
So I wouldn't put too much effort into that path.

In any case, a repo with just the split files but no maintained history clones
in ~12s in the above test setup. It also brings the (bare) repo down from 3.3GB
to 189MB. So the issue is really the data/CVE/list file.

That said, data/DSA/list is 14575 lines. That does not seem to bother git too much
yet. Still, if things get re-structured, this file may be worth a look, too.

To me the most reasonable path forward unfortunately looks like starting a new repo
for 2019+ and "just" importing the split files or single-record files as mentioned
by pabs, but not the git/svn/cvs history. The old repo would - of course - stay
around but frozen at a deadline.

Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
That would reduce the pressure for some time.
But cgit and other git frontends (as well as backends) we tested also struggle
with the repo (which is why my company, Faster IT GmbH, used the
security-tracker repo as a very welcome test case in the first place).
So that would buy time but not be a solution long(er) term.