Bug#908678: Testing the filter-branch scripts

2018-11-14 Thread Salvatore Bonaccorso
Hi,

On Wed, Nov 14, 2018 at 07:45:59PM +0100, Moritz Muehlenhoff wrote:
> On Wed, Nov 14, 2018 at 07:34:03AM +0100, Daniel Lange wrote:
> > On 13.11.18 at 23:09, Moritz Muehlenhoff wrote:
> > > The current data structure works very well for us and splitting the files
> > > has many downsides.
> > 
> > Could you detail what those many downsides are besides the scripts that
> > need to be amended?
> 
> Nearly all the tasks of actually editing the data require a look at the
> complete data, e.g. to check whether something was tracked before, whether
> there's an ITP for something, whether something was tracked as NFU in the
> past and lots more.

Agreed from my point of view as well: the history is, and contains,
valuable data, and we do not want to lose it, even if researching older
items and past changes takes time. You will even see that over time
people started to put more information into their commits, giving
rationales, notes, and additional information.

And if all of that turns out to be too much hassle for the salsa
infrastructure, we could (or would need to) move the repository
somewhere else, with the unfortunate downside for contributors from the
wider community. But admittedly the number of people contributing
regularly is manageable.

That said, I fully agree that initial clones of the repo are a problem.
It would also be interesting to see what git upstream thinks about this
use case and about #913124, raised by Guido.

Regards,
Salvatore



Bug#908678: Testing the filter-branch scripts

2018-11-14 Thread Holger Levsen
On Wed, Nov 14, 2018 at 07:45:59PM +0100, Moritz Muehlenhoff wrote:
> Nearly all the tasks of actually editing the data require a look at the
> complete data, e.g. to check whether something was tracked before, whether
> there's an ITP for something, whether something was tracked as NFU in the
> past and lots more.

according to git log, the data goes back to 2004. Do you really need all
those 15 years of history or could we maybe make a yearly split for
(now) the first 10 years and have the last 5 years in "one"?

And then when we move into 2019 we would move 2014 into the yearly files
(which then cover the first 11 years), and so on... the same in 2020 with
2015...
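
The rolling scheme could be sketched roughly like this in Python. It is
only an illustration: target_file and its window parameter are made-up
names, and it assumes entries are assigned by the year in their CVE
identifier, which the proposal above does not spell out.

```python
def target_file(cve_year, current_year, window=5):
    """Pick the file an entry would live in under a rolling yearly split.

    Hypothetical helper: the most recent `window` years stay together in
    data/CVE/list, anything older goes to its own data/CVE/list.<year>.
    """
    oldest_combined = current_year - window + 1
    if cve_year >= oldest_combined:
        return "data/CVE/list"
    return "data/CVE/list.{:d}".format(cve_year)


# In 2018 the combined file would cover 2014-2018 ...
print(target_file(2013, 2018))  # data/CVE/list.2013
print(target_file(2014, 2018))  # data/CVE/list
# ... and once we are in 2019, 2014 moves out into its own file.
print(target_file(2014, 2019))  # data/CVE/list.2014
```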

IMHO we should do something, otherwise dealing with security-tracker.git
will be even more cumbersome 5 or 10 years from now.


-- 
cheers,
Holger

---
   holger@(debian|reproducible-builds|layer-acht).org
   PGP fingerprint: B8BF 5413 7B09 D35C F026 FE9D 091A B856 069A AA1C




Bug#908678: Testing the filter-branch scripts

2018-11-14 Thread Moritz Muehlenhoff
On Wed, Nov 14, 2018 at 07:34:03AM +0100, Daniel Lange wrote:
> On 13.11.18 at 23:09, Moritz Muehlenhoff wrote:
> > The current data structure works very well for us and splitting the files
> > has many downsides.
> 
> Could you detail what those many downsides are besides the scripts that
> need to be amended?

Nearly all the tasks of actually editing the data require a look at the complete
data, e.g. to check whether something was tracked before, whether there's an ITP
for something, whether something was tracked as NFU in the past and lots more.

Cheers,
Moritz



Bug#908678: Testing the filter-branch scripts

2018-11-14 Thread Guido Günther
Hi,
On Tue, Nov 13, 2018 at 11:09:41PM +0100, Moritz Muehlenhoff wrote:
> On Tue, Nov 13, 2018 at 12:22:54PM -0500, Antoine Beaupré wrote:
> > But before going through that trouble, I think we'd need to get approval
> > from the security team first, as that's quite a lot of work. I figured
> > we would make a feasibility study first...
> 
> The current data structure works very well for us and splitting the files
> has many downsides.
> 
> If we can't get the repository to run on salsa in a manner that doesn't
> impact other repositories (e.g. by disabling the repository browser or
> similar), then moving the security tracker repository out of Salsa is
> the more likely solution.
> 
> Did anyone follow Guido's suggestion to report this upstream to
> get their assessment on possible optimisations?

Just in case someone takes this upstream: I've filed

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913124

against git a couple of days ago.
Cheers,
 -- Guido



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Daniel Lange
On 13.11.18 at 23:09, Moritz Muehlenhoff wrote:
> The current data structure works very well for us and splitting the files
> has many downsides.

Could you detail what those many downsides are besides the scripts that
need to be amended?



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Moritz Muehlenhoff
On Tue, Nov 13, 2018 at 12:22:54PM -0500, Antoine Beaupré wrote:
> But before going through that trouble, I think we'd need to get approval
> from the security team first, as that's quite a lot of work. I figured
> we would make a feasibility study first...

The current data structure works very well for us and splitting the files
has many downsides.

If we can't get the repository to run on salsa in a manner that doesn't
impact other repositories (e.g. by disabling the repository browser or
similar), then moving the security tracker repository out of Salsa is
the more likely solution.

Did anyone follow Guido's suggestion to report this upstream to
get their assessment on possible optimisations?

Cheers,
Moritz



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Antoine Beaupré
On 2018-11-13 18:14:54, Daniel Lange wrote:
>> The Python job finished successfully here after 10 hours.
> 6h40 mins here as I ported your improved logic to the python2 version :).
>
> # git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
> Rewrite 1169d256b27eb7244273671582cc08ba88002819 (68356/68357) (24226 seconds passed, remaining 0 predicted)
> Ref 'refs/heads/master' was rewritten
>
> The tree-filter blows up the .git/objects store to 13G though.
> But nothing a git gc can't fix.

Ah but that's because the old repository is still in there. You need to
clone the repo in a clean copy:

git clone file://$PWD/security-tracker security-tracker-filtered

To get the minimal version, I even did that twice, although I'm not sure
that's necessary.

[...]

>> I looked at splitting that file per CVE. That did not scale and just
>> created new problems. But splitting by *year* seems like a very
>> efficient switch, and I think it would be worth pursuing that idea
>> forward.
>
> The tools in bin/ would need a brush through. I.e. throw away the
> unused ones and amend the ones that are used on data/CVE/* to learn
> about the split files.

Oh yes, lots of work remains, whether we keep the history or not. That's
probably the *most* work we need to do.

But before going through that trouble, I think we'd need to get approval
from the security team first, as that's quite a lot of work. I figured
we would make a feasibility study first...

a.
-- 
One recognizes the greatness and the worth of a nation by the way it
treats its animals.
- Mahatma Gandhi



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Daniel Lange
> The Python job finished successfully here after 10 hours.
6h40 mins here as I ported your improved logic to the python2 version :).

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite 1169d256b27eb7244273671582cc08ba88002819 (68356/68357) (24226 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

The tree-filter blows up the .git/objects store to 13G though.
But nothing a git gc can't fix.

> 
> I did some tests on the new git repository. Cloning the repository from
> scratch takes around 2 minutes (the original repo: 21 minutes).
Confirmed.

> So that's about it. I have not done a thorough job at checking the
> actual *integrity* of the results. It's difficult, considering CVE
> identifiers are not sequential in the data/CVE/list file, so a naive
> diff like this will fail:
> 
> $ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} ) data/CVE/list | diffstat
>  list |106562 +--
>  1 file changed, 53281 insertions(+), 53281 deletions(-)
> 
> But at least the numbers add up: it looks like no line is lost. And
> indeed, it looks like all CVEs add up:
> 
> $ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} | grep ^CVE | sort -n ) <( grep ^CVE data/CVE/list | sort -n  ) | diffstat
>  0 files changed
> 
> A cursory look at the diff seems to indicate it is clean, however.

I uploaded "my" version to https://people.debian.org/~dlange/
so people can poke the log and diffs and see whether there are any
issues left.

> I looked at splitting that file per CVE. That did not scale and just
> created new problems. But splitting by *year* seems like a very
> efficient switch, and I think it would be worth pursuing that idea
> forward.

The tools in bin/ would need a brush through. I.e. throw away the
unused ones and amend the ones that are used on data/CVE/* to learn
about the split files.



Bug#908678: Testing the filter-branch scripts

2018-11-13 Thread Antoine Beaupré
On 2018-11-12 12:22:58, Antoine Beaupré wrote:
> I'll start a run on the whole history to see if I can find any problems,
> as soon as a first clone finishes resolving those damn deltas. ;)

The Python job finished successfully here after 10 hours.

I did some tests on the new git repository. Cloning the repository from
scratch takes around 2 minutes (the original repo: 21 minutes). It is
145MB while the original repo is 1.6GB.

Running git annotate on data/CVE/list.2018 takes about 26 seconds, while
it takes basically forever to annotate the original data/CVE/list. (It's
been running for 10 minutes here already.)

So that's about it. I have not done a thorough job at checking the
actual *integrity* of the results. It's difficult, considering CVE
identifiers are not sequential in the data/CVE/list file, so a naive
diff like this will fail:

$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} ) data/CVE/list | diffstat
 list |106562 +--
 1 file changed, 53281 insertions(+), 53281 deletions(-)

But at least the numbers add up: it looks like no line is lost. And
indeed, it looks like all CVEs add up:

$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} | grep ^CVE | sort -n ) <( grep ^CVE data/CVE/list | sort -n  ) | diffstat
 0 files changed

A cursory look at the diff seems to indicate it is clean, however.
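
For anyone who wants to repeat the second check without the shell
pipeline, here is a rough Python equivalent. It is a sketch only:
cve_headers and same_cves are names invented for this example, and it
compares just the lines starting with "CVE" (as the grep above does), not
the detail lines underneath them.

```python
import glob
from collections import Counter


def cve_headers(paths):
    """Count every line starting with b'CVE' across the given files."""
    counts = Counter()
    for path in paths:
        # Read as bytes: old revisions of the list are not clean UTF-8.
        with open(path, "rb") as handle:
            counts.update(
                line.rstrip(b"\n") for line in handle if line.startswith(b"CVE")
            )
    return counts


def same_cves(original, split_glob):
    """True when the split files hold exactly the CVE lines of the original."""
    return cve_headers([original]) == cve_headers(glob.glob(split_glob))
```

It would be called along the lines of
same_cves("data/CVE/list", "../security-tracker-full-test-filtered-bis/data/CVE/list.*").
Counting instead of sorting keeps duplicate lines from cancelling out
unnoticed.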

I looked at splitting that file per CVE. That did not scale and just
created new problems. But splitting by *year* seems like a very
efficient switch, and I think it would be worth pursuing that idea
forward.

A.

-- 
There is no cloud, it's just someone else's computer.
   - Chris Watterson



Bug#908678: Testing the filter-branch scripts

2018-11-12 Thread Antoine Beaupré
On 2018-11-10 18:56:01, Daniel Lange wrote:
> Antoine,
>
> thank you very much for your filter-branch scripts.

you're welcome! glad it can be of use.

> I tested each:
>
> 1) the golang version:
> It completes after 3h36min:
>
> # git filter-branch --tree-filter '/split-by-year' HEAD
> Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds passed, remaining 0 predicted)
> Ref 'refs/heads/master' was rewritten
>
> But it doesn't Close() the os.OpenFile handles so ...
> all data/CVE/list. files are 0 bytes long. Sic!

Well. That explains part of the performance difference. ;)

There were multiple problems with the golang source - variable shadowing
and, yes, a missing Close(). Surprisingly, the fixed version is
*slower* than the equivalent Python code, taking about one second per
run or 1102 seconds for the last 1000 commits. I'm at a loss as to how I
managed to make go run slower than Python here (and can't help but think
C would have been easier, again). Probably poor programming on my
part. New version attached.

[...]

> 2.1) the Python version
> You claim #!/usr/bin/python3 in the shebang, so I tried that first:
>
> # git filter-branch --tree-filter '/usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc' HEAD
> Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds passed, remaining 2254 predicted)
> Traceback (most recent call last):
>   File "split-by-year.py", line 13, in <module>
>   File "/usr/lib/python3.5/codecs.py", line 321, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: invalid start byte
> tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc

I suspected this would be a problem, but didn't find any occurrence in
the shallow clone so I forgot about it. Note that the golang version
takes great care to treat the data as binary...

> The offending commit is:
> * 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, since we're working on etch now.
>   Sorry for the probable annoyance, but it had to be done. (13 years ago) [Joey Hess]
>
> There will be many more like this, so for Python3
> this needs to be made unicode-agnostic.

... so I rewrote the thing to handle only binary and tested it against
that version of the file. It seems to work fine.

> Notice I compiled the .py to .pyc which makes it
> much faster and thus well usable.

Interesting. I didn't see much difference in performance in my
benchmarks on average, but the worst-case run did improve by 150ms, so I
guess this is worth the trouble. For those who didn't know (like me)
this means running:

python -m compileall bin/split-by-year.py

Whenever the .py file changes (right?).
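
As far as I understand it, Python only checks a .pyc against its source
when the module is imported; when the .pyc is handed to the interpreter
directly, as in the filter-branch runs quoted here, a stale file would
simply be executed. So a small guard like the following could help. It is
a sketch: ensure_compiled is a made-up helper, and writing the bytecode
next to the source as "<name>.pyc" is just one possible layout.

```python
import os
import py_compile


def ensure_compiled(source, target=None):
    """Recompile `source` if its bytecode is missing or older; return the .pyc path."""
    if target is None:
        target = source + "c"  # e.g. split-by-year.py -> split-by-year.pyc
    stale = (
        not os.path.exists(target)
        or os.path.getmtime(target) < os.path.getmtime(source)
    )
    if stale:
        # doraise=True turns compile errors into exceptions instead of
        # printing them and carrying on.
        py_compile.compile(source, cfile=target, doraise=True)
    return target
```

Calling ensure_compiled("bin/split-by-year.py") before starting
git filter-branch would then make sure the tree filter runs the current
code.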

> 2.2) Python, when a string was a string .. Python2
> Your code is actually Python2, so why not give that a try:
>
> # git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
> Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds passed, remaining 2954 predicted)
> Traceback (most recent call last):
>   File "split-by-year.py", line 18, in <module>
>     yearly = 'data/CVE/list.{:d}'.format(year)
> NameError: name 'year' is not defined
> tree filter failed: /usr/bin/python2 /split-by-year.pyc
>
> The offending commit is:
> * b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
> | diff --git a/data/CVE/list b/data/CVE/list
> | index 7b5d1d21d6..cdf0b74dd0 100644
> | --- a/data/CVE/list
> | +++ b/data/CVE/list
> | @@ -1,3 +1,4 @@
> | +begin claimed by jmm
> |  CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 before ...)
> |   TODO: check
> |  CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) ip_nat_proto_udp.c in ...)
> | @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS) vulnerabilities in ...)
> |   TODO: check
> |  CVE-2005-3259 (Multiple SQL injection vulnerabilities in versatileBulletinBoard (vBB) ...)
> |   TODO: check
> | +end claimed by jmm
> |  CVE-2005- [Insecure caching of user id in mantis]
> |   - mantis  (bug #330682; unknown)
> |  CVE-2005- [Filter information disclosure in mantis]
>
> As you see, the line "+begin claimed by jmm" breaks the overly simplistic
> parser logic. Unfortunately, such errors do not show up when dry-running
> against a current version of data/CVE/list. The "violations" of the file
> format are transient and buried in history.

Hmm... That's a trickier one. I guess we could just pretend that line
doesn't exist and drop it from history... But I chose to buffer it and
treat it like the CVE line so it gets attached to the right file. See if
it does what you expect.

   git cat-file -p b59da20b82:data/CVE/list > data/CVE/list.b59da20b82
   split-by-year.py data/CVE/list.b59da20b82
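
To make the buffering idea concrete, here is a minimal, self-contained
sketch of it. This is not the attached script: split_by_year here is a
simplified stand-in that works on a bytes blob in memory, and the rule
"an unindented non-CVE line belongs to the next CVE header" is my reading
of the approach described above.

```python
import re
from collections import defaultdict

# Bytes pattern, so stray Latin-1 bytes such as 0xf6 never need decoding.
CVE_HEADER = re.compile(rb"^CVE-(\d{4})-")


def split_by_year(data):
    """Return {year: bytes} for the contents of one data/CVE/list blob."""
    chunks = defaultdict(list)
    pending = []  # unattached lines waiting for the next CVE header
    year = None
    for line in data.splitlines(keepends=True):
        match = CVE_HEADER.match(line)
        if match:
            year = int(match.group(1))
            chunks[year].extend(pending)  # buffered lines go first
            pending.clear()
            chunks[year].append(line)
        elif line[:1] in (b" ", b"\t") and year is not None:
            chunks[year].append(line)  # detail line of the current entry
        else:
            pending.append(line)  # e.g. b"begin claimed by jmm\n"
    if pending and year is not None:
        chunks[year].extend(pending)  # trailing lines stay with the last entry
    # Note: a blob without any CVE header at all yields an empty dict here.
    return {y: b"".join(lines) for y, lines in chunks.items()}
```

With this rule, "begin claimed by jmm" ends up directly above the CVE it
precedes, and "end claimed by jmm" travels with the entry that follows it,
so nothing is dropped from history.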

Performance-wise, I shaved off a surprising 60ms by enclosing all the
code in a function 

Bug#908678: Testing the filter-branch scripts

2018-11-10 Thread Daniel Lange
Antoine,

thank you very much for your filter-branch scripts.

I tested each:

1) the golang version:
It completes after 3h36min:

# git filter-branch --tree-filter '/split-by-year' HEAD
Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

But it doesn't Close() the os.OpenFile handles so ...
all data/CVE/list. files are 0 bytes long. Sic!

I can reproduce that just running the golang executable
against a current checkout of data/CVE/list.

# go version
go version go1.10.3 linux/amd64
(Stretch backport golang-go 2:1.10~5~bpo9+1)

2.1) the Python version
You claim #!/usr/bin/python3 in the shebang, so I tried that first:

# git filter-branch --tree-filter '/usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc' HEAD
Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds passed, remaining 2254 predicted)
Traceback (most recent call last):
  File "split-by-year.py", line 13, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: invalid start byte
tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc

The offending commit is:
* 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, since we're working on etch now.
  Sorry for the probable annoyance, but it had to be done. (13 years ago) [Joey Hess]

There will be many more like this, so for Python3
this needs to be made unicode-agnostic.

Notice I compiled the .py to .pyc which makes it
much faster and thus well usable.

2.2) Python, when a string was a string .. Python2
Your code is actually Python2, so why not give that a try:

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds passed, remaining 2954 predicted)
Traceback (most recent call last):
  File "split-by-year.py", line 18, in <module>
    yearly = 'data/CVE/list.{:d}'.format(year)
NameError: name 'year' is not defined
tree filter failed: /usr/bin/python2 /split-by-year.pyc

The offending commit is:
* b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
| diff --git a/data/CVE/list b/data/CVE/list
| index 7b5d1d21d6..cdf0b74dd0 100644
| --- a/data/CVE/list
| +++ b/data/CVE/list
| @@ -1,3 +1,4 @@
| +begin claimed by jmm
|  CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 before ...)
|   TODO: check
|  CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) ip_nat_proto_udp.c in ...)
| @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS) vulnerabilities in ...)
|   TODO: check
|  CVE-2005-3259 (Multiple SQL injection vulnerabilities in versatileBulletinBoard (vBB) ...)
|   TODO: check
| +end claimed by jmm
|  CVE-2005- [Insecure caching of user id in mantis]
|   - mantis  (bug #330682; unknown)
|  CVE-2005- [Filter information disclosure in mantis]

As you see, the line "+begin claimed by jmm" breaks the overly simplistic
parser logic. Unfortunately, such errors do not show up when dry-running
against a current version of data/CVE/list. The "violations" of the file
format are transient and buried in history.

Best,
Daniel