Re: [PATCH] limit diff effort to fix performance issue

Johan Corveleyn Wed, 09 Jun 2021 08:17:55 -0700

On Tue, Jun 8, 2021 at 3:29 PM Nathan Hartman <hartman.nat...@gmail.com> wrote:
>
> On Tue, Jun 8, 2021 at 5:55 AM Stefan Sperling <s...@elego.de> wrote:
> >
> > On Tue, Jun 08, 2021 at 01:45:00AM -0400, Nathan Hartman wrote:
> > > In order to do some testing, I needed some test data that reproduces
> > > the issue; since stsp can't share the customer's 100MB XML file, and
> > > we'd probably want other inputs or sizes anyway, I wrote a program
> > > that attempts to generate such a thing. I'm attaching that program...
> > >
> > > To build, rename to .c extension and, e.g.,
> > > $ gcc gen_diff_test_data.c -o gen_diff_test_data
> > >
> > > To run it, provide two parameters:
> > >
> > > The first is a 'seed' value like you'd provide to a pseudo random
> > > number generator at init time.
> > >
> > > The second is a 'length' parameter that says how long (approximately)
> > > you want the output data to be. (The program nearly always overshoots
> > > this by a small amount.)
> > >
> > > Rather than using the system's pseudo random number generator, this
> > > program includes its own implementation to ensure that users on any
> > > system can get the same results when using the same parameters. So if
> > > different people want to test with the same sets of input, you only
> > > have to share 2 numbers, rather than send each other files >100MB of
> > > useless junk.
> > >
> > > Example: Generate two files of approx 100 MB, containing lots of
> > > differences and diff them:
> > >
> > > $ gen_diff_test_data 98 100m > one.txt
> > > $ gen_diff_test_data 99 100m > two.txt
> > > $ time diff one.txt two.txt > /dev/null
> > >
> > > With the above parameters, it takes my system's diff about 50 seconds
> > > to come up with something that looks reasonable at a glance; svn's
> > > diff has been crunching away for a while now...
> >
> > Thank you Nathan, this is incredibly useful!
> >
> > Would you consider committing this tool to our repository, e.g. somewhere
> > within the tools/dev/ subtree?
>
>
> Sure, done in r1890601.
>
> It's in tools/dev/gen-test-data/gen_diff_test_data.c.
>
> I added the gen-test-data directory in case we want to add other
> sample data generators in the future.

As for test data, I just remembered something: in 2015 Bert developed
a tool called "AnonymizedFileDumper", after some discussions we had
during a hackathon and on IRC (related to blame and diff performance).
With this tool one can create a dump file from (part of) a repository,
with all text lines replaced by their CRC32 checksum (so identical
lines remain identical, but other than that the actual information is
mostly gone). If you combine this with svndumpfilter for stripping out
the log messages and the author names, I think it's pretty much
stripped of all sensitive information (I remember I asked Bert to add
'eliding author names and log messages' as extra features of his dumper
tool, but don't remember whether he eventually got around to that). It
is the perfect tool for creating test data out of real repositories
out there, without leaking company data, so other devs can take a
look.
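
To illustrate the line-replacement idea: below is a minimal Python
sketch, not Bert's code. The function name anonymize_lines is made up
for this example, and it operates on the raw bytes of a single file
rather than on a dump file. I'm also assuming the checksum is written
as 8 hex digits and that line endings are preserved; the real tool may
do either differently.

```python
import zlib


def anonymize_lines(data: bytes) -> bytes:
    """Replace each line's content with its CRC32, keeping line endings.

    Hypothetical helper for illustration only. Identical input lines
    yield identical output lines, so the diff structure is preserved
    while the original text is no longer readable.
    """
    out = []
    # bytes.splitlines() splits only on \n, \r and \r\n.
    for line in data.splitlines(keepends=True):
        body = line.rstrip(b"\r\n")   # line content without its EOL
        eol = line[len(body):]        # the original EOL, if any
        out.append(b"%08x" % zlib.crc32(body) + eol)
    return b"".join(out)


if __name__ == "__main__":
    import sys

    # Read a file on stdin, write the anonymized version to stdout.
    sys.stdout.buffer.write(anonymize_lines(sys.stdin.buffer.read()))
```

Running both sides of a problematic diff through something like this
should give two files with the same pattern of matching and
non-matching lines, though with different line lengths than the
originals. Note that CRC32 is not a cryptographic hash, so short or
guessable lines could in principle be recovered by brute force.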

I don't have the tool handy anymore, but I just searched the IRC logs
and found some mentions of it here:
https://colabti.org/irclogger/irclogger_log_search/svn-dev?search=nonymized&action=search&timespan=20150101-20151231&text=checked

The binary download link still works, but it's a Windows binary. I
don't know if the sources are still available from the sharpsvn
repository.

@Stefan: maybe you can use this to create an anonymized test case out
of the 100 MB XML file of the elego client? You'd need a Windows
machine if you want to use that binary from 2015.

@Bert: WDYT? Are the sources still out there? Maybe this can be ported
to "plain old svn test tools" (perhaps with the help of someone here
that can spend some time on it)?
(and hello by the way :-))

--
Johan
