Re: removing toxic emailers

2021-04-18 Thread Eric S. Raymond
Ian Lance Taylor via Gcc :
> This conversation has moved well off-topic for the GCC mailing lists.
> 
> Some of the posts here do not follow the GNU Kind Communication
> Guidelines (https://www.gnu.org/philosophy/kind-communication.en.html).
> 
> I suggest that people who want to continue this thread take it off the
> GCC mailing list.
> 
> Thanks.
> 
> Ian

Welcome to the consequences of abandoning "You shall judge by the code alone."

This is what it will be like, *forever*, until you reassert that norm.
-- 
        <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-15 Thread Eric S. Raymond
Ian Lance Taylor :
> Patronizing or infantilizing anybody doesn't come into this at all.

I am not even *remotely* persuaded of this.  This whole attitude that if
a woman is ever exposed to a man with less than perfect American
upper-middle-class manners it's a calamity requiring intervention
and mass shunning, that *reeks* of infantilizing women.

> We want free software to succeed.  Free software is more likely to
> succeed if more people work on it.  If you are a volunteer, as many
> are, you can choose to spend your time on the project where you have
> to short-stop unwelcome advances, where you are required to deal with
> "men with poor social skills."  Or you can choose to spend your time
> on the project where people treat you with respect.  Which one do you
> choose?

The one where your expected satisfaction is higher, with boorishness
from autistic males factored in as one of the overheads.  Don't try to
tell me that's a deal-killer, I've known too many women who would
laugh at you for that assumption.

> Or perhaps you have a job that requires you to work on free software.
> Now, if you work on a project where the people act like RMS, you are
> being forced by your employer to work in a space where you face
> unwelcome advances and men who have "trouble recognizing boundaries."
> That's textbook hostile environment, and a set up for you to sue your
> employer.  So your employer will never ask anyone to work on a project
> where people act like that--at least, they won't do it more than once.

Here's what happens in the real world (and I'm not speculating, I was
a BoD member of a tech startup at one time, stuff like this came up).
You say "X is being a jerk - can I work on something else?"  Your
employer, rightly terrified of the next step, is not going to "force"
you to do a damn thing. He's going to bend over backwards to
accommodate you.

> (Entirely separately, I don't get the slant of your whole e-mail.  You
> can put up with RMS despite the boorish behavior you describe.  Great.
> You're a saint.  Why do you expect everyone else to be a saint?

I'm no saint, I'm merely an adult who takes responsibility for my own
choices when dealing with people who have minimal-brain-damage
syndromes.  OK, I have probably acquired a bit more tolerance for
their quirks than average from long experience, but I don't believe I'm
an extreme outlier that way.

What I am pushing for is for everyone to recognize that *women are
adults* - they have their own agency and are in general perfectly
capable of treating an RMS-class jerk as at worst a minor annoyance.

Behaving as though he's some sort of icky monster who should be
shunned by all right-thinking people and taints everything he touches
is ... just unbelievably disconnected from reality.  Bizarre
neo-Puritan virtue signaling of no help to anyone.

If I needed more evidence that many Americans lead pampered,
cossetted, hyper-insulated lives that require them to make up their
own drama, this whole flap would be it.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-15 Thread Eric S. Raymond
Christopher Dimech via Gcc :
> The commercial use of free software is our hope, not our fear.  When people
> at IBM began to come to free software, wanting to recommend it and use it,
> and maybe distribute it themselves or encourage other people to distribute
> it for them, we did not criticise them for not being non-profit virtuous
> enough, or said "we are suspicious of you", let alone threatening them.

Actually, some of us did *exactly* those things late in the last century.

One of the challenges I faced in my early famous years was persuading
the hacker culture as a whole to treat the profit-centered parts of the
economy as allies rather than enemies.

I won't say that a *majority* of us were resistant to this, but I
did have to work hard on the problem for a while, between 1997
and about 2003.
-- 
    <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-15 Thread Eric S. Raymond
David Malcolm :
> > I will, however, point out that it is a very *different* point from
> > "RMS has upset some people and should therefore be canceled".
> 
> Eric: I don't know if you're just being glib, or you're deliberately
> trying to caricature those of us who are upset by RMS's behavior.

My intent was not caricature.  I was being dismissive and snarky
because I genuinely consider the personality complaints against RMS to
be pretty trivial.  Not the managerial ones Joseph Myers listed; those
are serious.  But they're not the cause of the current ruckus.

To make the "triviality" point in the most forceful possible way, I
will take the bull by the horns and directly address RMS's behavior towards
women.  And I will reveal a few things that I haven't talked about in
public for 40 years.

I've known RMS since 1979; I'm fully aware of how obnoxious he can be
towards both men and women. There have been occasions on which I have
thought the state of the universe would have been improved if he'd
gotten a swift slap in the face.

In fact, the first or second time I met him face to face it was
because he was rather determinedly pursuing my then-girlfriend.
A hostile witness might have said he was creeping on her, though
that slang for it wouldn't be invented until much later.

I think an explanation of how I reasoned about that situation has
some value in light of the current attempt to ostracize RMS.

I paid very careful attention to whether my girlfriend appeared to
need any help dealing with him. I regarded her as an adult fully
capable of making her own decisions.  One of those decisions could
have been to slap his face.  If a more severe sanction had been
required, and she had yelled for help, I would cheerfully have
punched his lights out.

No fisticuffs were required.  She gently discouraged him, and we both
established friendly relations with him.  In later years RMS and I
remained fairly close long after I broke up with that girlfriend.  He
made passes at at least two of my later girlfriends that I know of,
including the woman I am still married to.  In all cases, I trusted
these ladies to handle the situation like adults, and they did.  It
really would not have occurred to me to do otherwise.

I hear a lot of talk about RMS's behavior towards women being some sort
of vast horrible transgression that will drive all women everywhere to
flee from ever being contributors to FSF projects.  To me this seems
just silly, and very infantilizing of women in general.  My
girlfriends were entirely able to

(1) short-stop his advances when they became unwelcome

(2) understand that some men have poor social skills and
trouble recognizing boundaries,

(3) and *stay on friendly terms with him anyway*.

I mean I saw this not just more than once, but every single time it
came up.

I don't assume that any adult female is incapable of these things; I
respect women as fully capable of asserting and defending their
interests, I *expect* women to do that, and I thus consider a lot of the
white-knighting on their behalf to be at best empty virtue signaling
and at worst a cover for much more discreditable motives.

Of course, he offends men too.  When I deal with RMS, I know that I'm
going to have to cope with a certain amount of unpleasantness because
he has autism-like deficits amplified by some unfortunate personal
history.  Yes.  So what?  He's one of my oldest friends anyway.  He
has many admirable qualities; I respect and value him even when I have
to argue with him.  And I can work with him when I need to.

Why in the *hell* should I assume anyone with female genitalia is
incapable of doing the same?  More to the point, why is anybody else
making such a silly, reductive assumption and then turning it into a
galloping moral panic that somehow justifies stoning RMS and driving
him out of the village?

*grumble* Get *over* yourselves.  You want to be "welcoming" to
women?  Don't patronize or infantilize them - respect their ability to
tell off RMS for themselves *and then keep working with him*!
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-15 Thread Eric S. Raymond
Adrian via Gcc :
> Eric S. Raymond :
> > there is actually a value conflict between being "welcoming" in that
> sense and the actual purpose of this list, which is to ship code.
> 
> Speaking as a "high functioning autist", I'm aware of the difficulties that
> some of us have with social interactions - and also that many of us
> construct a persona or multiple personae to interact with others, a
> phenomenon known as "masking".
> 
> I understand why "Asshole" can function as a viable mask for many people,
> because there are cultures where it's tolerated, particularly in
> remote-working groups like mailing lists, where physical altercations are
> unlikely and no-one has to confront the results of their interactions with
> others if they don't want to.
> 
> It doesn't necessarily follow that "smart" == "asshole" though.

I did not intend that claim.

I intended the weaker observation that driving away a large number of
smart autistic assholes (and non-assholes with poor social skills)
is not necessarily a good trade for the people the project might
recruit by being "more welcoming".

Possibly that *would* be a good trade.  I have decades of experience
that makes me doubt this.  I think the claim needs to be examined
skeptically, not just uncritically accepted because we value being
"nice".

In general, I think efforts to guilt-bomb hackers into being "more
inclusive" should be resisted without a clear grasp on what we might
be throwing away by accepting them.  Just because you live inside a
culture doesn't mean you can predict what mutating its assumptions
will do to it, and we have work to do that should not be casually
disrupted.

Note: I am not an autist myself, so I'm not guarding my own flanks
here.  I'm sort of autist-sympathetic, in that I think it is a good
thing autists can join the hacker culture and have a place where their
quirks are useful and tolerated.  I would be a little sad if that were
lost.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-15 Thread Eric S. Raymond
Paul Koning via Gcc :
> > On Apr 14, 2021, at 4:39 PM, Ian Lance Taylor via Gcc wrote:
> > So we don't get the choice between "everyone is welcome" and "some
> > people are kicked off the list."  We get the choice between "some
> > people decline to participate because it is unpleasant" and "some
> > people are kicked off the list."
> > 
> > Given the choice of which group of people are going to participate and
> > which group are not, which group do we want?
> 
> My answer is "it depends".  More precisely, in the past I would have
> favored those who decline because the environment is unpleasant --
> with the implied assumption being that their objections are
> reasonable.  Given the emergence of cancel culture, that assumption
> is no longer automatically valid.

I concur on both counts.

You (the GCC project) are no longer in a situation where any random
person saying "your environment is hostile" is a reliable signal of a
real problem.  Safetyism is being gamed by outsiders for purposes that
are not yours and have nothing to do with shipping good code.

Complaints need to be discounted accordingly, to a degree that would
not have been required before the development of a self-reinforcing
culture of complaint and rage-mobbing around 2014.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-15 Thread Eric S. Raymond
Joseph Myers :
> On Wed, 14 Apr 2021, Eric S. Raymond wrote:
> 
> > I'm not judging RMS's behavior (or anyone else's) one way or
> > another. I am simply pointing out that there is a Schelling point in
> > possible community norms that is well expressed as "you shall judge by
> > the code alone".  This list is not full of contention from affirming
> > that norm, but from some people's attempt to repudiate it.
> 
> Since RMS, FSF and GNU are not contributing code to the toolchain and 
> haven't been for a very long time, the most similar basis to judge them 
> would seem to be based on their interactions with toolchain development.  
> I think those interactions generally show that FSF and GNU have been bad 
> umbrella organizations for the toolchain since at least when the GCC 4.4 
> release was delayed waiting for a slow process of developing the GCC 
> Runtime Library Exception.

I do not have standing to argue this point.

I will, however, point out that it is a very *different* point from
"RMS has upset some people and should therefore be canceled".
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-14 Thread Eric S. Raymond
Nathan Sidwell :
> The choice to /not/ have a policy for ejecting jerks has serious costs. One
> of those costs is the kind of rancorous dispute that has been
> burning like a brushfire on this list the last few weeks.

The situation isn't that symmetrical.  The brushfire didn't happen when it
was a norm here that off-list behavior was not the list's business.  It
only came about when some people decided that norm should no longer apply.

I'm not judging RMS's behavior (or anyone else's) one way or
another. I am simply pointing out that there is a Schelling point in
possible community norms that is well expressed as "you shall judge by
the code alone".  This list is not full of contention from affirming
that norm, but from some people's attempt to repudiate it.

(For those of you unfamiliar with the concept, a Schelling point is
one of natural equilibrium in a two- or multi-player game, such that
when you move away from it all parties' decision costs go way up.)
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-14 Thread Eric S. Raymond
Nathan Sidwell :
> I'd just like to eject the jerks, because they make the place unwelcoming.

I understand the impulse.  The problem is that there is actually a value
conflict between being "welcoming" in that sense and the actual purpose
of this list, which is to ship code.

It's a much more direct conflict in the hacker culture than elsewhere
because so many potential contributors are high-functioning autists.
That makes the downstream consequences of politeness enforcement a lot more
damaging to the project's ability to ship code than they would otherwise be.

There is a hypothetical world, of course, in which jerks and assholes
are such a huge problem that they interfere measurably with shipping
code.  But contemplate the amount of angry verbiage on this list
recently from people who could have been using their fingers typing
code, and I think it's clear that the amount of social friction
produced by attempts to eject the jerks will be far higher than if
you simply continued to tolerate them.
-- 
    <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: removing toxic emailers

2021-04-14 Thread Eric S. Raymond
Nathan Sidwell :
> Do we have a policy about removing list subscribers that send abusive or
> other toxic emails?  do we have a code of conduct?  Searching the wiki or
> website finds nothing.  The mission statement mentions nothing.

I'm not a GCC insider, but I know a few things about the social
dynamics of voluntarist subcultures. You might recall I wrote a book
about that once.

The choice to have a policy for ejecting jerks has serious costs.
One of those costs is the kind of rancorous dispute that has been
burning like a brushfire on this list the last few weeks.  Another -
particularly serious for hackers - is that such a policy is hostile to
autists and others who have poor interaction skills but can ship good
code.  This is a significant percentage of your current and future
potential contributors, enough that excluding them is a real problem.

Most seriously: the rules, whatever they are, will be gamed by people
whose objectives are not "ship useful software". You will be fortunate
if the gamers' objectives are as relatively innocuous as "gain points
in monkey status competition by beating up funny-colored monkeys";
there are much worse cases that have been known to crash even projects
with nearly as much history and social inertia as this one.

Compared to these costs, the overhead of tolerating a few jerks and
assholes is pretty much trivial.  That's hard to see right now because
the jerks are visible and the costs of formal policing are
hypothetical, but I strongly advise you against going down the Code of
Conduct route regardless of how fashionable that looks right now.  I
have forty years of observer-participant anthropology in intentional
online communities, beginning with the disintegration of the USENET
cabal back in the 1980s, telling me that will not end well.

You're better off with an informal system of moderator fiat and
*without* rules that beg to become a subject of dispute and
manipulation. A strong norm about off-list behavior and politics being
out of bounds here is also helpful.

You face a choice between being a community that is about shipping code
and one that is embroiled in perpetual controversy over who gets to
play here and on what terms.  Choose wisely.
-- 
        <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




The dust seems to have settled from the repository conversion

2020-09-26 Thread Eric S. Raymond
The dust seems to have settled from the GCC repository conversion.  I
haven't seen any complaints about the conversion since it was
finalized in January, so I'm gathering there have not been any
significant problems with it.

Unfortunately, it left *me* with a problem.

If you're on this list, more than likely you have a full-time job that
pays you for working on open-source code.  Twenty years ago I sold the
business world on the value of open-source shared infrastructure, so
you can partly thank me for the fact that you have that option.

Ironically, I myself have benefitted very little from that successful
persuasion, because the work I do is not closely enough tied to
anything a corporation knows it can monetize.  Who has a business case
for developing something like reposurgeon?

I spent most of a year - thousands of hours - focusing on the
technical issues associated with the GCC conversion.  Because I'm not on
salary anywhere, paying bills and not having steady income during that
time blew a pretty large hole in my savings account.  Now my house
needs a new roof, and I have medical bills, and things are looking
rather grim.

This wasn't the first public infrastructure project I've worked on,
and it certainly won't be the last.  Reposurgeon, GPSD, NTPsec, giflib
- if you have found my work valuable and it gives you confidence that
I will continue to do useful things, please subscribe at one of these
places:

https://www.subscribestar.com/esr

https://www.patreon.com/esr

Finally, be aware that I am not the only person in this sort of
situation.  If you feel motivated to tackle the more general problem of
load-bearing Internet people without salaries, please look at

http://loadsharers.org

take the pledge, and find two load-bearers to support who aren't me.
-- 
    <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


Re: Help with new GCC git workflow...

2020-01-15 Thread Eric S. Raymond
Richard Biener :
> > I like to write really fine-grained commits when I'm developing, then
> > squash before pushing so the public repo commits always go from "tests
> > pass" to "tests pass".  That way you can do clean bisections on the
> > public history.
> 
> The question is whether one could achieve this with branches?  That is,
> have master contain a merge commit from a branch that contains the
> fine-grained commits?  Because for forensics those can be sometimes
> useful.

Of course you can do this.

Git gives you a number of different possibilities here. You get to choose
based on how you like your history to look.

Discussion of my choice is here:

https://blog.ntpsec.org/2017/04/09/single-head-provable-steps.html
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Help with new GCC git workflow...

2020-01-14 Thread Eric S. Raymond
Peter Bergner :
> At this point, I get a little confused. :-)  I know to submit my patch
> for review, I'll want to squash my commits down into one patch, but how
> does one do that?  Should I do that now or only when I'm ready to
> push this change to the upstream repo or ???  Do I need to even do that?

If you want to squash a commit series, the magic is git rebase -i. You
give that a number of commits to look back at and you'll get a buffer
instructing you how to squash and shuffle that series.  You'll also be able
to edit the commit message.
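
For readers who have not seen it: the buffer that a command such as
`git rebase -i HEAD~3` opens is a plain list of `pick <hash> <subject>`
lines, oldest first.  Changing `pick` to `squash` on every line after
the first folds the series into a single commit.  The Python sketch
below performs that edit on a todo buffer held in a string.  The
function name `squash_all` and the sample hashes and subjects are
invented for illustration; only the `pick`/`squash` keywords and the
`#` comment lines come from git itself.

```python
def squash_all(todo: str) -> str:
    """Rewrite a `git rebase -i` todo buffer so the whole series is squashed.

    The first `pick` is kept (it is the commit the rest are folded into);
    every later `pick` becomes `squash`.  Comment and blank lines pass
    through unchanged.
    """
    out = []
    seen_first = False
    for line in todo.splitlines():
        words = line.split(maxsplit=1)
        if words and words[0] in ("pick", "p"):
            if seen_first:
                rest = words[1] if len(words) > 1 else ""
                line = ("squash " + rest).rstrip()
            seen_first = True
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical buffer for a three-commit series (hashes are invented).
SAMPLE = """\
pick 1a2b3c4 Add parser skeleton
pick 5d6e7f8 Fix typo in parser
pick 9a0b1c2 Add parser tests

# Rebase instructions follow as comment lines...
"""

if __name__ == "__main__":
    print(squash_all(SAMPLE), end="")
```

Once the edited buffer is saved, git replays the series and stops to
let you write the combined commit message.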

I like to write really fine-grained commits when I'm developing, then
squash before pushing so the public repo commits always go from "tests
pass" to "tests pass".  That way you can do clean bisections on the
public history.
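
The bisection point can be made concrete.  If every public commit
either passes or fails the tests cleanly, and a regression stays
broken once introduced, finding the first bad commit is a binary
search, which is what `git bisect` automates.  A minimal Python
sketch, assuming a hypothetical `first_bad` helper and a made-up
eight-commit history:

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first commit for which is_good() is false.

    Assumes commits[0] is good, commits[-1] is bad, and that the history
    flips from good to bad exactly once.  Intermediate commits that fail
    for unrelated reasons (a half-finished change, say) break that
    assumption, which is the argument for squashing before pushing.
    """
    lo, hi = 0, len(commits) - 1   # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Made-up history: the regression lands in "c5" and is never fixed.
HISTORY = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
BROKEN = {"c5", "c6", "c7", "c8"}

if __name__ == "__main__":
    print(first_bad(HISTORY, lambda c: c not in BROKEN))
```

With eight commits the search needs three test runs rather than
walking the whole history.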

> Also, when I'm ready to push this "change" upstream to trunk, I'll need
> to move this over to my master and then push.  What are the recommended
> commands for doing that?

There are a couple of ways.  I usually squash as described above
then use "git cherry-pick".  But that's because I have philosophical
reasons to avoid long-lived branches.

>   I assume I need to rebase my branch to
> current upstream master, since that probably has moved forward since
> I checked my code out.

Yes, in general you'll want to do that.

> Also, at what point do I write my final commit message, which is different
> than the (possibly simple) commit messages above?  Is that done after I've
> pulled my local branch into my master?  ...or before?  ...or during the
> merge over?

I do it at rebase -i time along with the squash of the series.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: git conversion in progress

2020-01-11 Thread Eric S. Raymond
Thomas Koenig :
> Hm... I just hope this is a one-time effect, and isn't an indication
> that git uses much more resources, server-side, so the current
> infrastructure is not up to the task.  Is git that much more
> resource hungry than svn? Or is this unrelated?

Almost certainly unrelated. In normal use git is *spectacularly* faster than
SVN on equivalent operations.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Proposal for the transition timetable for the move to GIT

2020-01-10 Thread Eric S. Raymond
Bernd Schmidt :
> I was on the fence for a long time, since I felt that the rewritten
> reposurgeon was still somewhat unproven.

And that was a fair criticism for a short while, until the first compare-all
verification on the GCC history came back clean.

The most difficult point in the whole process for me was in late
November.  That was when I faced up to the fact that, while I had a
Subversion dump reader that was 95% good, (1) that 5% could
disqualify it for this complex a history, and (2) I wasn't going to
be able to solve that last 5% without tearing down most of the reader
and rebuilding it.

The problem was that I'd been patching the dump reader to fix edge
cases for too long, and the code had rigidified. Too many auxiliary
data structures with partially overlapping semantics - I couldn't
change anything without breaking everything. Which is the universe's
way of telling you it's time for a rewrite.

Of course the risk was that I wouldn't get that rewrite done in time
for deadline. But I had two assets that mitigated the risk. One was
a couple of very sharp collaborators, Julien Rivaud and Daniel Brooks
(and later another, Edward Cree). The other was having a really good
test suite, and a well-established procedure for integrating new
tests that jsm and rearnshaw were able to use.

It was (as the Duke of Wellington famously said) a damned near-run
thing. With all those advantages, if I had waited even a week longer
to make the crucial scrap-and-rebuild decision, the new reader might
have landed too late.

There's a lesson in here somewhere. When I figure out what it is, I'll
put it in my next book.
-- 
            <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Rescue of prehistoric GCC versions

2020-01-09 Thread Eric S. Raymond
Joseph Myers :
> I want to consider the conversion machinery essentially frozen at this 
> point and not to add any new features not present in the conversion now 

Very well, I won't push the integration change for those commits.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Proposal for the transition timetable for the move to GIT

2020-01-09 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> I want to also take this opportunity to thank Maxim for the work he has
> done.  Having that fallback option has meant that we could press harder for
> a timely solution and has also driven several significant improvements to
> the overall result.  I do not think we would have achieved as good a result
> overall if he hadn't developed his scripts.

Yes. Reposurgeon's ChangeLog processing, in particular, was significantly
improved using lessons learned from Maxim's scripts.
-- 
    <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Rescue of prehistoric GCC versions

2020-01-09 Thread Eric S. Raymond
I have been able to rescue or reconstruct from patches the following
prehistoric GCC releases:

gcc-0.9
gcc-1.21
gcc-1.22
gcc-1.25
gcc-1.26
gcc-1.27
gcc-1.28
gcc-1.35

gcc-1.36
gcc-1.37.1
gcc-1.38
gcc-1.39
gcc-1.40
gcc-1.41
gcc-1.42
gcc-2.1
gcc-2.2.2
gcc-2.3.3
gcc-2.4.5
gcc-2.5.8
gcc-2.6.3
gcc-2.7.2
gcc-2.8.0

The gap in the sequence represents the beginning of the repository
history; r3 = gcc-1.36.

The 0.9 to 1.35 tarballs can be glued to the front of the
history, one commit each, with a firewall commit containing a deleteall
to keep the content from leaking forward.  This is an issue because
the early parts of the repo don't have complete trees.
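
To illustrate the idea (this is not the `ancients` script, and it has
not been run against the real conversion), here is a Python sketch
that emits a git fast-import stream with one commit per tarball
followed by an empty-tree firewall commit.  `deleteall` is a real
fast-import file command; the function names, the committer identity,
the zero timestamp, and the tiny file contents are all placeholders
made up for the example.

```python
from typing import Dict, Iterable, Optional, Tuple


def commit_stanza(mark: int, parent: Optional[int], message: str,
                  files: Dict[str, str]) -> str:
    """One fast-import commit whose tree is exactly `files`."""
    lines = [
        "commit refs/heads/master",
        f"mark :{mark}",
        # Placeholder identity and timestamp, not real project data.
        "committer Nobody <nobody@example.com> 0 +0000",
        f"data {len(message.encode())}",
        message,
    ]
    if parent is not None:
        lines.append(f"from :{parent}")
    # Start from an empty tree so nothing is inherited from the parent.
    lines.append("deleteall")
    for path, content in sorted(files.items()):
        lines.append(f"M 100644 inline {path}")
        lines.append(f"data {len(content.encode())}")
        lines.append(content)
    return "\n".join(lines) + "\n\n"


def ancient_prefix(releases: Iterable[Tuple[str, Dict[str, str]]]) -> str:
    """One commit per release tarball, then an empty-tree firewall commit."""
    out = []
    parent = None
    mark = 0
    for name, files in releases:
        mark += 1
        out.append(commit_stanza(mark, parent, f"Import {name} tarball", files))
        parent = mark
    # The firewall: a commit with deleteall and no files, so later
    # partial-tree commits do not inherit tarball content.
    out.append(commit_stanza(mark + 1, parent, "Firewall: clear the tree", {}))
    return "".join(out)


if __name__ == "__main__":
    demo = [("gcc-0.9", {"README": "old\n"}),
            ("gcc-1.21", {"README": "newer\n"})]
    print(ancient_prefix(demo), end="")
```

The firewall matters because a fast-import commit inherits its
parent's tree unless told otherwise; without the empty-tree commit,
files from the last tarball would silently appear in the first
repository revisions.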

I'm now testing a conversion on the Great Beast that puts these in
place. If all goes well I will push this capability to the public
conversion repository later today.

You can audit the reconstruction process by reading the script I wrote
to automate it:

https://gitlab.com/esr/gcc-conversion/blob/master/ancients

Unfortunately, I was only able to find valid patch chains to three
releases that don't have complete tarballs.

If anyone else can scrounge up materials that could help complete
the fossil sequence, now would be a really good time for that.  We
have only three days at most left to integrate them.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

The object of life is not to be on the side of the majority, but to
escape finding oneself in the ranks of the insane.
   -- Marcus Aurelius


Re: GIT conversion: question about tags & release branches

2020-01-09 Thread Eric S. Raymond
Martin Liška :
> > Anyway, please check Joseph's next candidate to see if this shows what you 
> > expect -- I think it should be out later today.
> 
> I'll check it once it's published.

Everybody: time is growing short before the final conversion, so if you
see anything that looks wrong or anomalous please send up a rocket
*immediately*.  The faster you let us know, the more likely it is we'll
be able to nip in with a fix while that is still possible.
-- 
        <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Proposal for the transition timetable for the move to GIT

2020-01-08 Thread Eric S. Raymond
Maxim Kuvyrkov :
> Once gcc-reparent conversion is regenerated, I'll do another round of 
> comparisons between it and whatever the latest reposurgeon version is.

Thanks, Maxim. Those comparisons have been very helpful to Joseph and
Richard and to the reposurgeon devteam as well.

They use your feedback to find places where their comment-processing
scripts could be improved; we've used it to learn what additional
oddities in ChangeLogs we need to be able to handle automatically.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Proposal for the transition timetable for the move to GIT

2019-12-30 Thread Eric S. Raymond
Joseph Myers :
> To me, that indicates that using a conversion tool that is conservative in 
> its heuristics, and then selectively applying improvements to the extent 
> they can be done safely with manual review in a reasonable time, is better 
> than applying a conversion tool with more aggressive heuristics.

There's a more general point here, which I'm developing in my
book-in-progress.

Clean data-conversion problems can be done algorithmically without a
human in the loop.  Messy data-conversion problems need judgment
amplifiers.

Maxim's scripts try to treat a messy conversion problem as though it
were a clean one. Maxim is pretty sharp, so this almost works. Almost.
But the failure mode is predictable - overinterpreting badly-formed
input leads to plausible garbage on output.  

When this happens, it's the Goddess Eris's way of telling you that
there needs to be human judgment in the loop.  Instead of trying to
automate it out, you should be building tools that partition the process
into things a computer does well, driven by choices a human makes well.

This is a point that needs making because programmers thrown at messy
conversion problems tend to be more fixated on achieving full
automation than they perhaps ought to be.

Elsewhere I have written of Zeno tarpits:

http://esr.ibiblio.org/?p=6772

Subversion dump streams are not quite a Zeno tarpit - they actually
obey something that has the effect of a formal specification - but
ChangeLog parsing is.

> The issues with the reposurgeon conversion listed in Maxim's last comments 
> were of the form "reposurgeon is being conservative in how it generates 
> metadata from SVN information".  I think that's a very good basis for 
> adding on a limited set of safe improvements to authors and commit 
> messages that can be done reasonably soon and then doing the final 
> conversion with reposurgeon.

The flip side of this is that Joseph has been making intelligent and
realistic suggestions for how to improve reposurgeon.  That is
*invaluable* - it captures knowledge that will make future comparisons
easier and better.

Software engineers (outside of a few AI specialists) don't ordinarily
think of themselves as being in the knowledge-capture business. But
it's a useful perspective to cultivate.
-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Git conversion: fixing email addresses from ChangeLog files

2019-12-29 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> Weak in the sense that it isn't proof given that the user name is
> partially redacted.  There's nothing in the gcc archives that gives a
> full name either, unfortunately.
> 
> Yes, it's the most likely match, but there's still an element of doubt.
> 
> R.

https://groups.google.com/forum/#!msg/comp.databases.sybase/Uz8ICef9Qr8/uPwanH6is60

If you open his message to Michel Peppler, you'll see a sig block that
says:

 bjo...@planetarion.com  Bjørn Wennberg, Fifth Season AS

It's him, yep.  Be sure to get the ø right when you fill in the name. :-)
-- 
    <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




Re: Proposal for the transition timetable for the move to GIT

2019-12-29 Thread Eric S. Raymond
Joseph Myers :
> The case you mention is one where there was a merge to a branch not from 
> its immediate parent but from an indirect parent.  I don't think it would 
> be hard to support detecting such merges in reposurgeon.

We're working on it.

> This is an example where the originally added ChangeLog entry was 
> malformed (had the date in the form "2004-0630"), so a conservatively safe 
> approach was taken of using the committer rather than trying to guess what 
> a malformed ChangeLog entry means and risk extracting nonsense.
> 
> I expect other cases are being similarly careful in cases where there was 
> a malformed ChangeLog entry or a commit edited ChangeLog entries by other 
> authors so leaving its single-author nature ambiguous.  Parsing 
> ChangeLogs, especially where malformed entries are involved, is inherently 
> a heuristic matter.

As Joseph says, one of reposurgeon's design principles is "First, do no harm."

And yes, changelogs are full of malformations and junk like this. I
saw and dealt with a lifetime's worth while converting the Emacs
history from bzr to git.

If you try to interpret any random garbage in, you will assuredly
get garbage out when you least expect it. Often the cost of this 
sort of mistake is not fully realized until it is far too late
for correction.  This is *why* reposurgeon is conservative.

The correct thing for reposurgeon to do is flag unparseable entry
headers for human intervention, and as of today it does that.
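
To make the "flag, don't guess" policy concrete, here is a minimal sketch in Python. It is not reposurgeon's code; the regular expression and the function names are assumptions chosen for illustration, based only on the conventional "YYYY-MM-DD  Name  <email>" ChangeLog header layout. Anything that fails to match, such as the "2004-0630" date quoted above, is handed back for human review rather than interpreted.

```python
import re
from datetime import datetime

# Conventional GNU ChangeLog entry header: "YYYY-MM-DD  Name  <email>".
# This pattern is an illustrative assumption, not reposurgeon's actual parser.
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^<>\s]+@[^<>\s]+)>\s*$")

def parse_header(line):
    """Return (date, name, email), or None if a human should look at the line."""
    match = HEADER.match(line)
    if match is None:
        return None                      # e.g. a date written as "2004-0630"
    date, name, email = match.groups()
    try:
        datetime.strptime(date, "%Y-%m-%d")   # rejects dates like 2004-13-45
    except ValueError:
        return None
    return date, name, email

def triage(lines):
    """Split header lines into parsed attributions and lines flagged for review."""
    parsed, flagged = [], []
    for line in lines:
        result = parse_header(line)
        if result is None:
            flagged.append(line)
        else:
            parsed.append(result)
    return parsed, flagged
```

The design choice is the one argued for above: a malformed header yields no attribution at all, so the converter can fall back to the committer instead of extracting nonsense.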
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Git conversion: fixing email addresses from ChangeLog files

2019-12-29 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> Also, for this one:
> 
> #  "47044": "",
> 
> There's some (relatively weak) evidence that this is Bjørn Wennberg (eg
> https://groups.google.com/forum/#!msg/comp.databases.sybase/Uz8ICef9Qr8/uPwanH6is60J),
> but in the absence of stronger evidence, I'm going to just put bjornw as
> the name.

What's weak about that?  The full email address matches.  Unless you
think there are two hackers named Bjorn, with a last initial of W,
running around using the same email address, I think we have a winner.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: The far past of GCC

2019-12-29 Thread Eric S. Raymond
Mark Wielaard :
> Apparently less complete, but there is also
> https://ftp.gnu.org/old-gnu/gcc/
> Which does have some old diff files to reconstruct some missing versions.

There are quite a few ancient preserved release tarballs out there.
Here is the list of reconstructable pre-r3 releases as I now know it:

0.9 ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-0.9.tar.bz2
1.21ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.21.tar.bz2
1.22ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.22.tar.bz2
1.23ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.23-1.24.bz2
1.24ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.24-1.25.bz2
1.25ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.25-1.26.bz2
1.26ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.26-1.27.bz2
1.27ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.27.tar.bz2
1.28ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.28-1.29.bz2
1.29ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.29-1.30.bz2
1.30ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.30-1.31.bz2
1.31ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.31.tar.bz2
1.32ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.31-1.32.bz2
1.33ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.32-1.33.bz2
1.34ftp://gcc.gnu.org/pub/gcc/old-releases/patches/gcc.diff-1.32-1.34.bz2
1.35ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.35.tar.bz2

It looks like the relevant bits of 
ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-[12]
and ftp://sourceware.org/pub/gcc/old-releases/gcc-[12]

Incorporating these will be easy. What I would do is write a script that
does this:

(a) checks to see if each tarball is mirrored locally

(b) if not, fetches it, applying forward or back diffs from the nearest whole
version as required.

(c) generates a sequence of reposurgeon incorporate commands to be included
in the main lift script
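
A rough Python sketch of the planning part of steps (a) and (b) follows. It only decides which files still need fetching and which whole tarball each diff-only release would be patched from; it does no network access and does not attempt step (c), because the exact form of the generated incorporate commands is not given here. The function names and the "gcc.diff-" file-name test are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlparse

def file_name(url):
    """Last path component of a release URL."""
    return posixpath.basename(urlparse(url).path)

def is_whole(url):
    """True for a complete release tarball, False for an inter-release diff."""
    return not file_name(url).startswith("gcc.diff-")

def plan(releases, mirrored):
    """releases: ordered (version, url) pairs; mirrored: file names already local.

    Returns (urls_to_fetch, notes). notes maps each version either to "whole"
    or to the nearest whole release it would have to be patched from.
    Assumes the list contains at least one whole tarball.
    """
    whole = [i for i, (_, url) in enumerate(releases) if is_whole(url)]
    to_fetch, notes = [], {}
    for i, (version, url) in enumerate(releases):
        if file_name(url) not in mirrored:
            to_fetch.append(url)            # step (b): not mirrored yet
        if is_whole(url):
            notes[version] = "whole"
        else:
            # Nearest whole version by position in the list; ties go to the
            # earlier release, i.e. a forward diff.
            base = min(whole, key=lambda j: abs(j - i))
            notes[version] = "patch from " + releases[base][0]
    return to_fetch, notes
```

Fed the list above, this would mark 1.23 as patched forward from 1.22 and 1.26 as patched back from 1.27.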

sbb says r3 is 1.36.  I doubt r1 and r2 are anything other than
Subversion directory creations, but people with easier access than me
should check.

After this life gets a little trickier. We have the following tarballs
that might be of interest:

1.36r3  ftp://gcc.org/pub/gcc/old-releases/gcc-1/gcc-1.36.tar.bz2
1.37?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.37.tar.bz2
1.38?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.38.tar.bz2
1.39?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.39.tar.bz2
1.40?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.40.tar.bz2
1.41?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.41.tar.bz2
1.42?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.42.tar.bz2
2.0 r358ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.8.tar.bz2
2.1 r586ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.1.tar.bz2
3.2.2   ?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.2.2.tar.bz2
2.3.3   ?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.3.3.tar.bz2
2.4.5   ?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.4.5.tar.bz2
2.5.8   ?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.5.8.tar.bz2
2.6.3   ?   ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.6.3.tar.bz2
2.7.2   r10608  ftp://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-2.7.2.tar.bz2

Before we can do anything with these, we need to identify which Subversion
revision each one with a ? belongs to.  I've added three of ssb's
identifications.  For completeness I note those for which we have no tarballs:

r1184 = 2.2, r2674 = 2.3.1, r4493 = 2.4.0 "minus two swapped commits",
r5867 = 2.5.0, r7771 = 2.6.0, r9996 = 2.7.0.

This reconstruction is being tracked here: 
https://gitlab.com/esr/gcc-conversion/issues/4
-- 
    http://www.catb.org/~esr/";>Eric S. Raymond




Re: The far past of GCC

2019-12-29 Thread Eric S. Raymond
Jeff Law :
> I believe RCS was initially used circa 1992 on the FSF machine which
> held the canonical GCC sources.

That year sounds right - it's when I wrote the original vc.el for Emacs
and a lot of Emacs users who hadn't been using version control started to.

Doesn't give us a Subversion revision, though.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Git conversion: fixing email addresses from ChangeLog files

2019-12-29 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> I've just commented that one out for now; if anybody knows the correct
> addresses, please let me know.  Also, there's one joint list that I've
> not attempted to fix at this time.

> #  "28488": "Jim Kingdon <http://developer.redhat.com>",

That's Jim Kingdon the former CVS dev - I think he was involved in Subversion 
early too.

He's king...@cyclic.com or king...@panix.com, according to my back
mail, but since I think I remember that he did work at RedHat in the
late '90s, king...@redhat.com would be a good bet too.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Git conversion: fixing email addresses from ChangeLog files

2019-12-28 Thread Eric S. Raymond
Joseph Myers :
> Concretely, what I'd suggest is: convert ISO-8859-1 entries in the 
> checked-in list to UTF-8, removing anything that thereby becomes a 
> duplicate or unnecessary; handle anything whose encoding isn't simply 
> ISO-8859-1 or UTF-8 via a hardcoded entry in bugdb.py using hex escapes 
> like the existing such entries there.  Once the checked-in list is pure 
> UTF-8 it's easier for people to review and edit.  Where the issue is only 
> presence of ISO-8859 NBSP, or "" or () around the names, remove that in 
> the checked-in list and again remove duplicates.  That way the list can be 
> limited to non-encoding variations.

Be aware that reposurgeon has a "transcode" command for moving
a specified set of objects to UTF-8 from a specified encoding.
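
For the checked-in name list itself, the cleanup Joseph describes can be sketched in a few lines of Python. This is an illustration only, independent of both bugdb.py and reposurgeon's transcode command; the function names are invented for the example, and the fallback rule (try UTF-8 first, otherwise treat the bytes as ISO-8859-1) is an assumption that will misread the rare ISO-8859-1 string that also happens to be valid UTF-8.

```python
def to_text(raw):
    """Decode bytes as UTF-8 when valid, otherwise fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

def normalize_name(raw):
    """Return a cleaned-up name as a Python string, ready to write out as UTF-8."""
    # str.split() with no arguments treats NBSP (U+00A0) as whitespace,
    # so this also replaces ISO-8859 NBSP with a plain space.
    text = " ".join(to_text(raw).split())
    # Drop a single layer of "" or () wrapped around the whole name.
    if len(text) >= 2 and (text[0], text[-1]) in {('"', '"'), ("(", ")")}:
        text = text[1:-1].strip()
    return text

def dedupe(entries):
    """Normalize every entry and keep the first occurrence of each result."""
    seen, out = set(), []
    for raw in entries:
        name = normalize_name(raw)
        if name not in seen:
            seen.add(name)
            out.append(name)
    return out
```

Entries that differ only in encoding, NBSP, or surrounding quotes collapse to one line, which leaves the list limited to genuine, non-encoding variations.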
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




The far past of GCC

2019-12-28 Thread Eric S. Raymond
In moving the history of a project old enough to have used
more than one version-control system, I think it's good practice
to mark the strata.  I'm even interested in pinning down the 
RCS-to-CVS cutover, if there's enough evidence to establish that.

I've added an issue to the tracker about this:

https://gitlab.com/esr/reposurgeon/issues/224

If you have knowledge of the relevant dates or SVN revisions, please
leave a comment on the issue.

I'm making this a public request because there was talk of gluing 
very old, pre-CVS tarballs to the history. Reposurgeon has primitives
to do this gracefully because one of my projects, INTERCAL, was old
enough to have pre-CVS tarballs and I felt there was value in preserving
that ancient history.

I think there is rather more value in preserving GCC's ancient history!
If nothing else, there are very few data sets on codebase growth with
as long a timespan.

Therefore, if you know where I can retrieve pre-CVS tarballs of GCC,
please leave the URLs in a comment on that issue thread.  I know about
the official GCC download page; the oldest tarball on it is evidently
from 1997, and I assume that is well after the project was CVSed. I'm
looking for older sources.
-- 
    http://www.catb.org/~esr/";>Eric S. Raymond

The spirit of resistance to government is so valuable on certain occasions, 
that I wish it always to be kept alive.  It will often be exercised when 
wrong, but better so than not to be exercised at all. I like a little 
rebellion now and then. -- Thomas Jefferson, letter to Abigail Adams, 1787


Re: Test GCC conversion with reposurgeon available

2019-12-27 Thread Eric S. Raymond
Andreas Schwab :
> On Dez 25 2019, Eric S. Raymond wrote:
> 
> > That's easily fixed by adding a timezone entry to your author-map
> > entry - CET, is it?
> 
> The time zone is not constant.

Congratulations, you have broken one of reposurgeon's assumptions.

It is possible to use reposurgeon's DSL to set committer TZ on a
selected set of commits; if you want to work up a patch for the
lift script we'll take it.
-- 
        http://www.catb.org/~esr/";>Eric S. Raymond




Re: Proposal for the transition timetable for the move to GIT

2019-12-27 Thread Eric S. Raymond
Joseph Myers :
> reposurgeon results are fully reproducible (by design, the same inputs to 
> the same version of reposurgeon should produce the same output as a 
> git-fast-import stream,

Designer confirms, and adds that we have a *very* stringent test suite
to verify this.

Much of it consists of bizarre malformations collected during past
conversions. GCC has added its share.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Proposal for the transition timetable for the move to GIT

2019-12-27 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> Well, personally, I'd rather we didn't throw away data we have in our
> current SVN repo unless it's unpresentable in the final conversion.

I agree with this philosophy. You will have noticed by now, I hope,
that reposurgeon preserves as much as it can, leaving deletions to be 
a matter of user policy.

In the normal case, reposurgeon could save its users a significant
amount of work by being more aggressive about automatically deleting
remnant bits that are merely *very unlikely* to be useful. I deliberately
refused to go that route.

> Merge info is not one of those cases.

Sometimes. Some Subversion mergeinfo operations map to Git's
branch-centric merging.  Many do not, corresponding to cherry-picks
that cannot be expressed in a Git history.

Reposurgeon does a correct but not complete job of translating 
mergeinfos that compose into branch merges.  It handles the simple,
common cases and punts the tricky ones.

More coverage would theoretically be possible, but I don't
have the faintest clue what a general resolution rule would
look like.  Except I'm pretty sure the problem is bitchy-hard
and the solution really easy to get subtly wrong.

Frankly, I don't want to touch this mess with insulated
tongs. Somebody would have to offer me serious money to
compensate for the expected level of pain.
-- 
        http://www.catb.org/~esr/";>Eric S. Raymond




Re: Proposal for the transition timetable for the move to GIT

2019-12-27 Thread Eric S. Raymond
Maxim Kuvyrkov :
> Removing auto-generated .gitignore files from reposurgeon conversion
> would allow comparison of git trees vs gcc-pretty and gcc-reparent
> beyond r195087.  So, while we are evaluating the conversion
> candidates, it is best to disable conversion features that cause
> hard-to-workaround differences.

I was going to write that feature yesterday, then Julien nipped in and
did it while my back was turned.  It's a read option,
--no-automatic-ignores.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversion with reposurgeon available

2019-12-26 Thread Eric S. Raymond
Vincent Lefevre :
> What matters is that the date is correct. I don't think the timezone
> matters (that's why SVN doesn't store timezone information, I assume),
> possibly except for the committer himself (?). For instance,

Subversion doesn't store timezone because all commits are considered
to have occurred at UTC time on a central repository.

I think time as well as date matters because sometimes the order of
commits can be significant information, even when they landed on the
same day.
-- 
    http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversion with reposurgeon available

2019-12-26 Thread Eric S. Raymond
Toon Moene :
> On 12/26/19 10:30 PM, Eric S. Raymond wrote:
> 
> > Me, I don't understand why version-control systems designed for distributed
> > use don't ignore timezones entirely and display all times in UTC - relative
> > time is surely more important than the commit time's relationship to solar
> > noon wherever the keyboard happened to be. But I don't make these decisions.
> 
> So we are going to base this world wide free software endeavor on a source
> code system that doesn't keep time by UTC ?

They all *do* keep time by UTC.

What confuses me is why they ever try to *display* anything other than UTC.
It seems pointless to me to ever display local time in clients, but they do it
anyway. 

Without that complication, there would be no need to track user timezones.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversion with reposurgeon available

2019-12-26 Thread Eric S. Raymond
Vincent Lefevre :
> > Here's why you want to get timezones right: there are going to be times
> > when the order of commits is significant information for a developer's
> > understanding of what happened.  But without a timezone you only know 
> > the actual time of a commit to 24-hour resolution.
> 
> I don't understand what you mean. What matters for the order of
> commits is the global time, and this is what SVN stores. SVN does not
> store timezone information, i.e. it has no idea of what local time of
> the user had, but I don't think this is important information.

UTC time plus a timezone offset is what git stores.  That's not the
locus of the problem.

In Subversion-land there's never any doubt about the sequence of commits;
the revision numbers tell you that.  In Git-land you have to go by timestamps,
and if a timezone entry is wrong it can skew the displayed time.
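
A small Python illustration of that skew, using git's raw date form of seconds since the epoch followed by a +HHMM or -HHMM offset. The timestamp and both offsets are made up for the example, and parse_git_date is a helper written here, not a git or reposurgeon API.

```python
from datetime import datetime, timedelta, timezone

def parse_git_date(raw):
    """Parse git's raw date form: '<seconds since epoch> <+HHMM or -HHMM>'."""
    seconds, offset = raw.split()
    sign = -1 if offset.startswith("-") else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    return datetime.fromtimestamp(int(seconds), timezone(sign * delta))

# The same instant, recorded with two different (hypothetical) offsets.
as_cet = parse_git_date("1577329200 +0100")
as_pst = parse_git_date("1577329200 -0800")

print(as_cet.isoformat())   # 2019-12-26T04:00:00+01:00
print(as_pst.isoformat())   # 2019-12-25T19:00:00-08:00
print(as_cet == as_pst)     # True: the stored instant is identical
```

The stored instant, and therefore any ordering computed from it, is unaffected by the offset; only the rendered local time moves, here far enough to land on a different calendar day.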

Me, I don't understand why version-control systems designed for distributed
use don't ignore timezones entirely and display all times in UTC - relative
time is surely more important than the commit time's relationship to solar
noon wherever the keyboard happened to be. But I don't make these decisions.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Proposal for the transition timetable for the move to GIT

2019-12-26 Thread Eric S. Raymond
Alexandre Oliva :
> I don't see that it does (help).  Incremental conversion of a missed
> branch should include the very same parent links that the conversion of
> the entire repo would, just linking to the proper commits in the adopted
> conversion.  git-svn can do that incrementally, after the fact; I'm not
> sure whether either conversion tool we're contemplating does, but being
> able to undertake such recovery seems like a desirable feature to me.

It's all in what you have in the lift script.  Reposurgeon can do any kind
of branch surgery you want, and that can be added to the conversion pipeline
and replicated every time.

> >From what I read, he's doing verifications against SVN.  What I'm
> suggesting, at this final stage, is for us to do verify one git
> converted repo against the other.

There are no tools for that, and probably won't be unless somebody
revives repodiffer. There isn't a lot of time left in the schedule for
that, and I have my hands full fixing other glitches.  (Minor issues
about parsing ChangeLogs and generated .gitignores; the serious
problems are well behind us at this point.)

> Maxim appears to be doing so and finding (easy-to-fix) problems in the
> reposurgeon conversion; it would be nice for reposurgeon folks to
> reciprocate and maybe even point out problems in the gcc-pretty
> conversion, if they can find any, otherwise the allegations of
> unsuitability of the tools would have to be taken on blind faith.

Joseph has already made the call to go with a reposurgeon-based
conversion for reasons he explained in detail on this list. Given
that, it really doesn't make any sense for me to do any of what
you're proposing with time I could use working on Joseph's RFEs
instead.

If you're concerned about the quality of reposurgeon's conversion,
you'd be a good person to work on a comparison tool. Should I email you
a copy of the repodiffer code as it last existed in my repository?
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Proposal for the transition timetable for the move to GIT

2019-12-26 Thread Eric S. Raymond
Alexandre Oliva :
> On Dec 25, 2019, "Eric S. Raymond"  wrote:
> 
> > Reposurgeon has a reparent command.  If you have determined that a
> > branch is detached or has an incorrect attachment point, patching the
> > metadata of the root node to fix that is very easy.
> 
> Thanks, I see how that can enable a missed branch to be converted and
> added incrementally to a converted repo even after it went live, at
> least as long as there aren't subsequent merges from a converted branch
> to the missed one.  I don't quite see how this helps if there are,
> though.

There's also a command for cutting parent links, if that helps.

> Could make it a requirement that at least the commits associated with
> head branches and published tags compare equal in both conversions, or
> that differences are known, understood and accepted, before we switch
> over to either one?  Going over all corresponding commits might be too
> much, but at least a representative random sample would be desirable to
> check IMHO.

repotool compare does that, and there's a production in the conversion
makefile that applies it.

As Joseph says in another reply, he's already doing a lot of the 
verifications you are suggesting.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversion with reposurgeon available

2019-12-25 Thread Eric S. Raymond
Joseph Myers :
> On Wed, 25 Dec 2019, Andreas Schwab wrote:
> 
> > On Dez 25 2019, Joseph Myers wrote:
> > 
> > > Timezones for any email address can be specified in gcc.map for any 
> > > authors wishing to have an appropriate timezone used for their commits.
> > 
> > But that should not be used for unrelated authors.
> 
> It's not.
> 
> On investigation, I think you are referring to the conversion of r269472.  
> That was committed for you by Jim Wilson and thus has you as author and 
> Jim Wilson as committer and Jim Wilson's timezone entry has been applied.  
> So the argument here is that the author's timezone information should be 
> applied to the author date, and the committer's timezone information 
> should be applied to the committer date.  I expect that should be 
> straightforward (although when coming from SVN, there's also an argument 
> that we only have committer dates so the committer timezone is the 
> relevant one to apply).

There's also an FSF policy about Changelogs that's relevant, I think.

Git sometimes fills in the author field from the committer, and
Changelog parsing is done only after translation. That's probably the
source of this bug.

If anybody cares enough to file a bug with a test load attached, I
can probably fix this.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversion with reposurgeon available

2019-12-25 Thread Eric S. Raymond
Segher Boessenkool :
> The goal is not to pretend we never used SVN.

One of *my* goals is that the illusion of git back to the beginning of
time should be as consistent as possible.

> The goal is to have a Git repo that is as useful as possible for us.

Exactly.  I've already written about minimizing cognitive friction.

Here's why you want to get timezones right: there are going to be times
when the order of commits is significant information for a developer's
understanding of what happened.  But without a timezone you only know 
the actual time of a commit to 24-hour resolution.

There is no way we'll get this perfect.  But there is more wrong and
less wrong, and reposurgeon tries hard to be less wrong.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversion with reposurgeon available

2019-12-25 Thread Eric S. Raymond
Andreas Schwab :
> Definitely not.  I have never authored or committed any revision in the
> -0800 time zone.

That's easily fixed by adding a timezone entry to your author-map
entry - CET, is it?  That will prevent reposurgeon from making any
attempt to deduce your timezone.

It would be interesting to know how reposurgeon got misled.  Most
likely it was by a Changelog entry.  Reposurgeon watches as these are
being processed to see if it can pin an email address to a single timezone
by looking up its TLD in the IANA database.
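
The idea can be sketched as follows. This is a guess at the general shape, not reposurgeon's implementation: the table is a tiny hand-picked excerpt standing in for the IANA country-to-zone data, and deduce_zone is a hypothetical name. A zone is returned only when the address's country-code TLD maps to exactly one zone.

```python
# Illustrative excerpt only; the real IANA tz data covers every country code,
# and the "us" list here is itself incomplete.
ZONES_BY_COUNTRY = {
    "fr": ["Europe/Paris"],
    "nl": ["Europe/Amsterdam"],
    "jp": ["Asia/Tokyo"],
    "us": ["America/New_York", "America/Chicago",
           "America/Denver", "America/Los_Angeles"],
}

def deduce_zone(email):
    """Return a zone name if the TLD pins the address to one zone, else None."""
    tld = email.rsplit(".", 1)[-1].lower()
    zones = ZONES_BY_COUNTRY.get(tld, [])
    # Generic TLDs (com, org, ...) and multi-zone countries give no answer.
    return zones[0] if len(zones) == 1 else None
```

Under this scheme a generic TLD never yields a deduction, so a lookup of this kind cannot by itself explain a -0800 attribution.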

I don't know how that could land you in California, though. Maybe
I ought to be logging timezone deductions so we can trace them back.

Has anyone else seen wrong timezone attributions?
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Proposal for the transition timetable for the move to GIT

2019-12-25 Thread Eric S. Raymond
Segher Boessenkool :
> Or doing what everyone else does: put an empty .gitignore file in
> otherwise empty directories.

That is an ugly kludge that I will have no part of whatsoever.

Conversion artifacts like this are sources of cognitive friction and
confusion that take developers' attention away from the substantive
part of their work.  Each individual one may be minor, but the
cumulative effect can be a chronic distraction that is not less
because developers are unaware or only half-aware of it.

Thus, the goal of a repository converter should be to bridge smoothly
between the native idioms of the source and target systems,
*minimizing* conversion artifacts.

The ideal should be to produce a converted history that looks as much
as possible like it has always been under the target
system. Developers should have no need to know or care that the
history used to be managed differently unless they need to do
something that *unavoidably* crosses that boundary, like looking up a
legacy ID from an old bug report.

Reposurgeon was designed for this goal from the beginning.   
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversion with reposurgeon available

2019-12-25 Thread Eric S. Raymond
Joseph Myers :
> These are all cases covered by the request-for-enhancement issue for 
> adding Co-Authored-by: when the ChangeLog header names multiple authors, 
> as the corresponding de facto git idiom for that case.

I apologize, but I am growing doubtful I can deliver that.  Even if I
can, it may take longer than your conversion schedule allows given
that we've only got five days on the clock.  Here are the problems:

1. I don't have a reduced test case to validate parsing against.

2. The ChangeLog-parsing code is fragile and difficult to modify.
   This is inherent - the syntactic cues it's working with are weak
   and false matches are all too easy.
   
I've got to have 1 before I can even try to deal with 2.
-- 
    http://www.catb.org/~esr/";>Eric S. Raymond




Re: Proposal for the transition timetable for the move to GIT

2019-12-25 Thread Eric S. Raymond
Alexandre Oliva :
> I know very little about reposurgeon, but I'm concerned that, should we
> make the conversion with it, and later identify e.g. missed branches, we
> might be unable to make such an incremental recovery.  Can anyone
> alleviate my concerns and let me know we could indeed make such an
> incremental recovery of a branch missed in the initial conversion, in
> such a way that its commit history would be shared with that of the
> already-converted branch it branched from?

Reposurgeon has a reparent command.  If you have determined that a
branch is detached or has an incorrect attachment point, patching the
metadata of the root node to fix that is very easy.

> Now, would it be too much of a burden to insist that the commit graphs
> out of both conversions be isomorphic, and maybe mappings between the
> commit ids (if they can't be made identical to begin with, that is) be
> generated and shared, so that the results of both conversions can be
> efficiently and mechanically compared (disregarding expected
> differences) not only in terms of branch and tag names and commit
> graphs, but also tree contents, commit messages and any other metadata?
> Has anything like this been done yet?

On the GCC repository, no. 

There are very serious practical problems with full verification of
git against SVN stemming mainly from the fact that Subversion checkout
on a repository of this size is extremely slow. IIRC Joseph at one
point estimated a check time on the order of months due to that
overhead alone.

If you're talking about a commit-by-commit comparison between two
conversions that assumes one or the other is correct, that is
theoretically possible and - because git retrieval is much faster -
could theoretically be done in a reasonable amount of time.  But there
is a lot of devil in the practical details.

The reposurgeon suite once included a tool for such comparisons.
Last year this happened:

commit b8a609925ba70a6b68f9eda1d748eb667ad2fa59
Author: Eric S. Raymond 
Date:   Fri Aug 24 12:40:46 2018 -0400

Retire repodiffer.  Its only use case was checks against git-svn...

...which we now know to make such bad conversions that on larger than
trivial repos the differ would be prohibitively noisy.

Maxim's scripts probably make a better conversion than bare git-svn,
because he uses git-svn only for linear basic blocks and thereby
avoids its worst failure modes. In theory I could dust off repodiffer
and apply it.

That's in theory. In practice, on a repository this size I am not
greatly optimistic about getting a result that could be interpreted by
a Mark I brain.  The reasons go beyond git-svn's brain damage to the
same ontological-mismatch problems that make SVN-to-git conversion a 
headache in general.

You might think at least there'd be a 1:1 correspondence between
commits in the two conversions, but that's not going to be true for a
couple of different reasons.

1. Split commits. Reposurgeon decomposes these into pieces one per 
git branch.  I don't know what Maxim's scripts do.  I think Joseph turned
up that there are over a thousand of these in the GCC history.

2. There are three classes of commits in Subversion that don't really fit 
the git data model, (1) directory creation/deletion commits, (2) directory
copy commits, (3) property changes with no associated blob.

For each of these exceptional commits a converter to Git has a choice
of dropping the commit, turning it into some sort of annotated tag, or
leaving it in place as a zero-op commit (anomalous but not forbidden
in the git model). It is pretty much guaranteed that different
converters will make different choices about these, which will make
for huge amounts of noise in your attempt at a diff.

Checking for DAG isomorphism: again, theoretically possible,
practically pretty daunting.  It could be worse - general graph
isomorphism is not even known to be polynomial-time - but in this case
we can label corresponding commits with matching legacy IDs, which
should make possible an isomorphism check in linear time with a trivial
algorithm.
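
A minimal sketch of that trivial algorithm, in Python. It assumes each conversion has already been reduced to a mapping from legacy ID to the ordered list of its parents' legacy IDs, and that every commit carries exactly one legacy ID that is unique within its conversion. The function name and data layout are illustrative, not part of any existing tool.

```python
def compare_dags(left, right):
    """Compare two commit graphs keyed by legacy ID.

    left, right: dict mapping legacy ID -> list of parent legacy IDs.
    Returns a list of (legacy_id, problem) pairs; empty means same shape.
    Runs in time linear in the number of commits plus parent links.
    """
    mismatches = []
    for legacy_id, parents in left.items():
        if legacy_id not in right:
            mismatches.append((legacy_id, "missing on right"))
        elif right[legacy_id] != parents:
            mismatches.append((legacy_id, "parents differ"))
    for legacy_id in right:
        if legacy_id not in left:
            mismatches.append((legacy_id, "missing on left"))
    return mismatches
```

Because the labels fix which node corresponds to which, no search over candidate mappings is needed; the check is one pass over each graph.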

Well, except for split commits. That one would be solvable, albeit
painful.

The real problem here would be mergeinfo links.  It's not even obvious
what "correct" mapping of mergeinfo links is, in general, due to the
mismatch between Subversion's cherry-pick-based merge model and git's
branch merging.  Again, different converters will make different
choices. Reconciling them would be not fun.

There is another world of hurt lurking in "(disregarding expected
differences)".  How do you know what differences to expect? How are
you going to specify them?  What will interpret that spec?

There are more months of work here - nasty, wearing toil, with no
guarantee of a result with a decent signal-to-noise ratio.  Even
though I'm quite literally the best-qualified person on earth to do
it, I flinch at the thought.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




The new Subversion reader in reposurgeon is complete for GCC purposes

2019-12-24 Thread Eric S. Raymond
This morning Julien Rivaud and I fully qualified the new Subversion
dump stream reader against reposurgeon's test suite. This is the same
code Joseph Myers has been using recent versions of to make test 
conversions of the GCC history that appear correct.

We believe reposurgeon is now feature-complete for a full and correct
GCC conversion.  Caveat: The repository is too large for verification
on every single revision to be practical.

We have five remaining minor issues, mostly related to user-generated
.gitignore files (as opposed to files generated from svn:ignore
properties) that should not affect the GCC conversion. We expect to
fix these over the next few days, anyway.

We have one remaining RFE from Richard Earnshaw that would
be nice to have, but is not essential. I'll be working on that.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond

If I were to select a jack-booted group of fascists who are 
perhaps as large a danger to American society as I could pick today,
I would pick BATF [the Bureau of Alcohol, Tobacco, and Firearms].
-- U.S. Representative John Dingell, 1980


Re: Test GCC conversions (publicly) available

2019-12-19 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> > No, I was thinking more of rearnsha bailing out to handle a family emergency
> > and muttering something about not being back for a couple of weeks. If 
> > that's
> > been resolved I haven't heard about it.
> 
> I don't think that should affect things, as I think Joseph has a good handle
> on what needs to be done and I think I've handed over everything that's
> needed w.r.t. the commit summary reprocessing script.

OK, that's good to know.  I wish you good fortune with the emergency.
-- 
http://www.catb.org/~esr/";>Eric S. Raymond




Re: Test GCC conversions (publicly) available

2019-12-19 Thread Eric S. Raymond
Joseph Myers :
> On Thu, 19 Dec 2019, Eric S. Raymond wrote:
> 
> > There are other problems that might cause a delay beyond the
> > 31st, however. Best if I let Joseph and Richard explain those.
> 
> I presume that's referring to the checkme: bug annotations where the PR 
> numbers in commit messages seem suspicious.  I don't think that's 
> something to delay the conversion unless we're clearly close to having a 
> complete review of all those cases done; at the point of the final 
> conversion we should simply change the script not to modify commit 
> messages in any remaining unresolved suspicious cases.

No, I was thinking more of rearnsha bailing out to handle a family emergency
and muttering something about not being back for a couple of weeks. If that's
been resolved I haven't heard about it.

The only conversion blocker that I know is still live is the wrong attributions
in some ChangeLog cases.  I'm sure we'll get that fixed soon; at this point
I'm more worried about getting the test suite to run clean again.

The scenario I want to avoid is the one where you get a conversion that looks
production-ready before I get my tests cleaned up, you deploy it -
and then I find something during the remainder of my cleanup that implies
a problem with your conversion.

A complicating factor is that I'm getting stale.  I've been going hammer and
tongs at this for nearly three months now, and that's not counting all the
previous time on the Go translation. My defect rate is going up. I need a
vacation or to work on something else for a while and I can't have that yet.

Never mind. We'll get this done.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Test GCC conversions (publicly) available

2019-12-19 Thread Eric S. Raymond
Mark Wielaard :
> Do we already have a new date for when we are making that decision?

I believe Joseph was planning on Dec 31st.

My team's part will be ready - the enabling reposurgeon changes should
be done in a week or so, with most of that being RFEs that could be
dropped if there were real time pressure.

There are other problems that might cause a delay beyond the
31st, however. Best if I let Joseph and Richard explain those.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Unix philosophy vs. poor semantic locality

2019-12-18 Thread Eric S. Raymond
Joseph Myers :
> On Wed, 18 Dec 2019, Eric S. Raymond wrote:
> 
> > And that, ladies and gentlemen, is why reposurgeon has to be as
> > large and complex as it is.
> 
> And, in the end, it *is* complex software on which you build simple 
> scripts.  gcc.lift is a simple script, written in the domain-specific 
> reposurgeon language.

The Patterns crowd speaks of "alternating hard and soft layers".

The design of reposurgeon was driven by two insights:

1. Previous VCS-conversion tools sucked in part because they tried to
be too automatic, eliminating human judgment. Reposurgeon is designed
and intended to be a *judgment amplifier*, doing mechanics and freeing
the human operator to think about conversion policy. Hence the DSL.

2. git fast-import streams are a pretty capable format for interchanging
version-control histories. Not perfect, but good enough that you can gain
a lot by co-opting existing importers and exporters.

Mate the idea of a judgment-amplifying DSL to a structure editor for
git fast-import streams and reposurgeon is what you get.
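
For readers who have not looked at a git fast-import stream, the following
is a minimal sketch in Python of what one contains: a blob, then a commit
that references it by mark. It illustrates the interchange format only, not
reposurgeon's own code or DSL. The committer identity, timestamp, and file
name are made-up placeholders.

```python
def data_block(payload: bytes) -> bytes:
    """A fast-import 'data' section: exact byte count, raw payload, trailing LF."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def minimal_stream() -> bytes:
    """Build a one-blob, one-commit stream (all names here are placeholders)."""
    content = b"hello\n"
    message = b"Initial commit\n"
    parts = [
        b"blob\n",
        b"mark :1\n",                      # the blob gets mark :1
        data_block(content),
        b"commit refs/heads/master\n",
        b"mark :2\n",
        b"committer Example Person <person@example.com> 1576627200 +0000\n",
        data_block(message),
        b"M 100644 :1 README\n",           # file README takes its content from mark :1
    ]
    return b"".join(parts)


stream = minimal_stream()
```

Because every record is explicit text with marks tying the pieces together,
a stream like this can be read, edited as a structure, and written back out,
which is the sense of "structure editor" above.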
-- 
    Eric S. Raymond <http://www.catb.org/~esr/>




Unix philosophy vs. poor semantic locality

2019-12-18 Thread Eric S. Raymond
[New thread]

Segher Boessenkool :
> > And the "simple scripts" argument dismisses the fact that those scripts
> > are built on top of complex software.  It just doesn't hold water IMHO.
> 
> This is the Unix philosophy though!

I'm now finishing a book in which I have a lot to say about this, inspired
in part by experience with reposurgeon.

One of the major concepts I introduce in the book is "semantic
locality".  This is a property of data representations and structures.
A representation has good semantic locality when the context you need
to interpret any individual part of it is reliably nearby.

A classic example of a representation with good semantic locality is a Unix
password file.  All the information associated with a username is 
on one line. It is accordingly easy to parse and extract individual 
records.
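
A minimal sketch of that point in Python, assuming the traditional
seven-field, colon-separated layout. The sample line and the helper name
parse_passwd_line are invented for illustration.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    name: str
    password: str
    uid: int
    gid: int
    gecos: str
    home: str
    shell: str


def parse_passwd_line(line: str) -> PasswdEntry:
    """Parse one passwd-style line; no other line is needed for context."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    name, password, uid, gid, gecos, home, shell = fields
    return PasswdEntry(name, password, int(uid), int(gid), gecos, home, shell)


# Hypothetical record, not taken from any real system.
SAMPLE = "alice:x:1000:1000:Alice Example:/home/alice:/bin/sh"
entry = parse_passwd_line(SAMPLE)
```

The whole parser is one split on one line, which is what good semantic
locality buys you.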

Databases have very poor semantic locality.  So do version-control
systems.  You need a lot of context to understand any individual data
element, and that context can be arbitrarily far away in terms of
retrieval complexity and time.

The Unix philosophy of small loosely-coupled tools has few more
fervent advocates than me. But I have come to understand that
it almost necessarily fails in the presence of data representations
with poor semantic locality.

This constraint can be inverted and used as a guide to good design:
to enable loose coupling, design your representations to have
good semantic locality.

If the Unix password file were an SQL database, could you grep it?
No. You'd have to use an SQL-specific query method rather than a
generic utility like grep that is uncoupled from the specifics of
the database's schema.

The ideal data representation for enabling the Unix ecology of tools
is textual, self-describing, and has good semantic locality.

Historically, Unix programmers have understood the importance of
textuality and self-description.  But we've lacked the concept of
and a term for semantic locality.  Having that allows one to
talk about some things that were hard to even notice before.

Here's one: the effort required to parallelize an operation on
a data structure is inversely proportional to its semantic locality.

If it has good semantic locality, you can carve it into pieces that
are easy units of work for parallelization.  If it doesn't...you
can't. Best case is you'll need locking for shared parts. Worst case
is that the referential structure of the representation is so
tangled that you can't parallelize at all.
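
As a rough sketch of the easy case, here is line-oriented data with good
semantic locality being carved into independent units of work in Python.
The records and the helper shell_of are invented. A thread pool is used
only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def shell_of(line: str) -> str:
    """Extract the login shell; depends on this one line and nothing else."""
    return line.rstrip("\n").split(":")[6]


# Hypothetical passwd-style records.
LINES = [
    "alice:x:1000:1000:Alice Example:/home/alice:/bin/sh",
    "bob:x:1001:1001:Bob Example:/home/bob:/bin/bash",
    "carol:x:1002:1002:Carol Example:/home/carol:/bin/zsh",
]

# Each line is a self-contained unit of work, so no locking is needed.
with ThreadPoolExecutor(max_workers=2) as pool:
    shells = list(pool.map(shell_of, LINES))
```

No comparably simple split exists for a revision history, where interpreting
one commit can require walking arbitrarily far back through its ancestors.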


Version-control systems rely on data structures with very poor
semantic locality.  It is therefore predictable that attacking them
with small unspecialized tools and scripting is...difficult.

It can be done, sometimes, with sufficient cleverness, but the results
are too often like making a pig fly by strapping JATO units to
it. That is to say: a brief and glorious ascent followed by entirely
predictable catastrophe.

Having trouble believing me?  OK, here's a challenge: rewrite GCC's
code-generation stage in awk/sed/m4.  

The attempt, if you actually made it, would teach you that poor
semantic locality forces complexity on the tools that have to deal
with it.

And that, ladies and gentlemen, is why reposurgeon has to be as
large and complex as it is.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-18 Thread Eric S. Raymond
Joseph Myers :
> Nor do I think reposurgeon (or at least the SVN reader, which is the main 
> part engaged here) is significantly more complicated than implied by the 
> task it's performing of translating between the different conceptual 
> models of SVN and git.  I've found it straightforward to produce reduced 
> testcases for issues found, and fixed several of them myself despite not 
> actually knowing Go.  The issues remaining are generally conceptually 
> straightforward to understand the issue and how to fix it.

Let me note for the record that I found Joseph's ability to find and
fix bugs in the reader quite impressive.

Maybe not as impressive as it would have been before the recent
rewrite.  That code used to be a pretty nasty hairball.  It's a lot
cleaner and easier to understand now.

But impedance-matching the two data models is tricky, subtler than it
looks, and has rebarbative edge cases.  Even given the cleanest
possible implementation, troubleshooting it is no mean feat.
-- 
        Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-18 Thread Eric S. Raymond
Jeff Law :
> But it's not that freshly constructed, at least not in my mind.  All
> the experience ESR has from the python implementation carries to the Go
> implementation.

Not only do you have reposurgeon, you have me. I wish this mattered
less than it does.

I have *far* more experience doing big, messy repository moves than
anybody else.  I try to exteriorize that knowledge into the
reposurgeon code and documents as much as I can, but as with other
kinds of expertise a lot of it is implicit knowledge that is only
elicited by practice and answering questions.

On small conversions of clean repositories such implicit expertise
doesn't matter too much. You may be able to pull off a once-and-done
with the tools, especially if they're my tools and you've read all my
stuff on good practice.

As an example, the CVS-to-git conversion of groff didn't really need
me. Lifts from CVS are normally horrible, but the groff devs were the
best I've ever seen at not leaving debris from operator errors in the
history.  Any of them could have read my docs and done a clean
conversion in two hours. Only...there was no way to know that in
advance. The odds were heavily against it.

Emacs was, and GCC is, the messy opposite case.  You guys needed a
seasoned "I know these things so you don't have to" expert more than
you will probably ever really understand. And, sadly, there aren't any
others but me yet.  Nobody else has been interested enough in the
problem to invest the time.

> Where I think we could have done better would have been to get more
> concrete detail from ESR about the problems with git-svn.  That was
> never forthcoming and it's a disappointment.  Maybe some of the recent
> discussions are in fact related to these issues and I simply missed
> that point.

I posted this link before: http://esr.ibiblio.org/?p=6778

I can't actually tell you much more than that. Actually, if I
understood git-svn's failure modes in enough detail to tell you more I
might be less frightened of it.

Mostly what I know is that during several other conversions I have
stumbled across trails of metadata damage for which use of git-svn
seems to have been to blame. Though, admittedly, I'm not certain of
that in any individual case; the ways git-svn screws up are not
necessarily distinguishable from the aftereffects of cvs2svn conversion
damage, or from normal kinds of operator error.

Overall, though, defect rates seemed noticeably higher when git-svn had
been used as a front end. I learned to flinch when people wanting me
to do a full conversion of an SVN repo admitted git-svn had been deployed,
even though I was hard-put to explain why I was flinching.

> I do think we've gotten some details about the "scar tissue" from the
> cvs->svn transition as well as some of our branch problems.  It's my
> understanding reposurgeon cleans this up significantly whereas Maxim's
> scripts don't touch this stuff IIUC.

That's correct.  And again, no blame to Maxim for this; he took a
conventional approach that does as little analysis as it can get away
with, which can be a good tradeoff on smaller, cleaner repositories without
a CVS back-history.

>  There's still
> work going on, but I'd consider the outstanding issues nits and well
> within the scope of what can reasonably still be changing.

Issue list here: 

https://gitlab.com/esr/reposurgeon/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=GCC

Presently 6 items including 2 bugs. One of those bugs may already be
fixed, we're waiting on Joseph's current conversion to see.

Counting time to do all the RFEs requested, polishing, and final review
I think we're looking at another week, maybe a bit less if things go
well.  You guys could get a final conversion under your Yule tree.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Test GCC conversion with reposurgeon available

2019-12-17 Thread Eric S. Raymond
Bernd Schmidt :
> I vote for including .cvsignore files. Their absence makes diff comparisons
> of "git ls-tree" on specific revisions needlessly noisy.

A few minutes ago I implemented and pushed a --cvsignores read option
for Subversion dumps.  That should do what you want.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-16 Thread Eric S. Raymond
Joseph Myers :
> I expect the next conversion run, started after that 
> one finishes, to include both parts of Richard's commit message 
> improvements, as well as an improvement to commit attribution extraction 
> from ChangeLog files (to include attributions from ChangeLog. 
> files, not just plain ChangeLog).

There is also a known but minor bug in ChangeLog mining at branch roots.
I'm working on that and expect to have a fix shortly.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-16 Thread Eric S. Raymond
Jeff Law :
> So unless there's something  Maxim's scripts are getting right that
> aren't by reposurgeon, then reposurgeon is the right choice.

It is still possible that the scripts could get things right that
reposurgeon doesn't. But the reverse question is also valid. Can
Maxim's scripts get everything right that reposurgeon does?

If anyone wants to audit for that, my test suite is open source.  May
the best program win!
-- 
    Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-16 Thread Eric S. Raymond
Segher Boessenkool :
> There is absolutely no reason to trust a system that supposedly was
> already very mature, but that required lots of complex modifications,
> and even a complete rewrite in a different language, that even has its
> own bug tracker, to work without problems (although we all have *seen*
> some of its many problems over the last years), and at the same time
> bad-mouthing simple scripts that simply work, and have simple problems.

Some factual corrections:

I didn't port to Go to fix bugs, I ported for better performance.
Python is a wonderful language for prototyping a tool like this, but
it's too slow and memory-hungry for use at the GCC conversion's
scale.  Also doesn't parallelize worth a damn.

I very carefully *didn't* bad-mouth Maxim's scripts - in fact I have
said on-list that I think his approach is on the whole pretty
intelligent. To anyone who didn't have some of the experiences I have
had, even using git-svn to analyze basic blocks would appear
reasonable and I don't actually fault Maxim for it.

I *did* bad-mouth git-svn - and I will continue to do so until it no
longer troubles the world with botched conversions.  Relying on it is,
in my subject-matter-expert opinion, unacceptably risky. While I don't
blame Maxim for not being aware of this, it remains a serious
vulnerability in his pipeline.

I don't know how it is on your planet, but here on Earth having a
bug tracker - and keeping it reasonably clean - is generally 
considered a sign of responsible maintainership.


In conclusion, I'm happy that you're so concerned about bugs in
reposurgeon. I am too. You're welcome to file issues and help us
improve our already-extensive test suite by shipping us dumps that
produce errors.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-16 Thread Eric S. Raymond
Joseph Myers :
> * As we're part of the free software community as a whole rather than 
> something in isolation, choosing to make a general-purpose tool work for 
> our conversion is somewhat preferable to choosing an ad hoc approach 
> because it contributes something of value for other repository conversions 
> by other projects in future.

That's not just theory or sentiment. Reposurgeon is the best
any-VCS-to-any-VCS converter there is because every time I do a
conversion, I learn things, and that knowledge gets incorporated in
the code and the documentation around it.

Yes, in theory someone else could build a tool as good that
incorporates as much domain knowledge. So far, nobody has tried.  It's
unlikely anyone will, at this point, when they can join my dev team
and get the results they want with much less effort by improving
reposurgeon or one of its auxiliary tools.

Every time that happens, everybody - into the indefinite future - wins.
-- 
    Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-16 Thread Eric S. Raymond
Jeff Law :
> > It may not be my place to say, but...I think the stakes are pretty
> > high here.  If I were a GCC developer, I think I'd want the best
> > possible conversion even if that takes a little longer.
> Well, I'm not sure that's entirely true.

OK, that's a policy choice the GCC project is going to have to make.
I'm just the mechanic here.

Joseph Myers has made his choice.  He has said repeatedly that he
wants to follow through with the reposurgeon conversion, and he's
putting his effort behind that by writing tests and even contributing
code to reposurgeon.

We'll get this done faster if nobody is joggling his elbow. Or mine.
-- 
    Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-16 Thread Eric S. Raymond
Joseph Myers :
> When we're talking about something to be used 
> for the next 20 years we should make sure to get it right.

Segher and others should note that I'm not in the habit of sinking most of
a year of my time into problems that I don't think are extremely
important. This conversion *is* that important.

> conversions with an ad hoc script need much more thorough, trickier 
> validation because you don't benefit from knowing the tool has worked for 
> other conversions).

Nor, as far as I am aware, do the scripts have anything resembling
reposurgeon's test suite.

Segher Boessenkool:
> > If the reposurgeon conversion is not ready now, then it is too late
> > to be selected.

Maxim's conversion pipeline isn't ready either -- there are known
bugs with its result. Does that mean it's too late to select Maxim's
conversion? If so, what do you propose be done?

Please stop bellyaching and pitch in. Whether it's by fixing up
Maxim's conversion, helping improve the reposurgeon one,
or writing a conversion method of your own - I don't much care
and it's not my job to tell you what to do, anyway. Any of those
choices might be helpful; sniping from the sidelines is not.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-16 Thread Eric S. Raymond
Segher Boessenkool :
> > Do people really want to keep tweaking the conversions and postpone the
> > git switchover?
> 
> No.

It may not be my place to say, but...I think the stakes are pretty
high here.  If I were a GCC developer, I think I'd want the best
possible conversion even if that takes a little longer.

jsm28, rearnsha, and my reposurgeon crew are pretty close to a final
deliverable now. We know what the remaining issues are, they're not
major, and we have a strategy for fixing them. Have a little patience,
please.

Better yet, come over to #reposurgeon on freenode and help out. Anyone
who can run tests on a machine with >128GB RAM would be especially
welcome.
-- 
    Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-11 Thread Eric S. Raymond
Jonathan Wakely :
> That's good news and I'm relieved to hear it. Thanks.

Defect resolution has sped up noticeably since jsm28 and rearnsha 
showed up on #reposurgeon and started working directly with my crew.

Relax.  As Joseph reported, we've got this well in hand now. We might
even have a final conversion on the original 16 Dec deadline, though
I'm personally guessing it will take a bit longer than that.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-11 Thread Eric S. Raymond
Jonathan Wakely :
> My concern is that there is no conversion done using reposurgeon that
> *can* be used to do correctness checks.

We can in fact verify revisions of a GCC conversion in place using
repotool compare. Joseph Myers has been using this with reposurgeon's
readlimit to run tests.

Unfortunately, on a repository this large, it's not practical to run a
verification on every single revision. The blocker is the slowness of
svn checkout. In practice, you have to sample key revisions, with
particular attention to those at and just after known metadata
defects.
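
One way such a sampling plan could be sketched in Python is shown below. It
is not how repotool compare chooses revisions. The function name, the fixed
stride, and the example numbers are all assumptions made for illustration.

```python
def sample_revisions(last_rev: int, defect_revs: list[int], stride: int) -> list[int]:
    """Pick revisions to verify: a coarse stride, plus each known defect
    revision and the one just after it, plus the final revision."""
    picked = set(range(stride, last_rev + 1, stride))
    for rev in defect_revs:
        for candidate in (rev, rev + 1):
            if 1 <= candidate <= last_rev:
                picked.add(candidate)
    picked.add(last_rev)
    return sorted(picked)


# Toy numbers only; a real run would use the repository's own defect list.
revs = sample_revisions(last_rev=100, defect_revs=[37, 100], stride=25)
```

The stride keeps the number of slow svn checkouts bounded, while the
per-defect pairs concentrate checking where damage is most likely to show.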

The conversion crew - which now includes Joseph Myers and Richard
Earnshaw, in addition to my co-developers Daniel Brooks and Julien
Rivaud - is diligently testing as it refines the last bits of the
conversion.  

I believe everybody on the crew is now satisfied that we're converging
on a good result.  It helps that we now have a detailed characterization
of the pathological trunk deletion at r184996; most of the conversion
problems radiated from that.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-09 Thread Eric S. Raymond
Bernd Schmidt :
> On 12/9/19 7:19 PM, Joseph Myers wrote:
> > 
> > For any conversion we're clearly going to need to run various validation
> > (comparing properties of the converted repository, such as contents at
> > branch tips, with expected values of those properties based on the SVN
> > repository) and fix issues shown up by that validation.  reposurgeon has
> > its own tools for such validation; I also intend to write some validation
> > scripts myself.
> 
> Would it be feasible to require that both conversions produce the same
> output repository to some degree? Can we just look at release tags and
> require that they have the same hash in both conversions, or are there good
> reasons why the two would produce different outputs?

There are a couple of areas that could produce divergences.

One is the part of the history before SVN was adopted. There's a lot of
weird junk back there, artifacts from the cvs2svn conversion, that can produce
issues like fundamental uncertainty about where a child branch should actually be
rooted on its parent.  Reposurgeon makes choices that are a-priori reasonable
in cases of doubt, but there are edge cases where a different conversion
pipeline could make different ones.

Another is how to translate tags. I don't know what Maxim's scripts do, but 
under reposurgeon a copy commit can have one of two dispositions:

(1) Become a lightweight tag (git reference) if the tag comment looks like 
it was autogenerated and carries no real information.

(2) Become a git annotated tag if we want to preserve the tag metadata (comment,
date stamp)

There's room for a certain amount of artistic license here.
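
The two dispositions can be sketched as a small decision function in Python.
This is an illustration of the idea, not reposurgeon's actual rule. The
pattern used to guess that a comment "looks autogenerated" is an assumption,
modeled on the boilerplate that cvs2svn writes for manufactured commits.

```python
import re

# Assumed heuristic: treat cvs2svn's manufactured-commit boilerplate as
# carrying no real information. A real conversion would tune this.
AUTOGENERATED = re.compile(r"this commit was manufactured by cvs2svn", re.IGNORECASE)


def tag_disposition(comment: str) -> str:
    """Return 'lightweight' for empty or boilerplate comments, else 'annotated'."""
    text = comment.strip()
    if not text or AUTOGENERATED.search(text):
        return "lightweight"
    return "annotated"


choices = [
    tag_disposition(""),
    tag_disposition("This commit was manufactured by cvs2svn to create tag 'x'."),
    tag_disposition("Release tagged after final QA pass"),
]
```

Where exactly to draw the line between boilerplate and a comment worth
keeping is the artistic license mentioned above.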

Most conversions have few enough disputable cases that the differences between
renderings can be reviewed by eyeball. I'm not going to bet that will be true
of this one.  At the scale of this conversion, any form of comparative auditing
is pretty hopeless.  You get your assurance, if you get it, from believing
the correctness of the conversion tool.

Which is a major reason that reposurgeon has a *large* test suite. 98
general operations tests, 55 Subversion test dumps including a rogue's
gallery of metadata perversions gathered from previous conversions,
and a cloud of surrounding auxiliary checks.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-09 Thread Eric S. Raymond
Joseph Myers :
> I think we should fix whatever the remaining relevant bugs are in 
> reposurgeon and do the conversion with reposurgeon being used to read and 
> convert the SVN history and do any desired surgical operations on it.

On behalf of the reposurgeon crew - Julien Rivaud, Daniel Brooks, and myself -
we thank you for that expression of confidence.

We'll do our damnedest to deliver rapidly.  We welcome oversight and
discussion at #reposurgeon on freenode, because we're just the mechanics.
You guys have to make the policy decisions.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-06 Thread Eric S. Raymond
Richard Biener :
> To me, looking from the outside, the talks about reposurgeon doing damage and 
> a rewrite (in the last minute) would fix it doesn't make a trustworthy 
> appearance either ;) 

*shrug* Hard problems are hard.

Every time I do a conversion that is at a record size I have to
rebuild parts of the analyzer, because the problem domain is seriously
gnarly. I'm having to rebuild more than usual this time because the
GCC repo is a monster that stresses the analyzer in particularly
unusual ways.

Reposurgeon has been used for several major conversions, including groff and
Emacs.  I don't mean to be nasty to Maxim, but I have not yet seen *anybody*
who thought they could get the job done with ad-hoc scripts turn out to be
correct.  Unfortunately, the costs of failure are often well-hidden problems
in the converted history that people trip over months and years later.

Experience matters at this.  So does staying away from tools like git-svn that
are known to be bad.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Proposal for the transition timetable for the move to GIT

2019-12-06 Thread Eric S. Raymond
Maxim Kuvyrkov :
> The general conversion workflow is (this really is a poor-man's translator of 
> one DAG into another):
> 
> 1. Parse SVN history of entire SVN root (svn log -qv file:///svnrepo/) and 
> build a list of branch points.
> 2. From the branch points build a DAG of "basic blocks" of revision history.  
> Each basic block is a consecutive set of commits where only the last commit 
> can be a branchpoint.
> 3. Walk the DAG and ...
> 4. ... use git-svn to individually convert these basic blocks.
> 4a. Optionally, post-process git result of basic block conversion using "git 
> filter-branch" and similar tools.
> 
> Git-svn is used in a limited role, and it does its job very well in this role.

Your approach sounds pretty reasonable except for that part. I don't
trust git-svn at *all* - I've collided with it too often during
past conversions.  It has a nasty habit of leaving damage in places
that are difficult to audit.

I agree that you've made a best possible effort to avoid being bitten
by using it only for basic blocks. That was clever and the right thing
to do, and I *still* don't trust it.
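
For readers unfamiliar with the "basic block" idea in step 2 of the quoted
workflow, here is a toy sketch in Python. It is not Maxim's code. It handles
only a single linear run of revisions, and the revision numbers and branch
points are invented.

```python
def basic_blocks(revisions: list[int], branch_points: set[int]) -> list[list[int]]:
    """Split consecutive revisions into blocks so that only the last
    revision of any block can be a branch point."""
    blocks: list[list[int]] = []
    current: list[int] = []
    for rev in revisions:
        current.append(rev)
        if rev in branch_points:
            # A branch point closes the block it belongs to.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Toy history: revisions 1..7 with branches taken at r3 and r6.
blocks = basic_blocks([1, 2, 3, 4, 5, 6, 7], {3, 6})
```

Each resulting block is a straight-line stretch of history, which is why it
can be handed to a per-block converter without that converter having to
reason about branching.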
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-12-05 Thread Eric S. Raymond
Joseph Myers :
> On Thu, 5 Dec 2019, Joseph Myers wrote:
> 
> > On Thu, 5 Dec 2019, Eric S. Raymond wrote:
> > 
> > > Joseph Myers :
> > > > I just tried a leading-segment load up to r14877, but it didn't reproduce
> > > > the problems I see with r14877 in a full repository conversion - it seems
> > > > the combination with something later in the history may be necessary to
> > > > reproduce the issue.
> > > 
> > > Great :-(
> > > 
> > > Well, there's a bisection-like strategy for finding the minimum
> > > leading segment that produces misbehavior.  My conversion crew will
> > > apply it as hard as we need to to get the job done.
> > 
> > I've now provided a reduced synthetic test (7 commits) for the issue 
> > observed at r14877, in issue 172.  It wouldn't surprise me if a fix for 
> > this synthetic test fixes both issues 171 and 172 (and it wouldn't 
> > surprise me if it's fixed in the new SVN dump reader).
> 
> And given the synthetic test I've added to issue 178, I suspect the same 
> problem is behind at least some of the missing file/directory deletions as 
> well.

Likely, yes.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-12-05 Thread Eric S. Raymond
Joseph Myers :
> On Thu, 5 Dec 2019, Eric S. Raymond wrote:
> 
> > Joseph Myers :
> > > I just tried a leading-segment load up to r14877, but it didn't reproduce 
> > > the problems I see with r14877 in a full repository conversion - it seems 
> > > the combination with something later in the history may be necessary to 
> > > reproduce the issue.
> > 
> > Great :-(
> > 
> > Well, there's a bisection-like strategy for finding the minimum
> > leading segment that produces misbehavior.  My conversion crew will
> > apply it as hard as we need to to get the job done.
> 
> I've now provided a reduced synthetic test (7 commits) for the issue 
> observed at r14877, in issue 172.  It wouldn't surprise me if a fix for 
> this synthetic test fixes both issues 171 and 172 (and it wouldn't 
> surprise me if it's fixed in the new SVN dump reader).

If not, I think it soon will be. I expect that little synthetic test
to help a lot.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-12-05 Thread Eric S. Raymond
Joseph Myers :
> I just tried a leading-segment load up to r14877, but it didn't reproduce 
> the problems I see with r14877 in a full repository conversion - it seems 
> the combination with something later in the history may be necessary to 
> reproduce the issue.

Great :-(

Well, there's a bisection-like strategy for finding the minimum
leading segment that produces misbehavior.  My conversion crew will
apply it as hard as we need to to get the job done.
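
The bisection-like strategy can be sketched in Python as an ordinary binary
search. The sketch assumes that once a leading segment is long enough to
reproduce the defect, every longer segment reproduces it too. The predicate
below is a stand-in; a real one would load the first N revisions and check
the result. All numbers are invented.

```python
from typing import Callable


def minimal_failing_prefix(last_rev: int, reproduces: Callable[[int], bool]) -> int:
    """Smallest N such that loading revisions 1..N reproduces the defect.

    Assumes monotonicity: if N reproduces, so does every N' > N.
    """
    if not reproduces(last_rev):
        raise ValueError("defect does not reproduce even on the full history")
    lo, hi = 1, last_rev          # invariant: reproduces(hi) is True
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(mid):
            hi = mid              # mid is long enough; try shorter
        else:
            lo = mid + 1          # mid is too short
    return lo


# Stand-in predicate with a made-up threshold, for demonstration only.
found = minimal_failing_prefix(1000, lambda n: n >= 613)
```

Each probe costs one leading-segment load, so the number of loads grows only
with the logarithm of the history length.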
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-12-05 Thread Eric S. Raymond
Joseph Myers :
> I think we currently have the following reposurgeon issues open for cases 
> where the present code results in incorrect tree contents and we're hoping 
> the new code will fix that (or make it much easier to find and fix the 
> bugs).  These are the issues that are most critical for being able to use 
> reposurgeon for the conversion.
> 
> https://gitlab.com/esr/reposurgeon/issues/167
> https://gitlab.com/esr/reposurgeon/issues/171
> https://gitlab.com/esr/reposurgeon/issues/172
> https://gitlab.com/esr/reposurgeon/issues/178

I'm aware these are the real blockers.

I was much more worried about the conversion before we figured out
that most of the remaining content mismatches seem to radiate out from
something weird that happened at r14877.  That's early enough that a
leading-segment load including it doesn't take forever.  Which means
it's practical to do detailed forensics on the defect even if you don't
have handy an EC2 instance with ridiculo-humongous amounts of RAM.

Now I'm pretty certain we can finish this.  A matter of when, not if.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-12-05 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> Ok, this is one to keep an eye on.  There are a number of anomalous commits
> at present, which Eric is working on with a new approach to replaying the
> SVN data into reposurgeon.  Once that is done we're hoping that this sort of
> problem will go away.

Best case is it just goes away.  Worst case is we'll need to figure out what
surgical commands need to be patched into the recipe to deal with the
remaining anomalies.

I suspect the latter, in particular that we're going to end up needing
to do something manually around r14877.  It might be a trivial tweak to
the splice command I commented out.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




GCC conversion work in progress

2019-12-05 Thread Eric S. Raymond
Those of you with a direct interest in the conversion might want
to watch #reposurgeon on freenode.  This is where Daniel Brooks,
Julien Rivaud and I are working on it.

Here's where the code lives:

reposurgeon: https://gitlab.com/esr/reposurgeon

The conversion recipe: https://gitlab.com/esr/gcc-conversion

In the next few days I expect the remaining problems to move from 
mechanism to policy choices.  At that point, broader review of the
recipe and the conversion progress starts to become desirable.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>

"They that can give up essential liberty to obtain a little temporary 
safety deserve neither liberty nor safety."
-- Benjamin Franklin, Historical Review of Pennsylvania, 1759.


Re: Branch and tag deletions

2019-12-05 Thread Eric S. Raymond
Joseph Myers :
> The avoidance of '.' in branch and tag names is, I'm pretty sure, a legacy 
> of CVS restrictions on valid names for branches and tags.  Those 
> restrictions are not relevant to git or SVN; if picking any new convention 
> it seems appropriate for the tag for GCC 10.1 to say "10.1" somewhere in 
> its name rather than "10_1".

That is correct.  I recommend mapping tags from using "_" to using
".", they're just plain more readable that way.  I have done this in 
previous conversions.
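
A minimal sketch of such a mapping, in Python.  It assumes that only an
underscore sitting between two digits is a CVS-era stand-in for a dot;
that rule and the function name are illustrative, not taken from any
existing recipe:

```python
import re


def dotted_tag(name: str) -> str:
    """Turn CVS-style version underscores into dots.

    Assumption: only an underscore between two digits stands for a dot,
    so a separator such as the one after "ZLIB" is left alone.
    """
    return re.sub(r"(?<=\d)_(?=\d)", ".", name)


if __name__ == "__main__":
    for tag in ("10_1", "ZLIB_1_2_3"):
        print(tag, "->", dotted_tag(tag))
```

Restricting the substitution to digit boundaries is a design choice to
avoid rewriting underscores that are part of a word; a real recipe might
need a different rule.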
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-12-04 Thread Eric S. Raymond
Joseph Myers :
> Eric, can Richard and I get direct write access to the gcc-conversion 
> repository?  Waiting for merge requests to be merged is getting in the way 
> of fast iteration on incremental improvements to the conversion machinery, 
> it should be possible to get multiple incremental improvements a day into 
> the repository.

Sure. I only found one "Richard Earnshaw" and one "Joseph Myers" on Gitlab,
so I have given both Developer access.  I changed the branch protection rules so
Developers can push.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-12-02 Thread Eric S. Raymond
Segher Boessenkool :
> Do we postpone the transition another few months because we have to check
> all commits for mistakes the conversion tool made because it tried to be
> "smart"?
> 
> Or will we rush in these changes, unnecessary errors and all, because
> people have invested time in doing this?
> 
> It is not a decision that can be made late.  It is a *design decision*.

Bear in mind that the tool is continuing to improve.  There are now three
people working on it effectively full-time in response to this conversion.

We will fix the attribution bug. Compared to dealing with dumpfile
malformations that sort of thing is a pretty easy problem once we have
a way to reproduce it.

At this point my only serious worry is what kinds of contortions we'll 
need to go through to get around the effects of the GCC/EGCS merge.
I'll be concentrating on that once I finish debugging the analyzer
rewrite.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-30 Thread Eric S. Raymond
Joseph Myers :
> I did a comparison of git and SVN checkouts to look at missing file 
> problems.  I've now filed reposurgeon issues 171 and 172 for the problems 
> I noted.  Issue 171 relates to handling of trunk deletion / recreation.  
> Issue 172 relates to the first point where missing file problems appear 
> (unless some appeared and then disappeared in the history before then).  
> As it's at a very early point in the GCC history (r14877), hopefully it 
> shouldn't be too hard to track down if your rewrite doesn't fix it, since 
> it shouldn't require loading much of the history to reproduce.  (Roughly, 
> it's at the start of EGCS, i.e. around the point where we spliced together 
> the gcc2 and EGCS CVS histories when converting from CVS to SVN.  So some 
> bits of the history around then may well look weird, but I don't see 
> anything particularly odd about that particular SVN commit.)

Thank you, that is very valuable information to have.

There is probably some odd artifact at the merge point that confuses
my old code.  If we are fortunate, the new code won't be confused.

The old code was brittle and had failures in weird places because I
started on branch analysis and handling of mixed-branch commits too
early.  The new code essentially replays the dump operations into
commits without trying to do branch analysis or mixed-branch
resolution, then does those latter things in separate passes.

We'll know in a day or two, I think. The rewrite is done; I'm
troubleshooting some problems that I *think* are minor but which are
blocking merging to HEAD. 

Once I get the new analyzer passing regressions I'll do a read-limited
conversion up to r14900 and see what's up.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-27 Thread Eric S. Raymond
Joseph Myers :
> One more observation on that: in my last test conversion, deleting the 
> emptycommit-* tags took over 7 hours (i.e. the bulk of the time for the 
> conversion was spent just deleting those tags).  Deleting tags matching 
> /-root$/ took about half an hour.  So I think there is a performance issue 
> somewhere with (some cases of) tag deletion by regexp, at least when the 
> regexp matches a large number of tags (but some other bulk deletions seem 
> to run much quicker per tag).  Taking a few seconds per tag is fine for an 
> individual deletion, but a problem when you want to delete 4070 tags at 
> once.

File that as an issue, please. Go has very good profiling tools, finding
the hotspot(s) in situations like this is easy and thus we should be able to
fix this quickly when it reaches the top of the priority list.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-27 Thread Eric S. Raymond
Joseph Myers :
> > I'm more worried about missing files. I saw a bunch of those on my
> > last test.  This could be spurious - the elaborate set of branch
> > mappings you specified confuses my validation test, because there is
> > no longer a 1-1 correspondence between Subversion and git branches.
> 
> I'm hoping any such missing file problems come from bugs in the old SVN 
> dump reader with complicated commits mixing copies / deletions / 
> replacements with copies from other locations and that your rewrite will 
> fix the semantics in such cases.

Also possible.  

The old code was a hairball. The new code is a bunch of relatively simple
sequential passes - 10 so far, final version likely to have 12 or 13 - with
well-defined preconditions and exit contracts. If nothing else this is
going to make troubleshooting any remaining defects much easier.
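
To illustrate the shape of that design (not the actual reposurgeon code,
which is in Go), here is a toy Python sketch of sequential passes that
each check a precondition and an exit contract.  The pass names, the
state layout and the "top" labelling rule are all made up for the
example:

```python
from typing import Callable, List, NamedTuple

State = dict


class Pass(NamedTuple):
    name: str
    precondition: Callable[[State], bool]
    run: Callable[[State], State]
    exit_contract: Callable[[State], bool]


def run_passes(state: State, passes: List[Pass]) -> State:
    """Run passes in order, failing loudly at the first broken contract."""
    for p in passes:
        if not p.precondition(state):
            raise AssertionError(f"{p.name}: precondition not met")
        state = p.run(state)
        if not p.exit_contract(state):
            raise AssertionError(f"{p.name}: exit contract violated")
    return state


def replay(state: State) -> State:
    # One commit per dump revision; no branch analysis at this stage.
    commits = [{"rev": rev, "paths": paths} for rev, paths in state["dump"]]
    return {**state, "commits": commits}


def label_top(state: State) -> State:
    # A separate, later pass: label each commit with the top-level
    # directory of its first path (a stand-in for real branch analysis).
    commits = [{**c, "top": c["paths"][0].split("/")[0]}
               for c in state["commits"]]
    return {**state, "commits": commits}


PASSES = [
    Pass("replay", lambda s: "dump" in s, replay,
         lambda s: len(s["commits"]) == len(s["dump"])),
    Pass("label_top", lambda s: "commits" in s, label_top,
         lambda s: all("top" in c for c in s["commits"])),
]
```

The payoff is the one described above: when something goes wrong, the
failing contract names the pass where the defect first shows up.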

> All the current gcc-conversion merge requests, both mine and Richard's, 
> should now be set to allow rebasing.

They were, and are all merged now, except for one that Richard just landed. 
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-27 Thread Eric S. Raymond
Joseph Myers :
> My current test conversion run is testing two changes: deleting 
> emptycommit tags, and using --user-ignores to prefer the .gitignore file 
> in SVN over one auto-generated from svn:ignore properties.  For the next 
> one after that I'll try eliminating all branch/tag removals that shouldn't 
> be doing anything, based on the current sets of branches and tags in SVN, 
> and report bugs if I see anything appearing in the converted repository 
> that shouldn't be.

I'm more worried about missing files. I saw a bunch of those on my
last test.  This could be spurious - the elaborate set of branch
mappings you specified confuses my validation test, because there is
no longer a 1-1 correspondence between Subversion and git branches.

The next test I run I'm going to comment out your branch mappings.
If I get a validated conversion out of that I think it's all over
but the cleanup and policy tinkering.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-27 Thread Eric S. Raymond
Joseph Myers :
> On Wed, 27 Nov 2019, Maxim Kuvyrkov wrote:
> 
> > IMO, we should aim to convert complete SVN history frozen at a specific 
> > point.  So that if we don't want to convert some of the branches or tags 
> > to git, then we should delete them from SVN repository before 
> > conversion.
> 
> Sure, we could do that.  Eric, can you confirm that, with current 
> reposurgeon, if a branch or tag was deleted in SVN and does not appear in 
> the final revision of /branches or /tags, it should not appear in the 
> resulting converted repository, so that any cases where reposurgeon fails 
> to reflect such a deletion-in-SVN should be reported as a reposurgeon bug?

Confirmed.

The ontological mismatch between the Subversion and Git data models actually
*forces* us to pick a preferred view and discard tags and branches
that are not visible from that view. For obvious reasons reposurgeon chooses
the view backwards from the end of the history, so it will be the most
recent incarnation of each tag and branch that you see.

There is an alternative, the --nobranch conversion. This would preserve the
entire historical structure, including deleted tags and branches, but the
cost is that the conversion doesn't have git tags and branches itself - 
it's just one big directory history on /refs/heads/master.  While this is useful
for forensics, it is not a conversion you'd want to use for production.

> And that the same applies where a branch or tag was renamed - that only 
> the new name, not the old one, should appear in the converted repository? 

Confirmed, see above.

> There are quite a few deletions in gcc.lift for tags that do not actually 
> appear in /tags in the current SVN repository, but I'm not sure how many 
> are actually relevant with current reposurgeon.

Many will not be.  The recipe file predates the point at which I came
to fully understand the ramifications of tag delete/recreate sequences.
I haven't cleaned it up yet because chasing down the last few bugs in
the analyzer is more important.

I'll leave it to you guys to discuss the policy issues.  In general I
think you can safely throw out branchpoint tags and emptycommits;
reposurgeon only preserves those on the theoretical chance that there
might be something interesting in the change comments.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-26 Thread Eric S. Raymond
Joseph Myers :
> Thanks.  We've accumulated a lot of merge requests on the gcc-conversion 
> repository, once those are merged I'll test a further change to remove 
> those tags.

I just checked; a rebase button appeared on your MRs and I merged all
three, but no rebase option occurs on Richard Earnshaw's requests.

The GitLab interface seems fickle and arbitrary at times.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-26 Thread Eric S. Raymond
Joseph Myers :
> A further note: in a previous run of the conversion I didn't see any 
> emptycommit-* tags.  In my most recent conversion run, I see 4070 such 
> tags.  How do I tell reposurgeon never to create such tags?  Or should I 
> add a tag deletion command for them in gcc.lift, once tag deletion is 
> working reliably?

That's what tag deletion by regexp is for.

One of reposurgeon's design rules is "never add a special-purpose switch
or flag when an application of the selection-set minilanguage will do."
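
As a language-neutral illustration of the idea (select by pattern, then
apply one generic operation), here is a small Python sketch.  It is not
reposurgeon syntax; the function name and the tag names in the usage
example are hypothetical:

```python
import re
from typing import Dict, List


def delete_tags(tags: Dict[str, str], pattern: str) -> List[str]:
    """Delete every tag whose name matches pattern; return deleted names."""
    rx = re.compile(pattern)
    doomed = [name for name in tags if rx.search(name)]
    for name in doomed:
        del tags[name]
    return doomed


if __name__ == "__main__":
    # Made-up tag names mapping to made-up commit ids.
    tags = {"emptycommit-1": "c1", "emptycommit-2": "c2", "keep-me": "c3"}
    print(delete_tags(tags, r"^emptycommit-"), sorted(tags))
```

Collecting the matching names first and deleting afterwards avoids
mutating the mapping while iterating over it.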
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Branch and tag deletions

2019-11-25 Thread Eric S. Raymond
Joseph Myers :
> I'm looking at the sets of branches and tags resulting from a GCC 
> repository conversion with reposurgeon.
> 
> 1. I see 227 branches (and one tag) with names like 
> cxx0x-concepts-branch-deleted-r131428-1 (this is out of 780 branches in 
> total in a conversion of GCC history as of a few days ago).  Can we tell 
> reposurgeon not to create such branches (and tags)?  I can't simply do 
> "branch /-deleted-r/ delete" because that command doesn't take a regular 
> expression.

Those dead branches were supposed to never be visible in the final
conversion.

They arise when a tag is created, then deleted, then recreated under
the same name. The dumpfile operations for the old tag can't simply be
ignored, as part of its content could get copied forward from before
the delete to a branch that remains live.  So I recolor them, then
have logic to skip generating commits and tags from them. You're
seeing some leak through those guards, which is a bug.

I'm using a different and much simpler strategy in the analyzer rewrite;
this bug should be squashed when it lands.

> 2. gcc.lift has a series of "tag  delete" commands, generally 
> deleting tags that aren't official GCC releases or prereleases (many of 
> which were artifacts of how creating such tags was necessary to track 
> merges in the CVS and older SVN era).  But some such commands are 
> mysteriously failing to work.  For example I see
> 
> tag /ZLIB_/ delete
> reposurgeon: no tags matching /ZLIB_/
> 
> but there are tags ZLIB_1_1_3, ZLIB_1_1_4, ZLIB_1_2_1, ZLIB_1_2_3 left 
> after the conversion.  This isn't just an issue with regular expressions; 
> I also see e.g.
> 
> tag apple/ppc-import-20040330 delete
> reposurgeon: no tags matching apple/ppc-import-20040330
> 
> and again that tag exists after the conversion.

I knew there was a problem with those, but I have not diagnosed it
yet.  I know generally where it has to be and think it will be
relatively easy to clean up once I've dealt with the more pressing
issues.

Please file issues about these bugs so I can track them.

On the first one, it would be helpful if you could list some tags
that these match expressions fail to pick up from as early as
possible. Shortening the leading segment I need to load speeds up 
my test cycle significantly.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Split commit naming

2019-11-25 Thread Eric S. Raymond
Joseph Myers :
> My question is: is it a stable interface to reposurgeon that the portions 
> of such a split commit will always be numbered in lexicographical order by 
> branch name (or some other such well-defined stable ordering), so I can 
> write <80870.2> in gcc.lift and know that some reposurgeon change won't 
> accidentally make that refer to the portion of the commit on 
> gcc-3_3-branch instead?

Your timing is fortuitous, as I just finished rewriting the code for 
mixed-commit handling and it is fresh in my mind.

The old behavior was indeed that cliques were lexicographically ordered
by branch.  This was not documented.  The master branch still uses the
old code.

Current behavior on my development branch is that fileops are not
sorted before splitting; you get whatever order they had in the dump.
I will change this so they are sorted by pathname and document that.

And...it's done.
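
A rough Python sketch of why sorting first makes the part numbers
stable.  The numbering rule shown (parts numbered from 1 in order of
each branch's first sorted path), the helper names and the paths are
assumptions for illustration, not a description of the real code:

```python
from typing import Dict, List, Tuple


def branch_of(path: str) -> str:
    """Crude branch guess: 'trunk', or the first two path components."""
    parts = path.split("/")
    return parts[0] if parts[0] == "trunk" else "/".join(parts[:2])


def split_commit(rev: int, paths: List[str]) -> Dict[str, Tuple[str, List[str]]]:
    """Split a mixed-branch commit into per-branch parts named rev.N."""
    parts: Dict[str, List[str]] = {}
    for path in sorted(paths):  # sort by pathname before splitting
        parts.setdefault(branch_of(path), []).append(path)
    return {f"{rev}.{n}": (branch, files)
            for n, (branch, files) in enumerate(parts.items(), start=1)}


if __name__ == "__main__":
    dump_order = ["trunk/gcc/b.c", "branches/foo/gcc/a.c", "trunk/gcc/a.c"]
    for name, (branch, files) in split_commit(1000, dump_order).items():
        print(name, branch, files)
```

Because the paths are sorted before grouping, shuffling the order of
fileops in the dump cannot change which part gets which number.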

You won't see the new code for a few days, until I finish the analyzer
rewrite.  The old code had become overgrown and brittle; I spent a
week trying to find a strategy to get around a particular
pathological-tag defect only to discover that I could no longer
modify the analyzer without cascade bugs.

I'll describe the problem, since I think the GCC repository has some
of these and they may explain some of your earlier bug reports.

Suppose you create a tag, then later on modify the tag copy by
copying to one of its subdirectories.  When translating to git
you want to attach the tag reference to the revision the *second*
copy came from.  Simple in concept but the obvious implementation
of root-finding prefers the earliest copy.
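
In toy form, the difference between the two choices looks like this.
The record layout, function names and revision numbers are invented for
the illustration:

```python
from typing import List, NamedTuple


class Copy(NamedTuple):
    at_rev: int    # revision in which the copy was made
    from_rev: int  # revision the content was copied from
    dest: str      # destination path of the copy


def copies_into(tag: str, copies: List[Copy]) -> List[Copy]:
    """Copies that target the tag directory or one of its subdirectories."""
    return [c for c in copies if c.dest == tag or c.dest.startswith(tag + "/")]


def naive_root(tag: str, copies: List[Copy]) -> int:
    # The "obvious" implementation: root at the earliest copy.
    return min(copies_into(tag, copies), key=lambda c: c.at_rev).from_rev


def wanted_root(tag: str, copies: List[Copy]) -> int:
    # What the conversion should do: root at the latest copy.
    return max(copies_into(tag, copies), key=lambda c: c.at_rev).from_rev


HISTORY = [
    Copy(at_rev=100, from_rev=99, dest="tags/v1"),       # tag created
    Copy(at_rev=120, from_rev=118, dest="tags/v1/sub"),  # tag modified later
]
```

With this history the naive rule roots the tag at r99, while the wanted
rule roots it at r118, the source of the second copy.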

When it proved impossible to change this without producing a cascade of
breakage, I faced up to the necessity of a scrap-and-rebuild.  It's
not done yet, but it's pretty well advanced.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-21 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> > But then I get errors:
> > 
> > *** Unknown syntax: relax
> > 
> 
> Change that to
> 
> set relax

Oops.  He's right.  It used to be a command, but that changed recently
as part of a redesign of log levels and options.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-21 Thread Eric S. Raymond
Joseph Myers :
> I see the changelogs issue is fixed (I can run a conversion past that 
> point on a system with 128GB memory, with mergeinfo processing being very 
> slow as described by Richard).  But then I get errors:
> 
> *** Unknown syntax: relax

Missing "relax" command probably means your reposurgeon is very old.
What does "version" say?
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-19 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> Nope, that was from running the go version from yesterday.  This one, to
> be precise:  1ab3c514c6cd5e1a5d6b68a8224df299751ca637
> 
> This pass used to be very fast a couple of weeks back, but something
> went in recently that's caused a major slowdown.
> 
> Oh, and I've been having problems with the ChangeLogs command as well.
> It used to run fine on my machine (128G), but now it's started blowing
> memory and taking my X server down.

That sucks.  Those were stretches of code the two guys working with me
have been trying to speed up. Looks like that backfired.

Please file issues at https://gitlab.com/esr/reposurgeon/issues and
include timing reports if you can.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-19 Thread Eric S. Raymond
3,278016-278017,278019-278028,278032-278035,278038,278041,278044-278049,278051,278053-278058,278062,278064-278066,278068-278070,278074-278091,278093-278096,278098-278107,278111-278129,278131-278142,278144-278153,278156-278157,278159,278179,278184-278185,278189-278196,278199-278200
> 
> and in the conversion we get about 35 links back to different revisions
> in trunk.
> 
> I don't know if the SHA codes are stable, but in my conversion, done
> last night, it comes out at 44b84e63a8b00b9881fbb93d3af1536c2338aa72
> 
> There's another example at r20 on the same branch, which has even
> more links.
> 
> R.

File an issue here, please.

https://gitlab.com/esr/reposurgeon/issues
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-19 Thread Eric S. Raymond
Jason Merrill :
> Well, I was thinking of also giving some clue of what the commit was
> about.  One possibly cut-off line accomplishes that, a simple revision
> number not so much.

It's conventional under Git to have comments lead with a summary sentence.

I think you're going to find that the value of Subversion revision references
fades pretty fast after the conversion. That has been my experience with
other conversions.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-19 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> I was looking at the reposurgeon code last night, and I think I can see what
> the problem *might* be, but I haven't had time to produce a testcase.
> 
> Some of our commits have mergeinfo that looks a bit like this:
> 
> 202022-202023,202026,202028-202029,202036,202039-202041,202043-202044,202048-202049,202051-202056,202058-202061,202064-202065,202068-202071,202077,202079-202082,202084,202086-202088,202092-202104,202106-202113,202115-202119,202121,202124-202134,202139,202142-202146,202148-202150,202153-202154,202158-202159,202163-202165,202168,202172,202174,202179-202180,202184-202192,202195,202197,202202-202208,202225-202230,202232-202233,202237-202239,202242,202244-202245,202247,202250-202251,202258-202264,202266,202269,202271-202275,202279,202281-202282,202284,202286,202289-202292,202296-202299,202301-202302,202305,202309,202311-202323,202327-202335,202337,202339,202343-202346,202350,202352,202356-202357,202359-202360,202363-202371,202373-202374,202377,202379-202382,202384,202389,202391-202395,202398-202407,202409,202411,202416-202418,202421
> 
> which is a massive long list with a number of holes in it.
> 
> But I suspect the holes are really commits to other branches and that in the
> above describes a linear chain along one branch.  If so, rather than
> producing links to each subgroup (and perhaps dropping single non-list
> elements, the description can be mapped back to a contiguous sequence of
> commits down a branch and thus should really resolve to a single child being
> used for the merge source.  At present, I think for the above we're seeing a
> child reference created for each subrange in that list.

I have no doubt you are correct. Detecting such interrupted ranges is
going to be...  interesting.
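
One possible way to frame the check, sketched in Python.  It assumes we
already know which revisions touched the source branch, treats a range
list as "interrupted but linear" when every branch revision inside its
span is covered, and ignores mergeinfo decorations such as
non-inheritable markers.  The function names and the branch revision
lists in the usage example are hypothetical:

```python
from typing import Iterable, Optional, Set


def parse_ranges(text: str) -> Set[int]:
    """Expand a list like '5-7,9' into the set {5, 6, 7, 9}."""
    revs: Set[int] = set()
    for item in text.split(","):
        lo, _, hi = item.partition("-")
        revs.update(range(int(lo), int(hi or lo) + 1))
    return revs


def single_merge_source(mergeinfo: str,
                        branch_revs: Iterable[int]) -> Optional[int]:
    """Return the one revision to merge from, or None for a true cherry-pick.

    The holes in the range list are harmless if no revision of the source
    branch falls into them; then the last merged branch revision can stand
    for the whole chain.
    """
    merged = parse_ranges(mergeinfo)
    lo, hi = min(merged), max(merged)
    on_branch = sorted(r for r in branch_revs if lo <= r <= hi)
    if on_branch and all(r in merged for r in on_branch):
        return on_branch[-1]
    return None


if __name__ == "__main__":
    info = "202022-202023,202026,202028-202029"
    print(single_merge_source(info, [202022, 202023, 202026, 202028, 202029]))
    print(single_merge_source(info, [202022, 202027, 202029]))
```

In the first call the holes (202024, 202025, 202027) belong to other
branches, so one link suffices; in the second, 202027 is on the source
branch but was not merged, so the list describes a genuine gap.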

> Incidentally, the mergeinfo pass on the gcc repo is currently taking about 8
> hours on my machine, that's 80-90% of the entire conversion time.  But it
> might be related to the above.

You must be running the old Python code, there was an O(n**2) in that
phase that has since been fixed. Try the Go code from
https://gitlab.com/esr/reposurgeon; it is *much* faster.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-19 Thread Eric S. Raymond
Joseph Myers :
> I think the main thing to make sure of in the conversion regarding that 
> issue is that cherry-picks do *not* turn into merge commits

I confirm that this is how it now works.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-19 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> Well a lot of that is a property of the conversion tool.  git svn does a
> relatively poor job of anything other than straight history (I believe it
> just ignores the non-linear information).

Yes, git-svn does a *terrible* job on anything other than linear history.

That is a major reason I'm busting my hump to get the conversion done.
It would be very sad if you guys fell into using that.  It does a
tolerable job of live gatewaying on simple histories, but read this:

http://esr.ibiblio.org/?p=6778

>   I don't believe any tool can
> recreate information for cherry-picking unless it's recorded in the SVN
> meta-data.  Eric would be better placed to comment here.

You are correct, there is nothing practical that can be done in the absence
of svn:mergeinfo and svnmerge-integrated properties.

> My own observation is that when the SVN commits have merge meta-data,
> reposurgeon will pick this up and create links across to the relevant
> branches.  It does, however seem to create far more links than a traditional
> git merge would do, especially when a sequence of commits are referenced.  I
> don't know if that's essentially unfixable, or if it's something Eric
> intends to work on; but I've seen some cases where there are dozens of links
> back to a simple sequence of svn commits and where, I suspect, a single link
> back to the most recent of that sequence would be all that's really wanted.

First I have heard of this.

The intent of the present mergeinfo handling is that it looks for
mergeinfo declarations that are topologically equivalent to branch
merges (that is, they merge all revisions on a source branch rather
than cherry-picking isolated revisions) and renders those as
gitspace merge links.  There is no attempt to create links
corresponding to Subversion cherry picks, as this does not fit
the Git DAG model.

I have cases that demonstrate this feature working in my test suite,
but they are relatively small and artificial. I would not describe
my mergeinfo handling as well-tested compared to the rest of the
analyzer, and I can thus easily believe your bug report.

What I need to troubleshoot this is a test case that is not trivial
but of a manageable size - over a couple hundred commits the volume
of diagnostics just overwhelms a Mark One Eyeball.  

Many of my test cases were trimmed to that size by doing stripping and
topological reduction on real repositories; I have a tool for this.
Do you have a real repository in mind I can start with?  The whole gcc
history is too huge, but if you were able to tell me that the bug is
exhibited within a few thousand commits of origin and point at where,
that I could work with.

An issue filed on the reposurgeon tracker would be appreciated.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-08 Thread Eric S. Raymond
Richard Earnshaw (lists) :
> Which makes me wonder if, given a commit log of the form:
> 
> 
> 2019-10-30  Richard Biener  
> 
> PR tree-optimization/92275
> * tree-vect-loop-manip.c (slpeel_update_phi_nodes_for_loops):
> Copy all loop-closed PHIs.
> 
> * gcc.dg/torture/pr92275.c: New testcase.
> 
> Where the first line is a ChangeLog style date and author, we could spot the
> PR line below that and hoist it up as a more useful summary (perhaps by
> copying it rather than moving it).
> 
> It wouldn't fix all commits, but even just doing this for those that have
> PRs would be a help.

Speaking from lots of experience with converting old repositories that
exhibited similar comment conventions, I would be nervous about trying
to do this entirely mechanically.  I think the risk of mangling text
that is not formatted as you expect - and not noticing that until the
friction cost of fixing it has escalated - is rather high.

On the other hand, reposurgeon allows a semi-mechanized attack on
the problem that I think would work well, because I've done similar
things in other conversions.

There's a pair of commands that allow you to (a) extract comments from
a range of commits into a message list that looks like an RFC822
mailbox file, (b) modify those comments, and (c) weave the message
list reliably back into the repository.

If it were me doing this job, I'd write a reposurgeon command that
extracts all the comments containing PR strings into a message box.
Then I'd write an Emacs macro that moves to the next message and
hoists its PR line.
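
For concreteness, the per-message transformation such a macro performs
might look like the following Python sketch.  The regular expression and
the choice to copy the PR line rather than move it are assumptions, and
the result is meant to be eyeballed, not trusted blindly:

```python
import re

# Assumed shape of a PR line: "PR component/number" alone on a line.
PR_LINE = re.compile(r"^\s*PR\s+\S+/\d+\s*$")


def hoist_pr(comment: str) -> str:
    """Copy the first PR line to the top as a summary, followed by a blank line.

    Comments with no recognizable PR line are returned unchanged.
    """
    for line in comment.splitlines():
        if PR_LINE.match(line):
            return line.strip() + "\n\n" + comment
    return comment


if __name__ == "__main__":
    sample = (
        "2019-10-30  Richard Biener\n"
        "\n"
        "PR tree-optimization/92275\n"
        "* tree-vect-loop-manip.c (slpeel_update_phi_nodes_for_loops):\n"
        "Copy all loop-closed PHIs.\n"
    )
    print(hoist_pr(sample))
```

Anything the pattern does not recognize is left alone, which is exactly
the class of message a human pass should look at.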

Then I'd walk through the comments applying the macro and keeping an eye on 
them for cases where the macro doesn't do quite the right thing and
using undo and hand-editing to recover.  Human eyes are very good at
spotting anomalies in an expected flow of text, and once you've gotten
into the rhythm of a task like this it is easily possible to filter
approximately a message per second. In round numbers, providing
the anomaly rate isn't high, that's upwards of 3000 messages per hour.

The point is that for this kind of task a human being who understands
what he's reading is likely to have a lower rate of mangling errors than
a program that doesn't.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Fixing cvs2svn branchpoints

2019-11-07 Thread Eric S. Raymond
Joseph Myers :
> Which mid-branch deletes?  For the ones by accident (e.g. the deletions of 
> trunk), where the branch was recreated by copying from the pre-deletion 
> version of the same branch, nuking the deletes is clearly right.  For the 
> ones where a branch was deleted then recreated as a copy not from the 
> deleted version - essentially, rebasing done in SVN - maybe we need 
> community discussion of the right approach.  (There are two plausible 
> approaches there - either just discard all the deleted versions that 
> aren't part of the SVN history of the most recent creation of the branch, 
> which makes the list of commits in the branch's history in git look 
> similar to what it looks like in SVN, or treat deletion + recreation in 
> that case as some kind of merge.)

To get content right, reposurgeon has to run through all nodes looking for
branches with more than one creation.  For each such clique, it has to change
all instances but the last so that the branch has a unique nonce name,
then run forward and patch all copy references to each branch to use
the nonce name.

Only the last branch in each clique will be visible (and not renamed)
in the git conversion.  But the earlier branches can't simply be
nuked, as they might be (and typically are) referenced by branch
copies done before the final branch in the clique was created.
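
A toy model of that recoloring, in Python.  It only tracks branch
creations, and the nonce naming scheme ("-nonce-N"), the record layout
and the branch names are invented for the example; the real analyzer
works on dump nodes and is considerably more involved:

```python
from collections import Counter
from typing import List, Optional, Tuple

Creation = Tuple[int, str, Optional[str]]  # (revision, branch, copied-from branch)


def recolor(creations: List[Creation]) -> List[Creation]:
    """Rename all but the last creation of each branch to a nonce name.

    Copy sources are resolved against whichever incarnation was live at
    the time of the copy, so earlier incarnations stay reachable.
    """
    remaining = Counter(branch for _, branch, _ in creations)
    seen: Counter = Counter()
    live = {}  # branch path -> name of its currently live incarnation
    result = []
    for rev, branch, source in creations:
        resolved = live.get(source, source)  # resolve before renaming
        remaining[branch] -= 1
        seen[branch] += 1
        last = remaining[branch] == 0
        name = branch if last else f"{branch}-nonce-{seen[branch]}"
        live[branch] = name
        result.append((rev, name, resolved))
    return result


HISTORY = [
    (1, "trunk", None),
    (10, "branches/foo", "trunk"),
    (20, "branches/bar", "branches/foo"),  # copied from the first foo
    (30, "branches/foo", "trunk"),         # foo deleted, then recreated
]
```

Here branches/bar keeps pointing at the first incarnation of foo under
its nonce name, and only the r30 incarnation retains the plain name.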

This might sound like it will get the special case of a trunk
delete/recreate wrong.  But when git imports a stream it does its own 
branch recoloring based on tip resets and parent-child-relationships; 
we can expect trunk to be (effectively) re-colored back to the root commit.

(This whole mess around branch re-creation is something other
conversion tools don't even try to get right.)

The other case - where you delete a target branch and copy a different
source branch over it - is simpler.  Because branch names in the
git conversion are controlled by the SVN repository pathname (root becomes
master, branches/foo becomes branch foo, etc), this looks exactly like
an ordinary modification of the target branch.

Presently, the fact of the copy is not recorded in the DAG. I could express 
it as a git merge link; that wouldn't be difficult.

> > Also please open reposurgeon issues about the svnmerge properties
> 
> As I understand it, support for that has now been implemented.

It has, yes.

> > and the missing documentation.
> 
> https://gitlab.com/esr/reposurgeon/issues/151 filed - it's a lot more than 
> just reparent for which documentation appears to have disappeared.

A large chunk of the section on surgical commands vanished, probably
due to a finger error while I was editing the translation.  I have
restored it.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Commit messages and the move to git

2019-11-07 Thread Eric S. Raymond
Jeff Law :
> On 11/4/19 3:29 AM, Richard Earnshaw (lists) wrote:
> > With the move to git fairly imminent now it would be nice if we could
> > agree on a more git-friendly style of commit messages; and, ideally,
> > start using them now so that the converted repository can benefit from
> > this.
> > 
> > Some tools, particularly gitk or git log --oneline, can use one-line
> > summaries from a commit's log message when listing commits.  It would be
> > nice if we could start adopting a style that is compatible with this, so
> > that in future commits are summarized in a useful way.  Unfortunately,
> > some of our existing commits show no useful information with tools like
> > this.
> I'd suggest we sync policy with glibc.  They're further along on the
> ChangeLog issues.  Whatever they do in this space we should follow --
> aren't we going to be using some of their hooks/scripts?

Note that my reposurgeon conversion recipe runs gitify on the repository.

From the documentation:

Attempt to massage comments into a git-friendly form with a blank
separator line after a summary line.  This code assumes it can insert
a blank line if the first line of the comment ends with '.', ',', ':',
';', '?', or '!'.  If the separator line is already present, the comment
won't be touched.

Takes a selection set, defaulting to all commits and tags.
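
Read literally, the rule in that description amounts to something like
the following Python sketch.  This is a paraphrase of the documented
behavior for a single comment, not the actual implementation, and the
function name is made up:

```python
SUMMARY_ENDINGS = (".", ",", ":", ";", "?", "!")


def gitify_comment(comment: str) -> str:
    """Insert a blank separator line after a punctuated first line.

    Left untouched: one-line comments, comments that already have the
    separator, and comments whose first line does not end in one of the
    listed punctuation marks.
    """
    lines = comment.split("\n")
    if len(lines) < 2 or lines[1].strip() == "":
        return comment
    if lines[0].rstrip().endswith(SUMMARY_ENDINGS):
        return lines[0] + "\n\n" + "\n".join(lines[1:])
    return comment


if __name__ == "__main__":
    print(gitify_comment("Fix the frobnicator.\nMore detail here."))
```

The conservative punctuation test is what keeps the pass from splitting
a sentence that merely wraps onto the second line.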
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Fixing cvs2svn branchpoints

2019-11-02 Thread Eric S. Raymond
Joseph Myers :
> And here are corresponding lists of tags where the commit cvs2svn 
> generated for the tag should be reparented.

Make that issue 2, please.  Also, open an issue 3 about how you want those
mid-branch deletes handled.  I agree that the right thing is just to nuke
them, but I have a lot of plates in the air right now...

Also please open reposurgeon issues about the svnmerge properties and the
missing documentation.  I might get to the svnmerge thing today, it
should be a trivial tweak.

The repository comparison is still grinding.  It has turned up some
content mismatches, fewer than last time, most in trunk/libgo.

The reason for the "fewer" is that the Go version has learned how to
correctly handle a corner case the Python did not - tag/branch delete
followed by a recreation at a different root point.  That's why this
is commented out in the lift script:

# Squash accidental trunk deletion and recreation.
# Should no longer be needed due to branch recoloring.
#<130803.1>,<138077>,<184996.1> squash

I used to have to find defects like that by hand and patch them. Now
there's a recoloring phase where branches and tags with multiple
creations are handled by renaming all but the last such branch in each
clique to a unique nonce name.  This makes all the results from branch
copies come out right, and none of the nonce names are ever visible in
the final conversion.

I'll go dive into the defect analysis now.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Re: Fixing cvs2svn branchpoints

2019-10-31 Thread Eric S. Raymond
Joseph Myers :
> Here are complete lists of reparentings I think should be done on the 
> commits that start branches, along with my notes on branches with messy 
> initial commits but where I don't think any reparenting should be done.  
> The REPARENT: lines have the meaning I described in 
> <https://gcc.gnu.org/ml/gcc/2019-10/msg00127.html>.

Please leave this as an issue on the gcc-conversion bugtracker.

Your timing is interesting.  Happens I got my first full conversion
with the Go port of reposurgeon earlier today.  I'm trying to verify
the conversion against the Subversion repository, but a full checkout
filled a filesystem on the EC2 instance I'm using. Recovery is
underway.

I'll do real benchmarks when I'm not staring at a deadline, but the
Go port is at least 20x faster than the Python was.  That makes
the conversion practical, though it turns out the 128GB on my 
desktop machine isn't enough to support it - hence the EC2 instance.

The first full conversion took eight hours.  Turns out the single most
computationally expensive part of the surgery is data-mining ChangeLog
files for commit attributions.  Today I threw massive parallelism at
the problem, which is far easier to do in Go than in Python.  I think
that might cut as much as two hours from the next run.
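
A rough sketch of what fanning ChangeLog mining out over goroutines can
look like. It assumes attributions come from GNU-style entry headers
("date  name  <email>"); the names headerRE, mine, and mineAll are
hypothetical and this is not reposurgeon's code.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// headerRE matches a GNU-style ChangeLog entry header such as
// "2019-10-31  Jane Hacker  <jane@example.com>".
var headerRE = regexp.MustCompile(
	`(?m)^\d{4}-\d{2}-\d{2}[ \t]+(.+?)[ \t]+<([^>\n]+)>[ \t]*$`)

// mine returns "Name <email>" for each entry header in one ChangeLog.
func mine(text string) []string {
	var out []string
	for _, m := range headerRE.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1]+" <"+m[2]+">")
	}
	return out
}

// mineAll runs mine over every ChangeLog concurrently, with at most
// workers goroutines active at once. Each goroutine writes only to its
// own slot in results, so no locking is needed.
func mineAll(logs []string, workers int) [][]string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, len(logs))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, text := range logs {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, text string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = mine(text)
		}(i, text)
	}
	wg.Wait()
	return results
}

func main() {
	logs := []string{
		"2019-10-31  Jane Hacker  <jane@example.com>\n\n\t* foo.c: Fix.\n",
		"2019-11-02  Joe Coder  <joe@example.org>\n\n\t* bar.c: New.\n",
	}
	for i, names := range mineAll(logs, 4) {
		fmt.Println(i, names)
	}
}
```

The bounded semaphore keeps memory use predictable when there are far
more ChangeLog blobs than cores.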

By going to the cloud I've gotten a larger working-set capacity at the
cost of some memory-access speed.  Didn't want to do that, but 
your repo is just too damn big for it to be otherwise, unless somebody
wants to drop cash on me to double the RAM in the Great Beast.

Your pile of requests is tricky but should be doable.

You had previously written:

>There are also cases where cvs2svn found a good branchpoint, but
>represented the branch-creation commit in a superfluously complicated
>way, replacing lots of files and subdirectories by copies of different
>revisions.

Yes, reposurgeon has logic to detect and deal with this automatically.
The assumption it makes is that the branch should root to the most
recent revision that CVS did a copy from. This is simple and seems to
give satisfactory results.
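
As a sketch of that rooting rule only: given the copy operations that
make up a messy branch-creation commit, pick the highest source
revision. The copyOp type and branchRoot function are hypothetical
names, not reposurgeon's internals.

```go
package main

import "fmt"

// copyOp is one file or directory copy inside a branch-creation commit.
// Hypothetical type for illustration.
type copyOp struct {
	path    string // destination path on the new branch
	fromRev int    // source revision the copy was taken from
}

// branchRoot returns the most recent source revision among the copies,
// or 0 if the commit contains no copies at all.
func branchRoot(copies []copyOp) int {
	root := 0
	for _, c := range copies {
		if c.fromRev > root {
			root = c.fromRev
		}
	}
	return root
}

func main() {
	copies := []copyOp{
		{"branches/foo", 1200},
		{"branches/foo/gcc/tree.c", 1187},
		{"branches/foo/libiberty", 1195},
	}
	fmt.Println("root branch at revision", branchRoot(copies))
}
```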

Which reminds me. I found a bunch of "svnmerge-integrated" properties
in the history. Should I treat those as though they were mergeinfo
properties and make branch merges from them?
-- 
Eric S. Raymond <http://www.catb.org/~esr/>




Go reposurgeon is production ready

2019-10-02 Thread Eric S. Raymond
Today I retired the original Python version of the reposurgeon code.

I plan to spend the next couple of days fixing minor bugs that
I was deferring until the Go port was finished.  Then I'll dive
back into the gcc conversion.

Barring an emergency on the NTPsec project, I should be able to 
concentrate on the conversion until it's done.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>

Our society won't be truly free until "None of the Above" is always an option.


Re: Reposurgeon status

2019-09-26 Thread Eric S. Raymond
Joseph Myers :
> On Thu, 26 Sep 2019, Eric S. Raymond wrote:
> 
> > > You might want to update the state of reposurgeon on that page.
> > 
> > I will do so.
> 
> Note that once you've created an account, someone will need to add it to 
> the EditorGroup page before you can edit.

I'm having trouble with basic account creation, actually.  It's to
all appearances not accepting the password I set up. I have sent a 
reset request.
-- 
    Eric S. Raymond <http://www.catb.org/~esr/>




Re: Reposurgeon status

2019-09-26 Thread Eric S. Raymond
Jeff Law :
> Probably the most important thing to know is the project will make a
> decision on Dec 16 to choose the conversion tool.  The evaluation is
> based on the state of tool's conversion on that date.  More details:
> 
> > https://gcc.gnu.org/wiki/GitConversion
> 
> You should consider the dates in there as firm.

I think it is extremely likely that I will have a final conversion ready by 
then.

The only known problem that is in any way serious is the x-bit
propagation bug, and I may already have fixed that. I think I'd have
to get blindsided by something much larger to miss that deadline.

> You might want to update the state of reposurgeon on that page.

I will do so.
-- 
        Eric S. Raymond <http://www.catb.org/~esr/>



