On 25/12/2019 11:02, Roman Zhuykov wrote: > First of all thanks to everyone who spent time making the conversion > better and better. Here is my 2c, I have studied a little my colleagues > trunk history in Maxim's gcc-pretty vs gcc-reposurgeon-5b. > > 1) In gcc-pretty timezone info is lost in both author/commiter date > (keeping UTC time correct, certainly). Examples are r278990 and r289989. > Probably git-svn causes this, current read-only git mirror is also > without timezone. Not sure we need that info, but reposurgeon is more > correct here. > > 2) Some thoughts about script for summarizing commit log messages: > 2a) Why r143753 and r150680 not have "re PR..." summary instead of > "[multiple changes]" ?
Both of these commits have more than one hunk with different authors, that triggers the heuristic to detect multiple independent changes that have been merged into a single commit (this is most common on branches, where it is common to aggregate a large number of backport commits - for those, using the first PR found as a key is rarely right even if there is only one PR mentioned). As Joseph said, providing a specific rule for these cases is possible. > 2b) On the contrary r155892 have to mention two PRs, even "[multiple > changes]" is better here, IMHO. For this one, the heuristic is that the PRs are likely to be related and that either *could* form a summary. We choose the first one mentioned. Looking at the commit log I cannot tell whether this is two independent fixes rolled into a single commit or two bugs that relate to the same change. > 2c) In r130050 and r155902 we have "Rename too ... " in summary, not > sure how to make it better. Well, we *could* try to extract the target function name for the rename of the traget, but I'm not about to try that right now. > 2d) r146882 can have better summary if we somehow organize ChangeLog > priority (gcc/ChangeLog is more important that testsuite one). Commit logs follow conventions quite weakly (if they followed them more strongly we wouldn't need the script at all, since all of them would have a summary line already ;-) As such, parsing them is not easy. The real question you need to answer is along the lines of: is 20071210-2.c: New testcase. a *less* useful summary than gcc/testsuite/Changelog: (which is what will appear if we do nothing in this case)? Unless the answer to that is 'no' then my script is working as intended, which is to try to produce something more useful than we would get by default. If we had another year, it might be possible to develop some AI that read the commit log, searched the mailing list the relevant email for the commit and then proceeded to solve the halting problem as something trivial along the side, but we aren't going to wait that long ;) More importantly, if you see commits with particularly egregious summaries in the trial conversions, please file a ticket at https://gitlab.com/esr/gcc-conversion.git and we can look into what suitable action might be needed. R. > > 3) About author emails, see below > 24.12.2019 21:14, Segher Boessenkool wrote: >> On Tue, Dec 24, 2019 at 05:16:54PM +0000, Joseph Myers wrote: >>> On Tue, 24 Dec 2019, Segher Boessenkool wrote: >>>>> That's because that commit also edits ChangeLog entries from other >>>>> authors. When a commit adds / edits ChangeLog entries for more >>>>> than one >>>>> author (the difference between purely editing an existing entry and >>>>> adding >>>>> a new one, possibly under an existing date/author header, for a >>>>> multi-author commit, is not something that can reliably be determined >>>>> automatically), the conversion falls back to using the committer >>>>> identity >>>>> instead of picking one of the multiple relevant authors from the >>>>> ChangeLog >>>>> files. >>>> There is only one relevant author in r270511. It edits a few wrong >>>> path >>>> names in the previous changelog entries. People often do similar >>>> things >>>> (like fixing the commit date :-) ) >>> Distinguishing "edits a previous ChangeLog entry" from "adds a new entry >>> under a previous ChangeLog header for a change included in the >>> commit" is >>> a human judgement. >> We are doing only one conversion here, the one of the GCC repo. The >> heuristic works, we checked it did. >> >>>> Either never use <account>@gcc.gnu.org, or always use it, don't do the >>>> worst of both worlds? >>> The heuristics here are to use an attribution from ChangeLog for the >>> author where unambiguous, but to use the committer (always >>> @gcc.gnu.org / >>> @gnu.org [*], so avoiding attributions at the wrong company even where >>> people were using multiple addresses simultaneously for different >>> changes) >>> as author if in doubt. >> You never need that, and it is worse to use two different schemes than to >> choose either. >> >> I would have chosen the "<account>@gcc.gnu.org" scheme, because it is >> simple and *correct*. Other people wanted the nicer names. Maxim's >> conversion gets that correct. Please copy it. >> > IMHO Segher is a bit categorical is the discussion, but I'll be glad to > see brief description of Maxim's approach to manage emails, gcc-pretty > shows better results. > Speaking about the script counting authors from ChangeLog files, even if > we drop an "edits a previous ChangeLog entry" issue, it still sometimes > work not as Joseph described: > 3a) In r155892, r155893 and r259314 Alex is not counted as the only > author without any reason. > 3b) In r139854, r141108 and r196252 script selected the author > successfully, while actually there are more that one. > 3c) Maybe here we can also somehow organize ChangeLog priority (again, > gcc/ChangeLog is more important that testsuite one). There are a lot of > examples, when testsuite/ChangeLog entry have another author: r145055, > r150680, r155889, r155894, r155890, r163904, r180186 and r183325. > 3d) If we fix 3b+3c we can also look at r143753, r155890 and r155895. > 3e) r155891, r207422, r183627 and r234218 are examples of commits which > don't touch any ChangeLog files for different reasons. Seems unsolvable > in current approach. > > -- > Roman