Stephen Finucane <step...@that.guru> writes: > On Sun, 2019-12-08 at 22:14 +1100, Daniel Axtens wrote: >> Daniel Axtens <d...@axtens.net> writes: >> >> > > Currently, the 'SeriesReference' object has a unique constraint on the >> > > two fields it has, 'series', which is a foreign key to 'Series', and >> > > 'msgid'. This is the wrong constraint. What we actually want to enforce >> > > is that a patch, cover letter or comment is referenced by a single >> > > series, or rather a single series per project the submission appears on. >> > > As such, we should be enforcing uniqueness on the msgid and the project >> > > that the patch, cover letter or comment belongs to. >> > >> > OK, so I've been working on this and making good progress. >> > >> > My fundamental issue with the current design, and this extension of it, >> > is that we're ignoring or working against the tools that databases >> > provide for concurrency. This patch is an improvement, no doubt, but I'd >> > really like to solve it properly. >> > >> > Because we have multiple processes operating on the database - processes >> > without external syncronisation - we need to rely on the database to >> > provide the synchronisation primitive. Currently we're doing things like >> > asking the database to check if something exists and then creating it if >> > it does not. This is embeds TOCTOU problems, similar to the ones I fixed >> > for Person creation a while back. What we should be doing is >> > get_or_create style - basically more aggressively attempting to create >> > the thing we need and if that fails, then do the lookup for the existing >> > object. >> > >> > This design pervades the series parser. It doesn't generally bite us >> > because we generally, more or less, parse in series. >> > >> > I'm basically refactoring the parser to be >> > 'try-creating-and-catch-failures' rather than >> > 'lookup-and-create-if-missing'. It is a bit tricky but I think I'm at >> > least 80% of the way there. >> >> The more I think about this patch, the more I see it does actually move >> us in the right direction. Let's run with it for now. > > That was going to be my response to your previous reply: does anything > I'm doing in that series wrt *the models* not make sense, and if not, > should we apply it anyway since the parser is easily reworked again in > the future, but model changes are a little more permanent.
I had wanted to make it so that you couldn't insert multiple series for the same thread by adding the concept of a 'head message' to a series. Then you could create a unique_together with (head message id, project) and that would catch all the problems much earlier. I did end up coding that, and it mostly works, but there are still edgecases and bugs that I was working through. >> As for my rewrite, well, series parsing is really hard. I'm also working >> from a data set that has a whole set of bizzare things including series >> of patches in-reply-to a cover letter, a series of patches that's half >> threaded and half unthreaded and a patch series that contains - by >> design - multiple patches per "number" (i.e. 3 patches each numbered >> 1/N, 3 patches numbered 2/N...), so that means I'm forever fighting >> weird edge-cases of how people used git a decade ago. So I'm going to >> pass on trying to rewrite it for now. > > Any chance you can share this dataset? The only one I have to work with > is the Patchwork one which isn't that large. I did request the DPDK > archives some time back but I haven't checked whether what I received > even works. I have an mid-2017 copy of Ubuntu kernel-team mailing list archive, which I probably picked because I worked there at the time. There's still a link to download it online, and now it's even bigger: https://lists.ubuntu.com/archives/kernel-team/ - right at the top of the page there's a link to 'download the full raw archive'. Some things to note is that the archive does have a bunch of messages without Message-Ids(!) - looks like some poorly coded bots for regular messages about kernel builds. There's also the usual browsable mailman archive which I find very useful. I use it with my parallel parser, although I've reduce the runahead distance to 20. Some problems with working on this archive are that it's just _so_ large it's hard to manually check things, and some of the really broken behaviours are now over a decade old. (Not all though, check out 'Wily [PATCH 00/22]- Please enable kconfig X86_LEGACY_VM86 for i386' at https://lists.ubuntu.com/archives/kernel-team/2015-October/thread.html ) >> I'll go over it with a finer-toothed comb and do a proper review, but I >> withdraw my overall objection. Still going to do this, but I was thinking about it this morning and I think you also need to wrap the series/seriesreference manipulation in the coverletter path in an atomic context... you've made some changes to that path, was there a reason not to do the same atomic context wrapping? Regards, Daniel >> Regards, >> Daniel >> > I would be happy to leave my series parsing rework and Mete's work to a >> > 2.3, or I'd be happy to try and grind through, get these done, and make >> > 2.2 the final release in the 2.x series. What do you think? >> > >> > Regards, >> > Daniel >> > >> > Stephen Finucane <step...@that.guru> writes: >> > >> > > This requires adding a new field to the object, 'project', since it's >> > > not possible to do something like the following: >> > > >> > > unique_together = [('msgid', 'series__project')] >> > > >> > > This is detailed here [1]. In addition, the migration needs a precursor >> > > migration step to merge any broken series. >> > > >> > > [1] https://stackoverflow.com/a/4440189/613428 >> > > >> > > Signed-off-by: Stephen Finucane <step...@that.guru> >> > > Closes: #241 >> > > Cc: Daniel Axtens <d...@axtens.net> >> > > Cc: Petr Vorel <petr.vo...@gmail.com> >> > > --- >> > > v4: >> > > - Further tweaks to logging >> > > v3: >> > > - Improve logging >> > > --- >> > > .../0038_unique_series_references.py | 121 +++++++++++++++++ >> > > patchwork/models.py | 6 +- >> > > patchwork/parser.py | 123 +++++++++--------- >> > > patchwork/tests/test_parser.py | 9 +- >> > > patchwork/tests/utils.py | 6 +- >> > > 5 files changed, 199 insertions(+), 66 deletions(-) >> > > create mode 100644 patchwork/migrations/0038_unique_series_references.py >> > > >> > > diff --git a/patchwork/migrations/0038_unique_series_references.py >> > > b/patchwork/migrations/0038_unique_series_references.py >> > > new file mode 100644 >> > > index 00000000..91799020 >> > > --- /dev/null >> > > +++ b/patchwork/migrations/0038_unique_series_references.py >> > > @@ -0,0 +1,121 @@ >> > > +from django.db import connection, migrations, models >> > > +from django.db.models import Count >> > > +import django.db.models.deletion >> > > + >> > > + >> > > +def merge_duplicate_series(apps, schema_editor): >> > > + SeriesReference = apps.get_model('patchwork', 'SeriesReference') >> > > + Patch = apps.get_model('patchwork', 'Patch') >> > > + >> > > + msgid_seriesrefs = {} >> > > + >> > > + # find all SeriesReference that share a msgid but point to >> > > different series >> > > + # and decide which of the series is going to be the authoritative >> > > one >> > > + msgid_counts = ( >> > > + SeriesReference.objects.values('msgid') >> > > + .annotate(count=Count('msgid')) >> > > + .filter(count__gt=1) >> > > + ) >> > > + for msgid_count in msgid_counts: >> > > + msgid = msgid_count['msgid'] >> > > + chosen_ref = None >> > > + for series_ref in SeriesReference.objects.filter(msgid=msgid): >> > > + if series_ref.series.cover_letter: >> > > + if chosen_ref: >> > > + # I don't think this can happen, but explode if it >> > > does >> > > + raise Exception( >> > > + "Looks like you've got two or more series that >> > > share " >> > > + "some patches but do not share a cover letter. >> > > Unable " >> > > + "to auto-resolve." >> > > + ) >> > > + >> > > + # if a series has a cover letter, that's the one we'll >> > > group >> > > + # everything under >> > > + chosen_ref = series_ref >> > > + >> > > + if not chosen_ref: >> > > + # if none of the series have cover letters, simply use the >> > > last >> > > + # one (hint: this relies on Python's weird scoping for for >> > > loops >> > > + # where 'series_ref' is still accessible outside the loop) >> > > + chosen_ref = series_ref >> > > + >> > > + msgid_seriesrefs[msgid] = chosen_ref >> > > + >> > > + # reassign any patches referring to non-authoritative series to >> > > point to >> > > + # the authoritative one, and delete the other series; we do this >> > > separately >> > > + # to allow us a chance to raise the exception above if necessary >> > > + for msgid, chosen_ref in msgid_seriesrefs.items(): >> > > + for series_ref in SeriesReference.objects.filter(msgid=msgid): >> > > + if series_ref == chosen_ref: >> > > + continue >> > > + >> > > + # update the patches to point to our chosen series instead, >> > > on the >> > > + # assumption that all other metadata is correct >> > > + for patch in Patch.objects.filter(series=series_ref.series): >> > > + patch.series = chosen_ref.series >> > > + patch.save() >> > > + >> > > + # delete the other series (which will delete the series ref) >> > > + series_ref.series.delete() >> > > + >> > > + >> > > +def copy_project_field(apps, schema_editor): >> > > + if connection.vendor == 'postgresql': >> > > + schema_editor.execute( >> > > + """ >> > > + UPDATE patchwork_seriesreference >> > > + SET project_id = patchwork_series.project_id >> > > + FROM patchwork_series >> > > + WHERE patchwork_seriesreference.series_id = >> > > patchwork_series.id >> > > + """ >> > > + ) >> > > + elif connection.vendor == 'mysql': >> > > + schema_editor.execute( >> > > + """ >> > > + UPDATE patchwork_seriesreference, patchwork_series >> > > + SET patchwork_seriesreference.project_id = >> > > patchwork_series.project_id >> > > + WHERE patchwork_seriesreference.series_id = >> > > patchwork_series.id >> > > + """ >> > > + ) # noqa >> > > + else: >> > > + SeriesReference = apps.get_model('patchwork', 'SeriesReference') >> > > + >> > > + for series_ref in SeriesReference.objects.all().select_related( >> > > + 'series' >> > > + ): >> > > + series_ref.project = series_ref.series.project >> > > + series_ref.save() >> > > + >> > > + >> > > +class Migration(migrations.Migration): >> > > + >> > > + dependencies = [('patchwork', '0037_event_actor')] >> > > + >> > > + operations = [ >> > > + migrations.RunPython( >> > > + merge_duplicate_series, migrations.RunPython.noop, >> > > atomic=False >> > > + ), >> > > + migrations.AddField( >> > > + model_name='seriesreference', >> > > + name='project', >> > > + field=models.ForeignKey( >> > > + null=True, >> > > + on_delete=django.db.models.deletion.CASCADE, >> > > + to='patchwork.Project', >> > > + ), >> > > + ), >> > > + migrations.AlterUniqueTogether( >> > > + name='seriesreference', unique_together={('project', >> > > 'msgid')} >> > > + ), >> > > + migrations.RunPython( >> > > + copy_project_field, migrations.RunPython.noop, atomic=False >> > > + ), >> > > + migrations.AlterField( >> > > + model_name='seriesreference', >> > > + name='project', >> > > + field=models.ForeignKey( >> > > + on_delete=django.db.models.deletion.CASCADE, >> > > + to='patchwork.Project', >> > > + ), >> > > + ), >> > > + ] >> > > diff --git a/patchwork/models.py b/patchwork/models.py >> > > index 7f0efd48..dbaff024 100644 >> > > --- a/patchwork/models.py >> > > +++ b/patchwork/models.py >> > > @@ -79,7 +79,8 @@ class Project(models.Model): >> > > webscm_url = models.CharField(max_length=2000, blank=True) >> > > list_archive_url = models.CharField(max_length=2000, blank=True) >> > > list_archive_url_format = models.CharField( >> > > - max_length=2000, blank=True, >> > > + max_length=2000, >> > > + blank=True, >> > > help_text="URL format for the list archive's Message-ID >> > > redirector. " >> > > "{} will be replaced by the Message-ID.") >> > > commit_url_format = models.CharField( >> > > @@ -787,6 +788,7 @@ class SeriesReference(models.Model): >> > > required to handle the case whereby one or more patches are >> > > received before the cover letter. >> > > """ >> > > + project = models.ForeignKey(Project, on_delete=models.CASCADE) >> > > series = models.ForeignKey(Series, related_name='references', >> > > related_query_name='reference', >> > > on_delete=models.CASCADE) >> > > @@ -796,7 +798,7 @@ class SeriesReference(models.Model): >> > > return self.msgid >> > > >> > > class Meta: >> > > - unique_together = [('series', 'msgid')] >> > > + unique_together = [('project', 'msgid')] >> > > >> > > >> > > class Bundle(models.Model): >> > > diff --git a/patchwork/parser.py b/patchwork/parser.py >> > > index c424729b..c0084cc0 100644 >> > > --- a/patchwork/parser.py >> > > +++ b/patchwork/parser.py >> > > @@ -16,6 +16,7 @@ import re >> > > >> > > from django.contrib.auth.models import User >> > > from django.db.utils import IntegrityError >> > > +from django.db import transaction >> > > from django.utils import six >> > > >> > > from patchwork.models import Comment >> > > @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail): >> > > for ref in refs: >> > > try: >> > > return SeriesReference.objects.get( >> > > - msgid=ref[:255], series__project=project).series >> > > + msgid=ref[:255], project=project).series >> > > except SeriesReference.DoesNotExist: >> > > continue >> > > - except SeriesReference.MultipleObjectsReturned: >> > > - # FIXME: Open bug: this can happen when we're processing >> > > - # messages in parallel. Pick the first and log. >> > > - logger.error("Multiple SeriesReferences for %s in project >> > > %s!" % >> > > - (ref[:255], project.name)) >> > > - return SeriesReference.objects.filter( >> > > - msgid=ref[:255], series__project=project).first().series >> > > >> > > >> > > def _find_series_by_markers(project, mail, author): >> > > @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None): >> > > except IntegrityError: >> > > raise DuplicateMailError(msgid=msgid) >> > > >> > > - # if we don't have a series marker, we will never have an >> > > existing >> > > - # series to match against. >> > > - series = None >> > > - if n: >> > > - series = find_series(project, mail, author) >> > > + for attempt in range(1, 11): # arbitrary retry count >> > > + try: >> > > + with transaction.atomic(): >> > > + # if we don't have a series marker, we will never >> > > have an >> > > + # existing series to match against. >> > > + series = None >> > > + if n: >> > > + series = find_series(project, mail, author) >> > > + else: >> > > + x = n = 1 >> > > + >> > > + # We will create a new series if: >> > > + # - there is no existing series to assign this >> > > patch to, or >> > > + # - there is an existing series, but it already has >> > > a patch >> > > + # with this number in it >> > > + if not series or Patch.objects.filter( >> > > + series=series, number=x).count(): >> > > + series = Series.objects.create( >> > > + project=project, >> > > + date=date, >> > > + submitter=author, >> > > + version=version, >> > > + total=n) >> > > + >> > > + # NOTE(stephenfin) We must save references for >> > > series. >> > > + # We do this to handle the case where a later >> > > patch is >> > > + # received first. Without storing references, >> > > it would >> > > + # not be possible to identify the relationship >> > > between >> > > + # patches as the earlier patch does not >> > > reference the >> > > + # later one. >> > > + for ref in refs + [msgid]: >> > > + ref = ref[:255] >> > > + # we don't want duplicates >> > > + try: >> > > + # we could have a ref to a previous >> > > series. >> > > + # (For example, a series sent in reply >> > > to >> > > + # another series.) That should not >> > > create a >> > > + # series ref for this series, so check >> > > for the >> > > + # msg-id only, not the msg-id/series >> > > pair. >> > > + SeriesReference.objects.get( >> > > + msgid=ref, project=project) >> > > + except SeriesReference.DoesNotExist: >> > > + SeriesReference.objects.create( >> > > + msgid=ref, project=project, >> > > series=series) >> > > + break >> > > + except IntegrityError: >> > > + # we lost the race so go again >> > > + logger.warning('Conflict while saving series. This is ' >> > > + 'probably because multiple patches >> > > belonging ' >> > > + 'to the same series have been received >> > > at ' >> > > + 'once. Trying again (attempt %02d/10)', >> > > + attempt) >> > > else: >> > > - x = n = 1 >> > > - >> > > - # We will create a new series if: >> > > - # - there is no existing series to assign this patch to, or >> > > - # - there is an existing series, but it already has a patch >> > > with this >> > > - # number in it >> > > - if not series or Patch.objects.filter(series=series, >> > > number=x).count(): >> > > - series = Series.objects.create( >> > > - project=project, >> > > - date=date, >> > > - submitter=author, >> > > - version=version, >> > > - total=n) >> > > - >> > > - # NOTE(stephenfin) We must save references for series. We >> > > - # do this to handle the case where a later patch is >> > > - # received first. Without storing references, it would not >> > > - # be possible to identify the relationship between patches >> > > - # as the earlier patch does not reference the later one. >> > > - for ref in refs + [msgid]: >> > > - ref = ref[:255] >> > > - # we don't want duplicates >> > > - try: >> > > - # we could have a ref to a previous series. (For >> > > - # example, a series sent in reply to another >> > > - # series.) That should not create a series ref >> > > - # for this series, so check for the msg-id only, >> > > - # not the msg-id/series pair. >> > > - SeriesReference.objects.get(msgid=ref, >> > > - series__project=project) >> > > - except SeriesReference.DoesNotExist: >> > > - SeriesReference.objects.create(series=series, >> > > msgid=ref) >> > > - except SeriesReference.MultipleObjectsReturned: >> > > - logger.error("Multiple SeriesReferences for %s" >> > > - " in project %s!" % (ref, >> > > project.name)) >> > > + # we failed to save the series so return the series-less >> > > patch >> > > + logger.warning('Failed to save series. Your patch with >> > > message ID ' >> > > + '%s has been saved but this should not >> > > happen. ' >> > > + 'Please report this as a bug and include >> > > logs', >> > > + msgid) >> > > + return patch >> > > >> > > # add to a series if we have found one, and we have a numbered >> > > # patch. Don't add unnumbered patches (for example diffs sent >> > > @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None): >> > > # message >> > > try: >> > > series = SeriesReference.objects.get( >> > > - msgid=msgid, series__project=project).series >> > > + msgid=msgid, project=project).series >> > > except SeriesReference.DoesNotExist: >> > > series = None >> > > - except SeriesReference.MultipleObjectsReturned: >> > > - logger.error("Multiple SeriesReferences for %s" >> > > - " in project %s!" % (msgid, project.name)) >> > > - series = SeriesReference.objects.filter( >> > > - msgid=msgid, series__project=project).first().series >> > > >> > > if not series: >> > > series = Series.objects.create( >> > > @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None): >> > > # we don't save the in-reply-to or references fields >> > > # for a cover letter, as they can't refer to the same >> > > # series >> > > - try: >> > > - SeriesReference.objects.get_or_create(series=series, >> > > - msgid=msgid) >> > > - except SeriesReference.MultipleObjectsReturned: >> > > - logger.error("Multiple SeriesReferences for %s" >> > > - " in project %s!" % (msgid, >> > > project.name)) >> > > + SeriesReference.objects.create( >> > > + msgid=msgid, project=project, series=series) >> > > >> > > try: >> > > cover_letter = CoverLetter.objects.create( >> > > diff --git a/patchwork/tests/test_parser.py >> > > b/patchwork/tests/test_parser.py >> > > index 0bf71585..0edbd87a 100644 >> > > --- a/patchwork/tests/test_parser.py >> > > +++ b/patchwork/tests/test_parser.py >> > > @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase): >> > > msgids = [make_msgid()] >> > > project = create_project() >> > > series_v1 = create_series(project=project) >> > > - create_series_reference(msgid=msgids[0], series=series_v1) >> > > + create_series_reference(msgid=msgids[0], series=series_v1, >> > > + project=project) >> > > >> > > # ...and three patches >> > > for i in range(3): >> > > msgids.append(make_msgid()) >> > > - create_series_reference(msgid=msgids[-1], series=series_v1) >> > > + create_series_reference(msgid=msgids[-1], series=series_v1, >> > > + project=project) >> > > >> > > # now create a new series with "cover letter" >> > > msgids.append(make_msgid()) >> > > series_v2 = create_series(project=project) >> > > - ref_v2 = create_series_reference(msgid=msgids[-1], >> > > series=series_v2) >> > > + ref_v2 = create_series_reference(msgid=msgids[-1], >> > > series=series_v2, >> > > + project=project) >> > > >> > > # ...and the "first patch" of this new series >> > > msgid = make_msgid() >> > > diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py >> > > index 577183d0..4ead1781 100644 >> > > --- a/patchwork/tests/utils.py >> > > +++ b/patchwork/tests/utils.py >> > > @@ -282,8 +282,12 @@ def create_series(**kwargs): >> > > >> > > def create_series_reference(**kwargs): >> > > """Create 'SeriesReference' object.""" >> > > + project = kwargs.pop('project', create_project()) >> > > + series = kwargs.pop('series', create_series(project=project)) >> > > + >> > > values = { >> > > - 'series': create_series() if 'series' not in kwargs else None, >> > > + 'series': series, >> > > + 'project': project, >> > > 'msgid': make_msgid(), >> > > } >> > > values.update(**kwargs) >> > > -- >> > > 2.23.0 _______________________________________________ Patchwork mailing list Patchwork@lists.ozlabs.org https://lists.ozlabs.org/listinfo/patchwork