Re: Indicating the mirror's origin

2023-06-15 Thread Konstantin Ryabitsev
On Wed, Jun 14, 2023 at 11:50:15PM +, Eric Wong wrote:
> Konstantin Ryabitsev wrote:
> > Good day:
> > 
> > We've had a few requests to mirror public-inbox archives that originate on
> > other systems so they can also be searchable and viewable via
> > lore.kernel.org.
> > I've been dragging my feet on these requests, because they are a potential
> > liability in terms of GDPR compliance.
> 
> I just tried using `git replace' for the first time:

I think I didn't quite convey my idea -- let me try to step back a bit.

What I have is lore.kernel.org, which is actually 3 different frontends, all
pulling git repositories from some other origin. Currently, I have two
origins:

- lkml.kernel.org, which subscribes to external lists via regular SMTP
- subspace.kernel.org, which is our own mlmmj server and where public-inbox
  repositories are created via public-inbox-watch

Since we control both lkml and subspace, we are the origin of the data, so if
anyone requests archive removal, we can easily comply.

Now, I want to be able to add other external public-inbox repositories to be
mirrored on lore.kernel.org, but with some clear indication that we're not the
origin of that data, we're merely mirroring it. Any GDPR removal requests need
to be sent to $ORIGIN and we'll just propagate any changes.

>   git replace --edit $BLOB_OID

I don't want to go down that route, because while we can do such surgery on a
node, it would need to be rerun whenever we bring up a new mirror node, and
that step is almost guaranteed to be forgotten.

> I sometimes use the $INBOX_DIR/description file for that and it
> affects WWW and NNTP, but not IMAP/POP3.  I'm not sure if I want
> to reintroduce header injection in case there's some conflict
> with DKIM or other signature mechanisms[1]

I don't think we need to worry about it if we pick a header that's almost
certain to not be included in the default DKIM signature set.
X-Originally-Archived-At: or some similar custom header is very unlikely to
ever be signed.
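
As a rough way to check this for a given message, the sketch below lists the
header names a DKIM signature covers by reading its h= tag (RFC 6376). Adding
a header afterwards only invalidates the signature if that name appears in
h=. The helper name dkim_signed_headers and the sample signature value are
made up for illustration; this is not public-inbox code.

```perl
use strict;
use warnings;

# Return the lowercased header names listed in the h= tag of a
# DKIM-Signature header value (hypothetical helper).
sub dkim_signed_headers {
	my ($dkim_value) = @_;
	$dkim_value =~ s/\r?\n[ \t]+/ /g; # unfold continuation lines
	for my $tag (split /;/, $dkim_value) {
		my ($name, $val) = split /=/, $tag, 2;
		next unless defined $val;
		$name =~ s/\s+//g;
		next unless $name eq 'h';
		$val =~ s/\s+//g; # whitespace around ":" is insignificant
		return map { lc($_) } split /:/, $val;
	}
	return; # no h= tag found
}

# made-up signature value, for illustration only
my $sig = 'v=1; a=rsa-sha256; d=example.com; s=sel; ' .
	'h=From:To:Subject:Date:Message-ID; bh=...; b=...';
my %signed = map { $_ => 1 } dkim_signed_headers($sig);
for my $hdr (qw(Subject X-Originally-Archived-At)) {
	printf "%s: %s\n", $hdr,
		$signed{lc $hdr} ? 'signed' : 'not signed';
}
```

With the sample value above, Subject is reported as signed and
X-Originally-Archived-At as not signed, so injecting the latter would leave
that signature verifiable.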

-K



[PATCH] lei: make --dedupe=content always account for Message-IDs

2023-06-15 Thread Eric Wong
The content dedupe logic was originally designed for v2 public
inboxes as a fallback for when the importer sees identical
Message-IDs.  Thus it did not account for Message-ID(s) in
the message itself.

This change doesn't affect saved searches (the default when
writing to a pathname or IMAP).  It affects --no-save, and
outputs to stdout (even if stdout is redirected to a file).

Prior to this change, lei reused the v2 logic as-is without
accounting for Message-IDs anywhere with `--dedupe=content'
(the default).  This could cause messages to be skipped when
the content matches despite Message-IDs being different.

So with this change, `lei q --dedupe=content' will hash the
Message-ID(s) in the message to ensure messages with different
Message-IDs are NOT deduplicated.
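
To illustrate the effect, here is a simplified, self-contained sketch.  It is
not the real content_digest: toy_content_hash and the plain-hash message
layout are stand-ins invented for this example, and the real code covers more
headers and normalizes the body.  Only the "mid\0...\0" framing mirrors the
patch below.

```perl
use strict;
use warnings;
use Digest::SHA;

# Toy stand-in for content hashing.  $msg is a plain hash with
# "mids" (array ref), "subject" and "body" keys (assumed layout).
sub toy_content_hash {
	my ($msg, $hash_mids) = @_;
	my $dig = Digest::SHA->new(256);
	if ($hash_mids) {
		$dig->add("mid\0$_\0") for @{$msg->{mids}};
	}
	$dig->add("s\0$msg->{subject}\0");
	$dig->add("b\0$msg->{body}\0");
	$dig->hexdigest;
}

my $x = { mids => ['a@example.com'], subject => 'hi', body => "hello\n" };
my $y = { %$x, mids => ['b@example.com'] }; # same content, new Message-ID

for my $hash_mids (0, 1) {
	my %seen;
	my @kept = grep {
		!$seen{toy_content_hash($_, $hash_mids)}++
	} ($x, $y);
	printf "hash_mids=%d keeps %d of 2 messages\n",
		$hash_mids, scalar @kept;
}
```

Without hash_mids the second message is dropped as a duplicate; with it,
both messages are kept.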

Whether this change is a bug fix or introduces a regression
is debatable.  In my mind, it is better to err on the
side of showing too many messages rather than too few, even if
the actual contents of the message are identical.  Making saved
searches deduplicate without accounting for Message-IDs would be
more difficult, too.
---
 lib/PublicInbox/ContentHash.pm | 15 +++++++++++----
 lib/PublicInbox/LeiDedupe.pm   |  9 +++++++--
 t/lei_dedupe.t                 |  6 +++++-
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index fc94257c..95ca2929 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -54,16 +54,23 @@ sub content_dig_i {
 	$dig->add($s);
 }
 
-sub content_digest ($;$) {
-	my ($eml, $dig) = @_;
+sub content_digest ($;$$) {
+	my ($eml, $dig, $hash_mids) = @_;
 	$dig //= Digest::SHA->new(256);
 
 	# References: and In-Reply-To: get used interchangeably
 	# in some "duplicates" in LKML.  We treat them the same
 	# in SearchIdx, so treat them the same for this:
 	# do NOT consider the Message-ID as part of the content_hash
-	# if we got here, we've already got Message-ID reuse
-	my %seen = map { $_ => 1 } @{mids($eml)};
+	# if we got here, we've already got Message-ID reuse for v2.
+	#
+	# However, `lei q --dedupe=content' does use $hash_mids since
+	# it doesn't have any other dedupe
+	my $mids = mids($eml);
+	if ($hash_mids) {
+		$dig->add("mid\0$_\0") for @$mids;
+	}
+	my %seen = map { $_ => 1 } @$mids;
 	for (grep { !$seen{$_}++ } @{references($eml)}) {
 		utf8::encode($_);
 		$dig->add("ref\0$_\0");
diff --git a/lib/PublicInbox/LeiDedupe.pm b/lib/PublicInbox/LeiDedupe.pm
index 86cd8490..eda54d79 100644
--- a/lib/PublicInbox/LeiDedupe.pm
+++ b/lib/PublicInbox/LeiDedupe.pm
@@ -2,7 +2,7 @@
 # License: AGPL-3.0+ 
 package PublicInbox::LeiDedupe;
 use v5.12;
-use PublicInbox::ContentHash qw(content_hash git_sha);
+use PublicInbox::ContentHash qw(content_hash content_digest git_sha);
 use PublicInbox::SHA qw(sha256);
 
 # n.b. mutt sets most of these headers not sure about Bytes
@@ -69,7 +69,12 @@ sub dedupe_content ($) {
 	my ($skv) = @_;
 	(sub { # may be called in a child process
 		my ($eml) = @_; # $oidhex = $_[1], ignored
-		$skv->set_maybe(content_hash($eml), '');
+
+		# we must account for Message-ID via hash_mids, since
+		# (unlike v2 dedupe) Message-ID is not accounted for elsewhere:
+		$skv->set_maybe(content_digest($eml, PublicInbox::SHA->new(256),
+				1 # hash_mids
+				)->digest, '');
 	}, sub {
 		my ($smsg) = @_;
 		$skv->set_maybe(smsg_hash($smsg), '');
diff --git a/t/lei_dedupe.t b/t/lei_dedupe.t
index e1944d02..13fc1f3b 100644
--- a/t/lei_dedupe.t
+++ b/t/lei_dedupe.t
@@ -1,5 +1,5 @@
 #!perl -w
-# Copyright (C) 2020-2021 all contributors 
+# Copyright (C) all contributors 
 # License: AGPL-3.0+ 
 use strict;
 use v5.10.1;
@@ -10,6 +10,8 @@ use PublicInbox::Smsg;
 require_mods(qw(DBD::SQLite));
 use_ok 'PublicInbox::LeiDedupe';
 my $eml = eml_load('t/plack-qp.eml');
+my $sameish = eml_load('t/plack-qp.eml');
+$sameish->header_set('Message-ID', '');
 my $mid = $eml->header_raw('Message-ID');
 my $different = eml_load('t/msg_iter-order.eml');
 $different->header_set('Message-ID', $mid);
@@ -47,6 +49,8 @@ for my $strat (undef, 'content') {
 	ok(!$dd->is_dup($different), "different is_dup with $desc dedupe");
 	ok(!$dd->is_smsg_dup($smsg), "is_smsg_dup pass w/ $desc dedupe");
 	ok($dd->is_smsg_dup($smsg), "is_smsg_dup reject w/ $desc dedupe");
+	ok(!$dd->is_dup($sameish),
+		"Message-ID accounted for w/ same content otherwise");
 }
 $lei->{opt}->{dedupe} = 'bogus';
 eval { PublicInbox::LeiDedupe->new($lei) };



[PATCH] lei import: set +(L|kw) on already-imported blobs

2023-06-15 Thread Eric Wong
When import hits blobs it has already seen, we'll add labels
and keywords regardless, in order to match the behavior of other inexact
matches.  This is useful when importing exact copies of
messages which exist in multiple mailboxes.

I noticed this when I had a message imported from my normal IMAP
`INBOX', but also copied it to a different folder for future
reference.
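
The intended behavior can be sketched with a toy in-memory store.  This is
illustration only: toy_import, %store and the SHA-1 stand-in for the git
blob OID are invented here and are not part of LeiStore.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

my %store; # blob ID => { L => { label => 1 }, kw => { keyword => 1 } }

# Toy import: $vmd is { L => [labels], kw => [keywords] }.
# Returns true if the blob was new, false if it was already stored.
sub toy_import {
	my ($raw, $vmd) = @_;
	my $oid = sha1_hex($raw); # stand-in for the git blob OID
	my $is_new = !exists $store{$oid};
	my $ent = $store{$oid} //= { L => {}, kw => {} };
	# apply +L / +kw even when the blob is a duplicate
	for my $type (qw(L kw)) {
		$ent->{$type}->{$_} = 1 for @{$vmd->{$type} // []};
	}
	$is_new;
}

my $eml = "Message-ID: <x\@example.com>\n\nhello\n";
toy_import($eml, { L => ['inbox'], kw => ['seen'] });
toy_import($eml, { L => ['boombox'] }); # same blob, new label
toy_import($eml, { kw => ['answered'] }); # same blob, new keyword

my ($ent) = values %store;
print 'L=', join(',', sort keys %{$ent->{L}}), "\n";
print 'kw=', join(',', sort keys %{$ent->{kw}}), "\n";
```

The three imports leave a single stored entry carrying both labels and both
keywords, which is what the new tests in t/lei-import.t check for.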
---
 Documentation/RelNotes/v2.0.0.wip |  3 +++
 lib/PublicInbox/LeiStore.pm       |  8 +++++++-
 t/lei-import.t                    | 17 ++++++++++++++++-
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/Documentation/RelNotes/v2.0.0.wip b/Documentation/RelNotes/v2.0.0.wip
index cd90bdae..cccf11ae 100644
--- a/Documentation/RelNotes/v2.0.0.wip
+++ b/Documentation/RelNotes/v2.0.0.wip
@@ -60,6 +60,9 @@ lei
   * fix `lei q -tt' on locally-indexed messages (still broken for remotes:
     https://public-inbox.org/meta/20230226170931.M947721@dcvr/ )
 
+  * `lei import' now set labels+keywords consistently on all
+    already-imported messages
+
 solver (used by lei (rediff|blob), and PublicInbox::WWW)
 
   * handle copies in patches properly
diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index cf5a03a0..727de066 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -387,8 +387,14 @@ sub add_eml {
 		_lms_rw($self)->set_src($smsg->oidbin, @{$vmd->{sync_info}});
 	}
 	unless ($im_mark) { # duplicate blob returns undef
-		return unless wantarray;
+		return unless wantarray || $vmd;
 		my @docids = $oidx->blob_exists($smsg->{blob});
+		if ($vmd) {
+			for my $docid (@docids) {
+				my $idx = $eidx->idx_shard($docid);
+				_add_vmd($self, $idx, $docid, $vmd);
+			}
+		}
 		return _docids_and_maybe_kw $self, \@docids;
 	}
 
diff --git a/t/lei-import.t b/t/lei-import.t
index 6e9a853c..c9e668a3 100644
--- a/t/lei-import.t
+++ b/t/lei-import.t
@@ -1,5 +1,5 @@
 #!perl -w
-# Copyright (C) 2020-2021 all contributors 
+# Copyright (C) all contributors 
 # License: AGPL-3.0+ 
 use strict; use v5.10.1; use PublicInbox::TestCommon;
 test_lei(sub {
@@ -110,6 +110,21 @@ $res = json_utf8->decode($lei_out);
 is_deeply($res->[0]->{kw}, ['seen'], 'keyword set');
 is_deeply($res->[0]->{L}, ['inbox'], 'label set');
 
+# idempotent import can add label
+lei_ok([qw(import -F eml - +L:boombox)],
+   undef, { %$lei_opt, 0 => \$eml_str });
+lei_ok(qw(q m:in...@example.com));
+$res = json_utf8->decode($lei_out);
+is_deeply($res->[0]->{kw}, ['seen'], 'keyword remains set');
+is_deeply($res->[0]->{L}, [qw(boombox inbox)], 'new label added');
+
+# idempotent import can add keyword
+lei_ok([qw(import -F eml - +kw:answered)],
+   undef, { %$lei_opt, 0 => \$eml_str });
+lei_ok(qw(q m:in...@example.com));
+$res = json_utf8->decode($lei_out);
+is_deeply($res->[0]->{kw}, [qw(answered seen)], 'keyword added');
+is_deeply($res->[0]->{L}, [qw(boombox inbox)], 'labels preserved');
 
 # see t/lei_to_mail.t for "import -F mbox*"
 });