Re: publicinbox watch path globbing

2023-11-19 Thread Robin H. Johnson
On Mon, Nov 20, 2023 at 12:10:01AM +, Eric Wong wrote:
> "Robin H. Johnson"  wrote:
> > Hi!
> > 
> > I'm looking into converting Gentoo's other (now-broken) archives web
> > interface to use public-inbox instead.
> > 
> > This is the first of a few questions/bumps along the way.
> > 
> > For historical reasons on the scaling side, the archive maildirs are
> > stored by date:
> > watch = maildir:$REDACTED/$LISTNAME/.21/
> > watch = maildir:$REDACTED/$LISTNAME/.200102/
> > watch = maildir:$REDACTED/$LISTNAME/.YYYYMM/
> > watch = maildir:$REDACTED/$LISTNAME/.202311/
> > etc.
> > (over time, directories are moved to stable read-only storage)
> 
> Is there any reason to expect new messages to appear in the /.2000??/
> and other old directories?
> 
> IOW, if somebody with a broken clock sends a message from a past
> year/month in the Date: header, does it end up in an old bucket
> or the current one?
The date is based on arrival time at the archive ingest.

For some of the very old lists, we do have a list of message-ids that we
know existed but aren't captured in the archive, and those mails have
been added to the old locations if they are ever found (maybe once a
year).

> 
> If your old buckets are frozen, lei in public-inbox.git should be
> able to start them off with:
> 
>   for d in "$REDACTED/$LISTNAME"/.??????
>   do
>   	lei convert -o "v2:/path/to/inbox-$LISTNAME" "maildir:$d"
>   done
>   lei daemon-kill # optional, stops lei-daemon when done
> 
> And then you'd only have to watch the latest maildir.
Any concerns during the month rollover period?
E.g. making sure that 202310 and 202311 are both watched right as time
increments from October to November: the archive ingest is likely to
write to 202311, but public-inbox may still be processing the last few
new messages in 202310.
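
A minimal sketch of one way to cover that window, assuming buckets are
named .YYYYMM by arrival time: keep the previous month's maildir in the
watch set for a grace period after the month changes. The function name
and the one-day grace period are hypothetical, not part of public-inbox.

```python
from datetime import datetime, timedelta


def buckets_to_watch(now: datetime, grace: timedelta = timedelta(days=1)) -> list[str]:
    """Return the ".YYYYMM" bucket names worth watching at `now`.

    The current month is always included. The previous month is also
    included while `now` is within `grace` of the start of the month.
    """
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    buckets = [now.strftime(".%Y%m")]
    if now - month_start < grace:
        # Any moment before month_start falls in the previous month.
        prev = month_start - timedelta(days=1)
        buckets.insert(0, prev.strftime(".%Y%m"))
    return buckets
```

A config generator could call this on each run and emit one watch line
per returned bucket.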

> > While I could generate the config file, I'm wondering about a better
> > solution, to allow globbing the path.
> 
> I wanted to have recursive watches at some point but never got
> around to it.  So I guess something like this could work recursively:
>   watchglob = maildir:$REDACTED/$LISTNAME/**
> 
> > I tried to locate a single place in the codebase where this would be
> > applied, but it's not clear enough to me if there's a single place
> > where it can easily be modified.
> 
> The `new' sub in lib/PublicInbox/Watch.pm sets up maildirs/imap/nntp
> 
> The glob2re function is better nowadays in public-inbox.git,
> and the mdre regexp will probably need to be updated when it sees
> a new maildir...
Thanks. I'd want to explicitly scope the glob to the dates.
Spam processing moves spam to .spam.YYYYMM, so the glob must not match
those directories.
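
As a rough illustration of that scoping (Python's fnmatch, not
public-inbox's glob2re), a glob of a dot followed by exactly six digits
matches the date buckets and skips the spam ones. The pattern and the
function name are assumptions for this sketch only.

```python
from fnmatch import fnmatchcase

# Hypothetical glob: a dot, then six digits shaped like YYYYMM.
DATE_GLOB = ".[12][0-9][0-9][0-9][01][0-9]"


def scoped(names, pattern=DATE_GLOB):
    """Return the directory names matching the date-only glob, oldest first."""
    return sorted(n for n in names if fnmatchcase(n, pattern))
```

Given `.200102`, `.202311` and `.spam.202311`, only the first two are
kept, because the character after the leading dot must be a digit.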

> > If there's a consistent place, I think the cleanest syntax that doesn't
> > break existing consumers would be something like this:
> > [publicinbox "$LISTNAME"]
> > watch = maildirglob:$REDACTED/$LISTNAME/.19/
> > watch = maildirglob:$REDACTED/$LISTNAME/.20/
> 
> I think `watchglob = maildir:...' is preferable since I don't
> want maildirglob: to be confused as a type.
Agreed, I see concerns there.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136




Alternate permalink URLs - for migration from other/custom archive solutions

2023-11-19 Thread Robin H. Johnson
Hi,

This is more of a feature request / request for pointers on how to tweak
the design to support something, and it might be better suited to being
maintained as a local patch.

The permalinks offered by public-inbox are great, but at Gentoo Linux,
we'd like to ALSO continue to offer our historical permalinks.

For those, the permalink slug portion was built when the mail arrived
into the archives ingest pipeline.

Example legacy link:
https://archives.gentoo.org/gentoo-dev/message/499b958da430b925dbd2f2b58e0f507e

We'd need to tweak the index somehow to expose it.

That same mail as visible in our public-inbox test site:
https://public-inbox.gentoo.org/gentoo-dev/538ce05eef3f4df3468cbc7f7abfa90eb2ea7d51.ca...@gentoo.org/raw

The permalink slug is in the header:
X-Archives-Hash: 499b958da430b925dbd2f2b58e0f507e

This needs to end up in the Xapian index (which doesn't seem to index
headers right now), and then get wired up as a route:
On access, redirect to public-inbox permalink.

Pointers on where in the codebase to wire up the Xapian side would be
greatly appreciated.
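
To make the intended redirect concrete, here is a sketch that builds the
lookup outside of Xapian, as a plain side table from X-Archives-Hash to
permalink. It is not how public-inbox indexes anything. The function
name and `base_url` are hypothetical, the URL shape (inbox URL, then
Message-ID, then a slash) is taken from the test-site link above, and
percent-escaping of characters such as "/" is an assumption.

```python
from email import message_from_bytes
from urllib.parse import quote


def legacy_redirects(raw_messages, base_url):
    """Map each X-Archives-Hash value to a Message-ID based permalink URL."""
    redirects = {}
    for raw in raw_messages:
        msg = message_from_bytes(raw)
        slug = msg.get("X-Archives-Hash")
        mid = msg.get("Message-ID")
        if not slug or not mid:
            continue  # nothing to redirect from, or nothing to redirect to
        mid = str(mid).strip().strip("<>")
        # Keep "@" readable; escape anything unsafe in a URL path segment.
        url = f"{base_url.rstrip('/')}/{quote(mid, safe='@')}/"
        redirects[str(slug).strip()] = url
    return redirects
```

A route handler for the legacy URL would look up the slug in this table
and answer with a redirect to the stored URL.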

-- 
Robin Hugh Johnson
Pronouns   : They/he
E-Mail : robb...@orbis-terrarum.net




publicinbox watch path globbing

2023-11-19 Thread Robin H. Johnson
Hi!

I'm looking into converting Gentoo's other (now-broken) archives web
interface to use public-inbox instead.

This is the first of a few questions/bumps along the way.

For historical reasons on the scaling side, the archive maildirs are
stored by date:
watch = maildir:$REDACTED/$LISTNAME/.21/
watch = maildir:$REDACTED/$LISTNAME/.200102/
watch = maildir:$REDACTED/$LISTNAME/.YYYYMM/
watch = maildir:$REDACTED/$LISTNAME/.202311/
etc.
(over time, directories are moved to stable read-only storage)

If a given low-traffic list does NOT get traffic in a given month,
the directory does not exist (it's created when the first mail arrives
during a calendar month).

Multiply this by ~120 lists, and it gets on the large side for a config
file: 7500+ lines just for the "watch" entries.
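
For reference, the generate-the-config approach could look roughly like
the sketch below: list the dated maildirs that actually exist for a list
and emit one watch line for each. The function names and the six-digit
directory pattern are assumptions; only the `[publicinbox "..."]` and
`watch = maildir:` syntax comes from the config shown above.

```python
import os
import re

# Assumed bucket naming: a dot followed by six digits (YYYYMM).
DATE_DIR = re.compile(r"\.\d{6}")


def watch_lines(root, listname):
    """Yield one watch line per dated maildir that exists on disk."""
    list_dir = os.path.join(root, listname)
    for name in sorted(os.listdir(list_dir)):
        path = os.path.join(list_dir, name)
        if DATE_DIR.fullmatch(name) and os.path.isdir(path):
            yield f"\twatch = maildir:{path}/"


def config_section(root, listname):
    """Return the publicinbox section text for one list."""
    lines = [f'[publicinbox "{listname}"]']
    lines.extend(watch_lines(root, listname))
    return "\n".join(lines) + "\n"
```

Months with no traffic produce no line, since their directories were
never created.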

While I could generate the config file, I'm wondering about a better
solution, to allow globbing the path.

I tried to locate a single place in the codebase where this would be
applied, but it's not clear enough to me if there's a single place
where it can easily be modified.

If there's a consistent place, I think the cleanest syntax that doesn't
break existing consumers would be something like this:
[publicinbox "$LISTNAME"]
watch = maildirglob:$REDACTED/$LISTNAME/.19/
watch = maildirglob:$REDACTED/$LISTNAME/.20/

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

