Re: [hackers] A better mailing list web archiver for suckless.org ... ?

Storkman Thu, 11 Aug 2022 22:34:43 -0700

On Wed, Aug 10, 2022 at 09:29:43PM +0200, Thomas Oltmann wrote:
> Hi all!
> 
> I think we can all agree that the current web archive over at
> lists.suckless.org isn't all that great;
> Author names get mangled, the navigation is terrible, some messages
> are duplicated, some missing.
> 
> That's why I've started looking into #3 of the 'Project Ideas' page
> (https://suckless.org/project_ideas/) -- "Write a decent mailing list
> Web archive system".
> I see lots of potential to build something better than hypermail:
> 
> - We could take text encodings more seriously.
>   hypermail just copies the 'charset' notice over into the HTML
>   file, which doesn't work when listing multiple messages.
> 
> - We could use maildir instead of the really brittle mbox format for
>   mailboxes.
>   This might also help avoid message dropping/duplication, but I'm not
>   sure about that.
> 
> - We could try a different navigation scheme. Perhaps flat threads
>   instead of a hierarchy?
>   I don't really know how people here feel about this, but it's
>   mentioned on the 'Project Ideas' page and I'm in favour of it.
>   Navigating message trees is really confusing.
> 
> - Bonus: We can ignore CGI, uuencode, HTML mail and all that cruft.
> 
> Is there currently any interest in such a project here?
> 
> So far, I've gone ahead and implemented a sort of proof-of-concept (at
> https://www.github.com/tomolt/mailarchiver).
> Of course I can't guarantee that this will go anywhere, as I only have
> limited time and patience myself, but I can give it a try.
> 
> Cheers,
>           Thomas Oltmann
>


Hi!

When you list all these features, it sounds like everything a mailing list
archive front-end does just replicates things our mail clients already
do better, and without going through a web browser.

So I thought, why not just serve the maildir files as-is, with monthly
and yearly tarballs, and perhaps metadata files so you don't need to
download everything just to make sure you've got an entire thread?
But then, that would require additional instrumentation and would make e.g.
referencing mailing list threads in commit messages slightly less convenient.

In any case, I messed with the code a bit, running it on my own archive
maildir. I've constructed a very crude threaded view[1], and came up with a
few fixes in the process.

Patch 2 is a rewrite of collapse_ws(), because I found it really hard to
figure out what exactly the original does and how. Your mileage may vary,
but I think the original code would underflow the buffer, writing the
terminating NUL one byte before its start, when given an empty or
all-whitespace input.
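
To illustrate what I mean, here is a minimal sketch. It assumes a simple
is_ws() (the real one in mailarchiver.c may differ), and
old_terminator_offset() is a made-up helper that replays the original loop
with an index instead of a pointer, so it only reports where the NUL would
land rather than actually writing out of bounds. collapse_ws() is the
version from patch 2.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed definition; the real is_ws() in mailarchiver.c may differ. */
static bool
is_ws(char c)
{
	return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper: replays the original collapse_ws() loop with an
 * index and returns the offset where it would store the final NUL. */
static ptrdiff_t
old_terminator_offset(const char *str)
{
	ptrdiff_t w = 0;
	bool inws = true;

	for (; *str; str++) {
		if (is_ws(*str)) {
			if (!inws) inws = true, w++;
		} else {
			inws = false, w++;
		}
	}
	if (inws) --w;	/* with no non-whitespace input, w becomes -1 */
	return w;
}

/* collapse_ws() as rewritten in patch 2. */
static void
collapse_ws(char *str)
{
	char *src, *dst;
	src = dst = str;
	while (is_ws(*src)) src++;	/* drop leading whitespace */
	while (*src) {
		*dst++ = *src++;
		if (is_ws(*src)) {
			while (is_ws(*src)) src++;
			if (*src) *dst++ = ' ';	/* no trailing space */
		}
	}
	*dst = 0;
}
```

A negative offset from old_terminator_offset() means the original would
have written in front of the buffer; the rewrite never moves dst below str.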

For patch 3, I've found some e-mails in the wild that used a lowercase
encoding letter ('q' or 'b') in encoded-words, and RFC 2047 says that's okay.
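
As an alternative to comparing against all four letters the way the patch
does, the encoding character could be normalised once up front. This is
only a sketch; encword_encoding() is a hypothetical helper, not something
that exists in mailarchiver.c.

```c
#include <ctype.h>

/* Hypothetical helper: returns 'Q' or 'B' for a valid encoded-word
 * encoding character in either case, or 0 for anything else. */
static char
encword_encoding(char c)
{
	/* Cast first: toupper() is undefined for negative char values. */
	c = (char)toupper((unsigned char)c);
	return (c == 'Q' || c == 'B') ? c : 0;
}
```

The rest of decode_encwords() could then keep comparing against the
uppercase letters only.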

Patch 4 might not be correct, because I'm not sure how decode_qprintable()
can ever return without error when parsing an encoded-word in a header.
It seems that it would just find the last "=" in "?=", set length to -2,
and return NULL. Maybe I'm just not getting it. It did manage to process
a few dozen more e-mails in my test runs, though.
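
Another way to keep the search inside the encoded text, instead of the
bounds check in patch 4, might be to use memchr() with the length rather
than strchr(). A rough sketch, where next_escape() is a hypothetical
helper and not part of mailarchiver.c:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the first '=' within the first
 * `length` bytes of text, or -1 if there is none in that range.
 * Unlike strchr(), this never sees the '=' of the closing "?=". */
static ptrdiff_t
next_escape(const char *text, size_t length)
{
	const char *eq = memchr(text, '=', length);
	return eq ? eq - text : -1;
}
```

For the content "abc" followed by "?=", strchr() would stop at the '=' of
the terminator, while next_escape() with length 3 reports that there is
nothing to decode.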

Hopefully I did this correctly and you can cherry-pick these commits
to your taste.

-- Storkman

  [1]: https://imgur.com/a/EbOblHt

From 0627261f197a1bdacdca24d36fb0aee9e5d2c11b Mon Sep 17 00:00:00 2001
From: Paul Storkman <stork...@storkman.nl>
Date: Thu, 11 Aug 2022 22:32:37 +0200
Subject: [PATCH 1/4] Rearrange process_field to reduce repetition.

---
 mailarchiver.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mailarchiver.c b/mailarchiver.c
index 4e52d78..5b9001e 100644
--- a/mailarchiver.c
+++ b/mailarchiver.c
@@ -350,19 +350,7 @@ process_field(char *key, char *value)
 {
        struct token token;
 
-       if (!strcasecmp(key, "From")) {
-               collapse_ws(value);
-               if (!decode_encwords(value)) return false;
-               mail.from = value;
-       } else if (!strcasecmp(key, "Subject")) {
-               collapse_ws(value);
-               if (!decode_encwords(value)) return false;
-               mail.subject = value;
-       } else if (!strcasecmp(key, "Date")) {
-               collapse_ws(value);
-               if (!decode_encwords(value)) return false;
-               mail.date = value;
-       } else if (!strcasecmp(key, "Content-Transfer-Encoding")) {
+       if (!strcasecmp(key, "Content-Transfer-Encoding")) {
                token = TOKEN_INIT(value);
                if (tokenize(&token) != TOKEN_ATOM) return false;
                if (!strcasecmp(token.atom, "7bit")) {
@@ -378,6 +366,17 @@ process_field(char *key, char *value)
                } else {
                        return false;
                }
+               return true;
+       }
+
+       collapse_ws(value);
+       if (!decode_encwords(value)) return false;
+       if (!strcasecmp(key, "From")) {
+               mail.from = value;
+       } else if (!strcasecmp(key, "Subject")) {
+               mail.subject = value;
+       } else if (!strcasecmp(key, "Date")) {
+               mail.date = value;
        }
        return true;
 }
-- 
2.34.1

From 516cd3c9e8f3cac77b1bfcee84eb323008ff506e Mon Sep 17 00:00:00 2001
From: Paul Storkman <stork...@storkman.nl>
Date: Thu, 11 Aug 2022 22:34:32 +0200
Subject: [PATCH 2/4] Rewrite collapse_ws to be more legible.

---
 mailarchiver.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/mailarchiver.c b/mailarchiver.c
index 5b9001e..9d11768 100644
--- a/mailarchiver.c
+++ b/mailarchiver.c
@@ -144,24 +144,21 @@ parse_header(char *header, bool (*field_cb)(char *key, char *value))
        return true;
 }
 
+/* Converts each run of whitespace in str to a single space. */
 void
 collapse_ws(char *str)
 {
-       char *rhead, *whead;
-       bool inws;
-
-       rhead = whead = str;
-       inws = true;
-       while (*rhead) {
-               if (is_ws(*rhead)) {
-                       if (!inws) inws = true, *whead++ = ' ';
-               } else {
-                       inws = false, *whead++ = *rhead;
+       char *src, *dst;
+       src = dst = str;
+       while (is_ws(*src)) src++;
+       while (*src) {
+               *dst++ = *src++;
+               if (is_ws(*src)) {
+                       while (is_ws(*src)) src++;
+                       if (*src) *dst++ = ' ';
                }
-               rhead++;
        }
-       if (inws) --whead;
-       *whead = '\0';
+       *dst = 0;
 }
 
 static bool
-- 
2.34.1

From 29bf6e8b5739b9642946afcf1a250129f853c1bd Mon Sep 17 00:00:00 2001
From: Paul Storkman <stork...@storkman.nl>
Date: Fri, 12 Aug 2022 02:28:55 +0200
Subject: [PATCH 3/4] Make encoding names in encoded-words case-insensitive.

RFC 2047 specifies that
"Both 'encoding' and 'charset' names are case-independent."
---
 mailarchiver.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mailarchiver.c b/mailarchiver.c
index 9d11768..a64973d 100644
--- a/mailarchiver.c
+++ b/mailarchiver.c
@@ -298,7 +298,7 @@ decode_base64(char *rhead, char *whead, size_t length)
 }
 
 /* Decode any 'Encoded Words' of the form =?charset?encoding?content?=
- * that may appear in header fields. */
+ * that may appear in header fields. See RFC 2047. */
 bool
 decode_encwords(char *str)
 {
@@ -315,7 +315,7 @@ decode_encwords(char *str)
                if (!(mark = strchr(rhead, '?'))) return false;
                rhead = mark + 1;
 
-               if (*rhead != 'Q' && *rhead != 'B') return false;
+               if (*rhead != 'Q' && *rhead != 'q' && *rhead != 'B' && *rhead != 'b') return false;
                encoding = *rhead++;
                if (*rhead != '?') return false;
                rhead++;
@@ -323,7 +323,7 @@ decode_encwords(char *str)
                if (!(mark = strchr(rhead, '?'))) return false;
                if (mark[1] != '=') return false;
 
-               if (encoding == 'Q') {
+               if (encoding == 'Q' || encoding == 'q') {
                        for (c = rhead; c < mark; c++) {
                                if (*c == '_') *c = ' ';
                        }
-- 
2.34.1

From 180e48d3e334bb720245d9d905228b86a721c57f Mon Sep 17 00:00:00 2001
From: Paul Storkman <stork...@storkman.nl>
Date: Fri, 12 Aug 2022 03:23:16 +0200
Subject: [PATCH 4/4] Terminate decode_qprintable when parsing encoded-words in
 headers.

---
 mailarchiver.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mailarchiver.c b/mailarchiver.c
index a64973d..3e70296 100644
--- a/mailarchiver.c
+++ b/mailarchiver.c
@@ -231,9 +231,12 @@ restart:
 char *
 decode_qprintable(char *rhead, char *whead, size_t length)
 {
-       char *eq;
+       char *eq, *end;
+       end = rhead + length;
 
        while ((eq = strchr(rhead, '='))) {
+               if (eq >= end)
+                       break;
                memmove(whead, rhead, eq - rhead);
                whead += eq - rhead;
                length -= eq - rhead + 1;
-- 
2.34.1
