Hi Joey,

On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> > > I spent a lot of time trying to find an LKML archive in Maildir format
> > > that I could use for local searches with notmuch or something, but all
> > > the links I was able to find were all dead.
> >
> > You might instead use
> >
> > https://www.kernel.org/lore.html
> > https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/
>
> That was my first attempt, but the documentation for the public-inbox
> format is sort of terrible, and after a few hours trying to convert it
> to Maildir I just gave up.
>
> I ended up just slowly scraping lkml.org for a couple weeks so I
> wouldn't disrupt anything and it worked fairly well. Just looking for
> advice on where to host this now so others might be able to use it.

Now you've caught my attention. First of all, there are more than 3M
messages stored in the lkml.org database, so I guess you've missed some
messages or something is really broken. Besides, unless you figured out
how to get to the raw data, you've just scraped a rendering, which
discards stuff like PGP signatures and has very incomplete headers.
Unless you don't care for those of course :)

Note that I've also been toying with the lore dataset, and wrote a tiny
tool to get Maildir-like data out of it; this code is a bit of a
single-use jig, so you'll need to do some coding if you really want to
use it. Attached anyway.

All the best and enjoy,
Jasper

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
gitpython = "*"
ipython = "*"

[dev-packages]

[requires]
python_version = "3.7"

from email.parser import BytesParser
from email.message import EmailMessage
from email.policy import default
from git import Repo

our_last_id = '<[email protected]>' #'<[email protected]>'

repo = Repo('/Users/spaans/xsrc/lkml/lkml/git/6.git')
commit = repo.commit("master")
counter = 5000
froms = set()
while True:
    tree = commit.tree
    blob = tree['m']
    data = blob.data_stream.read()
    msg = BytesParser(policy=default).parsebytes(data)
    msgid = msg['Message-ID']
    from_ = msg['From']
    froms.add(from_)
    print(msgid)
    #import pdb; pdb.set_trace()
    if len(froms) > 1000:
        print("HAVE LOTS OF FRIENDS NOW")
        break
    if msgid == our_last_id:
        print("LADIES & GENTLEMEN, WE'VE GOT HIM")
        break
    parents = commit.parents
    if len(parents) != 1:
        print("WUH")
        break
    else:
        commit = commit.parents[0]
    #with open("output/%04d.eml" % counter, "bw") as f:
    #    f.write(data)
    counter -= 1

import pprint
pprint.pprint(froms)
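
The script above only walks the commits and leaves the actual writing
step commented out. As a rough sketch of how that step could produce a
real Maildir instead of numbered .eml files, the standard library's
mailbox module can be used. The helper name write_to_maildir and the
sample message are made up for illustration and are not part of the
attached tool; in the loop above, the bytes in `data` would be what gets
passed in.

```python
import mailbox
import os
import tempfile


def write_to_maildir(raw_messages, maildir_path):
    """Add each raw message (bytes) to a Maildir and return the new keys.

    Hypothetical helper, not part of the attached script. maildir_path
    should not exist yet: mailbox.Maildir only creates the tmp/new/cur
    subdirectories when it creates the top-level directory itself.
    """
    box = mailbox.Maildir(maildir_path, create=True)
    keys = []
    for raw in raw_messages:
        # Maildir.add() accepts a byte string and returns the message key.
        keys.append(box.add(raw))
    box.flush()
    return keys


if __name__ == "__main__":
    # Made-up sample message; real input would be the blob contents.
    sample = b"From: someone@example.org\nSubject: test\n\nbody\n"
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "lkml")
        keys = write_to_maildir([sample], target)
        print(len(keys), "message(s) written to", target)
```

Storing the untouched bytes rather than a re-serialised parsed message
should keep headers and PGP signatures as they were in the archive,
which is the main reason to prefer the raw data over a scraped rendering.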

