Re: [Gossip] Porting digested new list archives to mail-archive
A more detailed response was sent over private mail, but the short answers are (1) yes, as per FAQ (2) thanks for the suggestion, will add it to the list of things to think about. ___ Gossip mailing list https://www.mail-archive.com/[email protected] https://www.mail-archive.com/cgi-bin/mailman/options/gossip
Re: [Gossip] Porting digested new list archives to mail-archive
OK, I'm ready with the archives to upload, with 2 questions before I do so (actually, only the first question is related to the upload):

1. Shall I create the mail-archive entry first, have it start auto-archiving new mail, and wait until that's up and running before sending you the old archives (to make sure that there are no gaps, for instance, and so that you already have a working repository to put them in)?

2. (Not directly related to my imminent import) In a previous question I had asked about having an "index" that allowed you to jump directly to a month/year rather than scrolling page by page via "Previous" and "Next", to which you replied:

> No active plans around the monthly index feature request. I can only guarantee such a thing would not happen any time soon. Same answer for the modifying the thread index. I do appreciate the feedback and interest and will keep in mind.

which is understandable from the standpoint of it being a new feature. However, I noticed that the "Previous/Next" links go to a page that has fixed links like https://www.mail-archive.com/[email protected]/mail2.html, .../mail3.html, .../mail4.html, etc. (and ../thrdN.html for the thread view). So it would seem a relatively simple matter to have a fixed footer on each index page (whether by date or thread):

Page 1 2 3 4 5 6 7 8 9 10 ...

going as far as the number of pages that exist. It would at least allow someone to one-click fast-forward to estimate a date (the current option is to realize that that's how it's structured by seeing the URL and then manually guessing at a number and editing the URL).

Shahrukh
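[Editor's sketch] The footer idea in the message above can be illustrated roughly as follows. This is only a sketch of the suggestion, not how The Mail Archive generates its pages. The names `page_links` and `render_footer` are hypothetical, and it assumes that page 1 lives at the list's base URL while later pages follow the mailN.html / thrdN.html pattern described in the message.

```python
def page_links(base_url, total_pages, prefix="mail"):
    """Return (page_number, url) pairs for a numbered index footer.

    Assumption: page 1 is the list's base URL; pages 2..N are
    <prefix>N.html ("mail" for the date index, "thrd" for threads).
    """
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    links = []
    for n in range(1, total_pages + 1):
        url = base_url if n == 1 else f"{base_url}{prefix}{n}.html"
        links.append((n, url))
    return links


def render_footer(links):
    """Render the pairs as a one-line HTML footer: Page 1 2 3 ..."""
    anchors = " ".join(f'<a href="{url}">{n}</a>' for n, url in links)
    return "Page " + anchors
```

For the thread view, the same function would be called with `prefix="thrd"`.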
Re: [Gossip] Porting digested new list archives to mail-archive
> This is probably going to work fine, and let's go ahead and give it a try. If it doesn't work, we'll discuss, figure it out, and try again.

Yeah, I'm probably over-analyzing, but getting to this point was definitely useful since I now have more confidence about what my scripts need to do and what they don't. I will need several more days before I have things in shape to send you, at which point I'll send an email directly to support with the pertinent links to the files and a simple README with anything I think may help you figure out what is what. Thanks for all the help, and to Earl and Matt for chipping in too.

Shahrukh
Re: [Gossip] Porting digested new list archives to mail-archive
Yes, you can safely leave out To, Message-id, and Received. Consequences are what you'd expect, like the inability to do a message-id search and find that particular message. You are correct. Posting address is manually assigned during the bulk import process, and automatically determined from headers for regular inbound mail. Think of it as if Mail Archive was trying to put every message in a folder, where the folder name is the posting address. The folder name is indexed, so is available for search. You are exactly correct about the 'l' parameter. This is probably going to work fine, and let's go ahead and give it a try. If it doesn't work, we'll discuss, figure it out, and try again.
Re: [Gossip] Porting digested new list archives to mail-archive
Jeff,

Thanks again--more things are clear now. But your response raises more questions in my mind as well. Please bear with me, we're almost there I think.

Keeping in mind that I am splitting Digests into individual messages, I have to fake whatever headers are not already there within the individual messages. In the case of my digests, the existing headers are only:

Date:
From:
Subject:

To this I add a ^From_ line just before the Date: line to make it an mbox format (this is taken from the Digest header and copied in front of each email in the digest, so it's identical for several emails). An example of this ^From_ line is this:

From - Mon Nov 28 11:00:05 2005

Now, with this background, here are my further questions:

1. I thought I would have to add a "To:" line to my "faked" headers, but you are saying that it is never used. Is it OK if the To: line is completely absent?

> The only things indexed for search are: message-id,

2. Will it choke if there is no Message-ID? I know this is used in threading, but in an earlier email you said that in the absence of Message-ID it would thread on Subject. I just want to make sure it won't reject the email without a Message-ID field.

> ... subject ... sender name (extracted from From: header)

No issue on these.

> ... date (usually extracted from the Received: header)

3. Hmm, again, there is no Received header. Will it properly take the date from the Date: field in that case and not choke on the absence of any Received headers? The date field format (from an actual example) is:

Date:Mon, 28 Nov 2005 17:55:14 +1100

> ... posting address (for example, [email protected] ... Every message is
> sorted and organized according to posting address ... but the To: header
> is never indexed for search, never used during import, and there is no
> benefit for you to adjust it.

4. This is where I'm most confused now. Where *is* the posting address extracted from if not from the To: header? Is it an internal field in your archive message database that is (a) predetermined manually in an import, (b) mapped to a fixed internally stored name for new incoming email (based on headers including To:) and nothing else?

In your earlier response, where I had asked about the varying forms of To: addresses in my old archives that needed to be imported (e.g., [email protected], [email protected], [email protected]) in terms of confusing search (since I incorrectly imagined that the To: line would be looked for in the search), you had replied:

> Search will have no concept of alternative list names. There is no reasonable way to overcome this.

but now you say search never looks at the To: lines and they aren't used in imports either. So in light of your latest response I don't understand why search would have an issue of "alternative list names"--they are alternative To: lines but the same list, and the variations exist only in archives--new email would have a consistent To: line reflecting the current posting address.

> A merged archive will have the same posting address for every message, with no memory about what life was like before the merge.

OK, this is consistent with the "To:" line never being used for search. So the "l=" parameter in the search would always have to be the new list name following a merge, correct? That shouldn't be an issue.

Shahrukh
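[Editor's sketch] The header-faking step described in the message above might look roughly like this in Python. It is a minimal sketch, not the author's script (which the thread says was planned in awk). The function name `to_mbox_entry` is hypothetical. It assumes each split-out message carries only Date, From, and Subject headers, and that the digest's own From_ line is reused for every message, as the message describes.

```python
def to_mbox_entry(from_line, headers, body):
    """Build one mbox entry from a message split out of a digest.

    from_line -- the mbox "From " separator copied from the digest,
                 e.g. "From - Mon Nov 28 11:00:05 2005"
    headers   -- dict with the three surviving headers
    body      -- message text
    """
    lines = [from_line]
    for name in ("Date", "From", "Subject"):
        lines.append(f"{name}: {headers[name]}")
    lines.append("")  # blank line separates headers from body
    for line in body.splitlines():
        # A body line starting with "From " would look like a new
        # message separator, so escape it as ">From ".
        if line.startswith("From "):
            line = ">" + line
        lines.append(line)
    lines.append("")  # blank line terminates the entry
    return "\n".join(lines) + "\n"
```

Concatenating the returned strings for every message in every digest would give a single mbox file with Unix newlines.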
Re: [Gossip] Porting digested new list archives to mail-archive
The only things indexed for search are: message-id, subject, date (usually extracted from the Received: header), sender name (extracted from From: header), posting address (for example, [email protected]), archival message number, and message body. Every message is sorted and organized according to posting address. The To: header is examined when sorting regular inbound mail and is a factor when deciding where it belongs. But the To: header is never indexed for search, never used during import, and there is no benefit for you to adjust it. A merged archive will have the same posting address for every message, with no memory about what life was like before the merge.
Re: [Gossip] Porting digested new list archives to mail-archive
> 1. Yes, we override list name on import.

OK, so they are threaded and paginated independently of what's in the "To:" line.

> 2. Search will have no concept of alternative list names. There is no reasonable way to overcome this.

Hmm, I don't understand this, given your answer to (1) above. If the primary list name is [email protected], then by your convention, it seems that going to www.mail-archive.com/[email protected]/ would bring up the "home page" for that list archive. There is no way to restrict a search just to "pages under this page"? I.e., the only way to get a particular list is to match explicitly the contents of the "To:" field for that list?

> 3. Why not use the tool that Earl mentioned?

Because it seems with nmh I would have to do it manually on a digest by digest basis. So with hundreds of digests and thousands of created individual files which I would then need to remerge, I really need a script that processes an mbox of hundreds of Digests in one fell swoop, creating a slightly smaller compliant mbox of all the individual emails, which is the script that I have just finished writing. Besides, if I have to do additional processing, e.g., to replace old To: lines with the new list name to avoid the search problem, I need a script anyway (albeit a simpler one). (I'll make this script available once I'm done, so it will have been completely debugged and well tested, since it may be useful to others in the future.)

> 4. We always merge into the new list name and set up an HTTP redirect so that the old URLs are not broken. Merges are done via a manual request to support staff.

OK, that's clear. But then it would have the same search issue, would it not? I.e., while the merged list would appear as one in the summary and threads pages, a searcher would have to search for one or the other (or use an OR operator in the search, which seems like another way to overcome this, i.e., create a search with a hidden field containing the OR of all variants).
Shahrukh
Re: [Gossip] Porting digested new list archives to mail-archive
1. Yes, we override list name on import. 2. Search will have no concept of alternative list names. There is no reasonable way to overcome this. 3. Why not use the tool that Earl mentioned? 4. We always merge into the new list name and set up an HTTP redirect so that the old URLs are not broken. Merges are done via a manual request to support staff.
Re: [Gossip] Porting digested new list archives to mail-archive
Still working on processing the old email digests to convert them to individual emails in mbox format for import. Meanwhile, though, I thought of a new issue which has to do with identifying the list name from the email headers, given that the list name (and consequently the email address in the To: line) varies.

The list is currently identified in the To: line as [email protected] and the last 10 years' archives should be consistent (that's the easy part). Before that, though, it was hosted with Listserv software on a different host and was identified as [email protected], which in some very old digests ca. 1994 shows up as tango-l%[email protected] (the uga.edu part could vary depending on what BITNET-to-Internet gateway was used). In addition, the list was called TANGO for a short while before it was changed to TANGO-L. In summary, we have the following, which are all the same list:

[email protected]
[email protected]
tango-l%[email protected] (with possibly gateways other than uga.edu)
[email protected]
tango%[email protected] (with possibly gateways other than uga.edu)

So, with this background, my questions are:

1. Since this is for a manual archive import (as opposed to incoming email that has to be filtered intelligently), you would have all the files and could presumably just force it to go to the same database. Is this true?

2. Even if you did that, though, would searches work reliably? I would want the first of the above addresses ([email protected]) to be the primary address by which the list would be identified, so for search queries, people would always identify the list as [email protected], as that's what it's been for the last 10 years. But would ALL the above variants then be included in the search? I.e., is there an internal tag created with the archive that identifies it with just one email address for queries, regardless of what's in the "To:" field of an individual message? No one is going to be searching for this list with anything other than "Tango-L" (most likely) or "[email protected]", notwithstanding the alternate forms.

3. Any other things to think about regarding this issue? I have resigned myself to writing scripts for splitting some old digests into mbox-format individual emails for import, so if there is something else I need to do on each message to address this new issue, I could incorporate it into the script. I could just force all the "To:" headers to [email protected], obliterating all the previous forms, which clearly would solve the problem, but do I need to? (It is extra work, and some archives are already in mbox format that I would otherwise not need to touch, but would have to in order to make this change.)

4. If, in the future, the list were to move to a different host and have an address like [email protected] (a "permanent" change), how could I ensure that it continued to go to the same archive? And would there be a single list tag in a search query that would get posts from that archive (and only that archive) regardless of whether the posts were pre-move or post-move? In this hypothetical situation people may indeed search with the mit.edu domain OR the tango-L.com domain, and ideally it would call up the same combined archive.

(I did read the FAQ article on "My list splits into multiple archives" which touches on this issue, but it seems relevant mostly to incoming mail filtering rather than old archive processing.)

Shahrukh

On 4/15/2015 2:29 AM, Jeff Breidenbach wrote:
> Statute of limitations is typically 3 kilomessages on a normal non-import list, but should (I think) be unlimited on bulk import. Conversion to unix newlines is required and is manual; doesn't matter who does it. Still prefer to do whole import at once especially if tricky; less labor, also less likely to break URLs if it takes multiple attempts to get it right. But we can accommodate two stages. Imports are done on weekends only.
Re: [Gossip] Porting digested new list archives to mail-archive
Statute of limitations is typically 3 kilomessages on a normal non-import list, but should (I think) be unlimited on bulk import. Conversion to unix newlines is required and is manual; doesn't matter who does it. Still prefer to do whole import at once especially if tricky; less labor, also less likely to break URLs if it takes multiple attempts to get it right. But we can accommodate two stages. Imports are done on weekends only.
Re: [Gossip] Porting digested new list archives to mail-archive
On 4/14/2015 9:25 PM, Jeff Breidenbach wrote:
> * I recommend doing the import all at once, rather than in stages. Not for technical reasons, it just saves manual labor.

OK, I may do it in 2 stages, since half the archives are in mbox format that can be imported instantly. The other half are in Digests, for which I need to create the scripts and then process a few thousand emails through them, which could take some weeks. But no more than 2 stages, if that's OK ...?

> * Happy to make a tarball of the HTML after the import. ... Either way, you would be totally on your own from there;

Sounds good and more than reasonable.

> * Threading is done by MHonArc and discussed here. http://www.mail-archive.com/faq.html#threading

OK, that was helpful. Basically, References: and In-Reply-To: are used first, but in their absence, Subject: matching is used. That works. It didn't, however, answer my "statute of limitations" question, i.e., in the absence of Message-ID: clues, and relying on Subject: only, whether there is a time span beyond which it would not link two likely different threads that happened to have the same subject (e.g., someone used the same subject but a year later). Not a big deal anyway--just curiosity.

On 4/14/2015 11:27 AM, Earl Hood wrote:
> For MIME digest messages, MUAs like nmh are able to extract such messages out into individual files, which can be subsequently packed into mbox format.

I have hundreds of digests, each with perhaps a dozen messages to process, so it needs to be a script that basically creates an "mbox of all messages" from an "mbox of Digests" in one (or a few) fell swoop. I have a pseudo-code awk program written that should do that, which I need to further code into real awk, but it should work; it's just a question of time to write and debug it, and then to process all the files through it and verify that it worked OK.

I did find a sed script at http://sed.sourceforge.net/grabbag/scripts/splitdig.sed that allegedly does this, but while it's instructional, it doesn't seem robust enough, seemingly relying on any line with all dashes (as few as one!) as being a demarcation, whereas that could easily occur in the text (preceding a signature, for example, or as a separator). Besides, sed is a "write-only" language (you can program it as you go if you know it well enough, but good luck going back and figuring out what a sed program actually does and how to modify or tweak it!). :-)

One more question: Some of the "mbox" files that I propose to submit are in fact Thunderbird mail folders. As far as I can tell, they conform entirely with mbox format (leading ^From_ line, escaped >From if leading in body), but does anyone know of any "gotchas" with this? Also, my files are all Windows based, but I assume CR/LF vs. LF is handled automatically and trivially, correct? (Or should I do the conversion to Unix format myself?)

Shahrukh
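[Editor's sketch] For the CR/LF question in the message above, a conversion plus a quick sanity check could be sketched as below. The function names are hypothetical. The check only counts lines beginning with "From ", so it assumes body lines have already been escaped as ">From " (as the message says Thunderbird does). It says nothing about how The Mail Archive processes files on its side.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows CR/LF line endings to Unix LF."""
    return data.replace(b"\r\n", b"\n")


def count_from_lines(data: bytes) -> int:
    """Count mbox separator lines (lines that start with 'From ').

    With body lines properly escaped as '>From ', this should equal
    the number of messages in the file.
    """
    return sum(1 for line in data.split(b"\n") if line.startswith(b"From "))
```

Working on bytes rather than text avoids guessing the character encoding of decades-old messages. Comparing the separator count before and after conversion is a cheap way to confirm nothing was lost.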
Re: [Gossip] Porting digested new list archives to mail-archive
* I recommend doing the import all at once, rather than in stages. Not for technical reasons, it just saves manual labor.

* Happy to make a tarball of the HTML after the import. It will look like basic MHonArc output and cosmetically differ quite a bit from what is served, because there is significant cosmetic alteration during serving time. If you choose to scrape instead, I don't have a favorite tool to recommend. Either way, you would be totally on your own from there; we didn't design for this and would not help with harder stuff like broken links, local search, etc.

* No active plans around the monthly index feature request. I can only guarantee such a thing would not happen any time soon. Same answer for modifying the thread index. I do appreciate the feedback and interest and will keep it in mind.

* Threading is done by MHonArc and discussed here. http://www.mail-archive.com/faq.html#threading
Re: [Gossip] Porting digested new list archives to mail-archive
On Mon, Apr 13, 2015 at 11:19 AM, Matt Morgan wrote:
>> 2. Now the harder one. From Sep 1994 (inception) to Apr 2006, the lists
>> were hosted using L-Soft's LISTSERV software, which did not keep archives.
>> However, I have a complete set of all traffic from that time period, but
>> they are all in Daily Digest format, i.e., with a "Table of Contents" in the
>> front and several emails afterwards. I have MOST (but not all) of these
>> available as MIME digests with each message in a different MIME multipart
>> segment. I also have ALL of them available as a non-MIME digest, with a
>> fixed text separator (like a row of ) between messages. I would propose
>> to send these as an mbox format of digest files but each email in each
>> digest message would still need to be separated out. (a) Can mail-archive do
>> this digest parsing, or do I need to find or write a script to do this
>> myself? (b) If mail-archive can do it, do you have a preference for MIME vs.
>> non-MIME digest? (c) And if MIME, can you handle the few for which I only
>> have non-MIME digests?
>
> I can't help with this one; skipping.

For MIME digest messages, MUAs like nmh are able to extract such messages out into individual files, which can be subsequently packed into mbox format.

If you have mhonarc installed, you can use the mha-decode with the -dcd-digest option to have all digest messages extracted into separate files.

--ewh
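[Editor's sketch] As an alternative illustration of the MIME-digest extraction described in the message above, the Python standard library's `email` package can pull the enclosed messages out of a multipart/digest. This is not the nmh or MHonArc approach the message recommends, only a sketch of the same idea. The function name `extract_digest_messages` is hypothetical, and it assumes each enclosed message is a message/rfc822 part.

```python
import email


def extract_digest_messages(raw: bytes):
    """Return the individual messages enclosed in a MIME digest.

    Assumption: each digest entry is a message/rfc822 part, which is
    how multipart/digest normally carries its messages.
    """
    digest = email.message_from_bytes(raw)
    messages = []
    for part in digest.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element
            # list holding the enclosed message object.
            messages.append(part.get_payload(0))
    return messages
```

Each returned object exposes its headers by name (for example `msg["Subject"]`), so it could be written out to an mbox file with whatever From_ line is appropriate.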
Re: [Gossip] Porting digested new list archives to mail-archive
Thanks Matt and Jeff for your answers--they were very helpful. So, as I understand it:

- Sending a link to the mailman raw archives will take care of all posts from 2006 to present (for which the mailman archives exist).

- Listserv digest format to individual email mbox format conversion (for the older 1994-2006 articles) I'll have to do on my own. :-( Well, I'm surprised it hasn't come up before, but I guess I'll make the script available for others after I do it (more likely in awk and ksh than perl since that's what I know better, though I don't have a Unix system these days so who knows ...?).

- Sending archives in segments and out of chronological order is not a problem; threading and the user interface will not be confused by this but will simply be recalculated automatically (or manually by support each time they import old archives?). E.g., if I send the mailman mbox first (since it's easy) and the earlier digests later (since I have to work on them).

- You're willing to make an exception (or try ...) to have all posts indexed for my list, especially given that it's static, rather than just the last 3000 (but question on this below).

- I can mirror the archive on my own site if I want (again, question on this below).

Please let me know if any of the above is not correct.

Four further questions:

1. Does mirroring to my own site involve installing cgi-bin or .php scripts and so on? In which case are there instructions for that? Or just a wget-like static copy? It seems like the search and email obfuscation features at least would require scripts, no?

2. I noticed though that the user interface requires clicking one page at a time to go back +/- 100 messages or so at a time and there's no other navigation method. This seems unwieldy--I can see people getting tired of clicking more than 30 times to go back 3000 messages. Is it not possible to have a year/month hyperlinked index instead, like

2015 - Jan | Feb | Mar | Apr | May (final)
2014 - Jan | Feb | ... | Dec
...
1995 ...
1994 - Apr (inception) | May | Jun | ... | Dec

on the side or top or bottom for direct access to the time frame of interest (plus of course Prev Page | Next Page to "scroll" within the month range in question)? Of course I can do that by manually setting hyperlinks on my own backup version of the archive, especially given that it's static, but it would be great to have the mail-archive version do it directly.

3. Is threading based on subject-line matching only (and if so, what if the same subject happened to appear say 3 years later in a completely unrelated thread)? Or is it based on In-Reply-To: type parsing and linking? (The Digests have these "extraneous" headers stripped out, leaving only To/From/Date/Subject, so I need to know if splitting the Digests into their individual posts will break threading.)

4. In thread view, is there a way to have the date of the post appear next to each item (e.g., after the author name)? E.g.,

Re: Search returning 404 Jeff Breidenbach 2013-05-31

since the date provides a useful additional context in the threads view.

Regards,
Shahrukh
Re: [Gossip] Porting digested new list archives to mail-archive
First, it is very common and super easy to directly import from a mailman (pipermail) archive. If the pipermail archive is publicly online, just supply the URL to the support team.

The Mail Archive does not split digests back into individual messages. That's way too scary. If a digest is presented during import, it will be archived as a single large message. It might be possible to split a digest into a multi-message mbox, but I have no experience or advice to give. Consult your favorite internet search engine, the mailman-users community, or perhaps someone here can give guidance.

I'd like to clarify some terminology in the FAQ. Cold storage means sitting on a disk in a closet somewhere, completely disconnected from the internet. The only thing in cold storage is some raw mail from years gone past. Anything we serve is in a processed format and very much online. Everything imported is very much live and online. The 3000 limit is simply how far back index pages go; people get tired clicking after a while and we also limit them for performance reasons.

So on one hand The Mail Archive doesn't really offer what you are looking for, which is monthly indexes going back forever. But you can get something similar using the search engine. For extra fun, click on the expand button after following the link. We might make such links more prominent in the user interface in the future, if there is demand. I've heard mixed feedback so far.

http://www.mail-archive.com/search?q=date%3A200308*&l=gossip%40mail-archive.com

While customizations are possible, we tend to forget about them after enough time has passed (say 10 years) and then accidentally break them with some code change. So I wouldn't generally encourage that approach. If it did happen, no money would be involved. An inactive list is more likely to 'maintain' this type of exception over time than an active one. All right, fine, we can try it, on a best effort basis; mention it to the support team during import.

Regarding export, there's no problem for you to copy or mirror your archive's HTML. Some people do this as an extra backup, which is great. I haven't yet heard of folks doing this because of unhappiness with the user interface.
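[Editor's sketch] The search link quoted in the message above suggests a simple way to build per-month links for a list. The sketch below reproduces that URL shape with the standard library. The function name `monthly_search_url` is hypothetical, and the only things assumed about the search engine are the `q=date:YYYYMM*` and `l=` parameters visible in that one example URL.

```python
from urllib.parse import urlencode

SEARCH_BASE = "http://www.mail-archive.com/search"


def monthly_search_url(list_address, year, month):
    """Build a search URL matching all messages from one month,
    following the pattern of the example link in the message above."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    params = {
        "q": f"date:{year:04d}{month:02d}*",
        "l": list_address,
    }
    # Keep '*' literal so the result matches the example URL exactly.
    return SEARCH_BASE + "?" + urlencode(params, safe="*")
```

Calling this for every month since a list's inception would give the kind of year/month index requested earlier in the thread, hosted on one's own page and pointing at the archive's search.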
Re: [Gossip] Porting digested new list archives to mail-archive
On 04/13/2015 12:57 AM, Shahrukh Merchant wrote:
> I have two discussion lists on the Argentine Tango that are probably going to be suspended going forward owing to lack of activity in the face of many competing technologies in recent years, but that have a treasure of information dating back from 1994.

Sounds very cool.

> I would like to get these onto mail-archive but there are some peculiarities of the existing archives that I have some questions on. Here are the questions:
>
> 1. First the easy one. From April 2006 to the present, the lists were hosted using mailman, so I have the complete raw mailman archives that I've downloaded. They are in one big mbox-format file (about 50 MB). (a) This is I suppose the most straightforward since I just send a pointer to these files and the mail-archive staff will do the rest, correct? (b) And am I correct that the single file is the best (rather than the monthly gzipped files)? And (c) that the mail-archive software will recreate threads as necessary?

In my experience, both mbox and monthly gzipped files work fine. Threads were recreated fine in both cases.

> 2. Now the harder one. From Sep 1994 (inception) to Apr 2006, the lists were hosted using L-Soft's LISTSERV software, which did not keep archives. However, I have a complete set of all traffic from that time period, but they are all in Daily Digest format, i.e., with a "Table of Contents" in the front and several emails afterwards. I have MOST (but not all) of these available as MIME digests with each message in a different MIME multipart segment. I also have ALL of them available as a non-MIME digest, with a fixed text separator (like a row of ) between messages. I would propose to send these as an mbox format of digest files but each email in each digest message would still need to be separated out. (a) Can mail-archive do this digest parsing, or do I need to find or write a script to do this myself? (b) If mail-archive can do it, do you have a preference for MIME vs. non-MIME digest? (c) And if MIME, can you handle the few for which I only have non-MIME digests?

I can't help with this one; skipping.

> 3. Must these old archives be processed by mail-archive in chronological order in order for threading to work properly? Or if I provide older ones later are they automatically inserted and rethreaded appropriately?

They will be inserted and rethreaded appropriately.

> 4. The FAQ says that only the latest 3000 messages are kept live and the rest are in "cold storage" and can be retrieved only via matching searches. Some questions on this: (a) Are the "latest" based on when they were processed by the archive software (e.g., old archives processed recently would count as new)? Or (b) Are the "latest" based on the Date: field of the post in question?

(b) is correct.

> (c) Is there any way to get ALL messages live on mail-archive rather than only 3000 so they can be browsed for by month and year for example (e.g., by requesting an exception considering the list will be mothballed and won't be expanding, or by paying a donation/fee)? There is about 100 MB total of data per list, I'd guess.

Mail-archive support, when I asked about the 3000, was willing to talk about exceptions. I didn't pursue it so I can't say more in detail.

> (d) If not, is there a way I can get a full mirror download that includes the "cold storage" older archives (after processing by mail-archive's scripts) for me to install live on my own server (which may or may not disappear) while mail-archive still keeps it more permanently in their live+cold way?

I have to imagine that some tricks with wget or httrack should be able to do this, despite the "cold storage" aspect, but I'm only guessing. I would pursue the first question with mail-archive support and see what happens.
