Re: [Gossip] Porting digested new list archives to mail-archive

2015-05-04 Thread Jeff Breidenbach
A more detailed response was sent over private mail, but the
short answers are (1) yes, as per FAQ (2) thanks for the
suggestion, will add it to the list of things to think about.
___
Gossip mailing list
https://www.mail-archive.com/[email protected]
https://www.mail-archive.com/cgi-bin/mailman/options/gossip

Re: [Gossip] Porting digested new list archives to mail-archive

2015-05-03 Thread Shahrukh Merchant
OK, I'm ready with the archives to upload, with 2 questions before I do 
so (actually, only the first question is related to the upload):


1. Shall I create the mail-archive entry first and have it start 
auto-archiving new mail and wait until that's up and running before 
sending you the old archives (to make sure that there are no gaps, for 
instance, and so that you already have a working repository to put them in)?


2. (Not directly related to my imminent import) In a previous question I 
had asked about having an "index" that allowed you to jump directly to a 
month/year rather than scrolling page by page via "Previous" and "Next" 
to which you replied



> No active plans around the monthly index feature request. I
> can only guarantee such a thing would not happen any time
> soon. Same answer for modifying the thread index. I do
> appreciate the feedback and interest and will keep in mind.


which is understandable from the standpoint of it being a new feature. 
However, I noticed that the "Previous/Next" links go to a page that has 
fixed links like 
https://www.mail-archive.com/[email protected]/mail2.html, .../mail3.html, 
.../mail4.html, etc. (and ../thrdN.html for the thread view). So it 
would seem a relatively simple matter to have a fixed footer on each 
index page (whether by date or thread):


Page 1 2 3 4 5 6 7 8 9 10 ... going as far as the number of pages that 
exist. It would at least allow someone to one-click fast-forward to 
estimate a date (the current option is to realize that that's how it's 
structured by seeing the URL and then manually guessing at a number and 
editing the URL).
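
As a rough illustration of the footer idea, the page links could be generated from the URL pattern described above. This is only a sketch: the function name and the example archive address are made up, and since the thread does not give the file name of the first index page, numbering starts at page 2.

```python
def index_page_urls(archive_base, last_page, view="mail"):
    """Return URLs for index pages 2..last_page of one archive.

    view="mail" gives the by-date pages (mailN.html) and view="thrd"
    gives the by-thread pages (thrdN.html), following the pattern
    quoted in the message above.
    """
    if view not in ("mail", "thrd"):
        raise ValueError("view must be 'mail' or 'thrd'")
    base = archive_base.rstrip("/")
    return [f"{base}/{view}{n}.html" for n in range(2, last_page + 1)]


# Hypothetical archive address, used only for illustration.
for url in index_page_urls("https://www.mail-archive.com/list@example.org", 4):
    print(url)
```

The same list could be rendered as "Page 2 3 4 ..." links in a footer, for either the date or the thread view.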


Shahrukh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-18 Thread Shahrukh Merchant

> This is probably going to work fine, and let's go ahead and give it a
> try. If it doesn't work, we'll discuss, figure it out, and try again.


Yeah, I'm probably over-analyzing but getting to this point was 
definitely useful since I now have more confidence what my scripts need 
to do and what they don't.


Will need several more days before I have things in shape to send you, 
at which point I'll send an email directly to support with the 
pertinent links to the files and a simple README with anything I think 
may help you figure out what is what.


Thanks for all the help and to Earl and Matt for chipping in too.

Shahrukh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-18 Thread Jeff Breidenbach
Yes, you can safely leave out To, Message-id, and Received.
Consequences are what you'd expect, like the inability to do a
message-id search and find that particular message.

You are correct. Posting address is manually assigned during the bulk
import process, and automatically determined from headers for regular
inbound mail. Think of it as if Mail Archive was trying to put every
message in a folder, where the folder name is the posting address. The
folder name is indexed, so is available for search. You are exactly
correct about the 'l' parameter.

This is probably going to work fine, and let's go ahead and give it a
try. If it doesn't work, we'll discuss, figure it out, and try again.

Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-18 Thread Shahrukh Merchant

Jeff,

Thanks again--more things are clear now. But your response raises more 
questions in my mind as well. Please bear with me, we're almost there I 
think.


Keeping in mind that I am splitting Digests into individual messages, I 
have to fake whatever headers are not already there within the 
individual messages. In the case of my digests, the existing headers are 
only:


Date:
From:
Subject:

To this I add a ^From_ line just before the date line to make it mbox 
format (this is taken from the Digest header and copied in front of each 
email in the digest, so it's identical for several emails).


An example of this ^From_ line is this:
From - Mon Nov 28 11:00:05 2005
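
The assembly step described here can be sketched as follows. This is not the awk script mentioned in the thread, just a minimal Python illustration; the function name and the sample address are hypothetical, and the body is assumed to already use Unix newlines.

```python
def to_mbox_entry(from_line, headers, body):
    """Format one message as an mbox entry.

    from_line: the "From " separator line copied from the digest.
    headers:   list of (name, value) pairs, e.g. Date, From, Subject.
    body:      message text with Unix newlines.
    """
    if not from_line.startswith("From "):
        raise ValueError("separator line must start with 'From '")
    lines = [from_line]
    lines += [f"{name}: {value}" for name, value in headers]
    lines.append("")  # a blank line ends the header block
    for line in body.rstrip("\n").split("\n"):
        # Escape body lines that would otherwise look like a separator.
        lines.append(">" + line if line.startswith("From ") else line)
    # End with a blank line so the next entry starts cleanly.
    return "\n".join(lines) + "\n\n"


entry = to_mbox_entry(
    "From - Mon Nov 28 11:00:05 2005",
    [("Date", "Mon, 28 Nov 2005 17:55:14 +1100"),
     ("From", "someone@example.org"),   # made-up sender
     ("Subject", "Example")],
    "Hello\nFrom here on it is body text\n",
)
print(entry, end="")
```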

Now, with this background, here are my further questions:

1. I thought I would have to add a "To:" line to my "faked" headers, but 
you are saying that it is never used. Is it OK if the To: line is 
completely absent?



> The only things indexed for search are: message-id,


2. Will it choke if there is no Message-ID? I know this is used in 
threading but in an earlier email you said that in the absence of 
Message-ID it would thread on Subject, but just want to make sure it 
won't reject the email without a Message-ID field.



> ... subject  ... sender name (extracted from From: header)


No issue on these.


> ... date (usually extracted from the Received: header)


3. Hmm, again, there is no Received header. Will it properly take the 
date from the Date: field in that case and not choke on the absence of 
any Received headers? The date field format (from an actual example) is:

Date:Mon, 28 Nov 2005 17:55:14 +1100
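
For what it's worth, a Date: value in this form is a standard email date and can be checked locally before upload. The snippet below uses Python's standard library only; it says nothing about how Mail Archive itself reads the header.

```python
from email.utils import parsedate_to_datetime

raw_header = "Date:Mon, 28 Nov 2005 17:55:14 +1100"

# Split the header name from its value, then parse the value.
name, _, value = raw_header.partition(":")
parsed = parsedate_to_datetime(value.strip())

print(name, "->", parsed.isoformat())
```

A script that splits digests could run each extracted Date: value through the same call and flag any message where parsing raises an error.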


> ... posting address (for example, [email protected] ... Every message is
> sorted and organized according to posting address ... but the To: header
> is never indexed for search, never used during import, and there is no
> benefit for you to adjust it.

4. This is where I'm most confused now. Where *is* the posting address 
extracted from if not from the To: header? Is it an internal field in 
your archive message database that is (a) predetermined manually in an 
import, (b) mapped to a fixed internally stored name for new incoming 
email (based on headers including To:) and nothing else? In your earlier 
response, where I had asked about the varying forms of To: addresses in 
my old archives that needed to be imported (e.g., [email protected], 
[email protected], [email protected]) in terms of confusing 
search (since I incorrectly imagined that the To: line would be looked 
for in the search), you had replied:



> Search will have no concept of alternative list names. There is no reasonable 
> way to overcome this.


but now you say search never looks at the To: lines and they aren't used 
in imports either. So in light of your latest response I don't 
understand now why search would have an issue of "alternative list 
names"--they are alternative To: lines but the same list, and the 
variations exist only in archives--new email would have a consistent To: 
line reflecting the current posting address.



> A merged archive will have the same posting
> address for every message, with no memory about what life was like
> before the merge.


OK, this is consistent with the "To:" line never being used for search. 
So the "l=" parameter in the search would always have to be the new list 
name following a merge, correct? That shouldn't be an issue.


Shahrukh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-17 Thread Jeff Breidenbach
The only things indexed for search are: message-id, subject, date (usually
extracted from the Received: header), sender name (extracted
from From: header), posting address (for example, [email protected]),
archival message number, and message body. Every message is sorted and
organized according to posting address.

The To: header is examined when sorting regular inbound mail and is a
factor when deciding where it belongs. But the To: header is never indexed
for search, never used during import, and there is no benefit for you to
adjust it. A merged archive will have the same posting address for every
message, with no memory about what life was like before the merge.

Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-17 Thread Shahrukh Merchant

> 1. Yes, we override list name on import.


OK, so they are threaded and paginated independently of what's in the 
"To:" line.



> 2. Search will have no concept of alternative list names. There is no
> reasonable way to overcome this.


Hmm, I don't understand this, given your answer to (1) above. If the 
primary list name is [email protected], then by your convention, it seems 
that going to www.mail-archive.com/[email protected]/ would bring up the 
"home page" for that list archive. There is no way to restrict a search 
just to "pages under this page"? I.e., that the only way to get a 
particular list is to match explicitly the contents of the "To:" field 
for that list?



> 3. Why not use the tool that Earl mentioned?


Because it seems with nmh I would have to do it manually on a digest by 
digest basis. So with hundreds of digests and thousands of created 
individual files which I would then need to remerge, I really need a 
script that processes an mbox of hundreds of Digests in one fell swoop, 
creating a slightly smaller compliant mbox of all the individual emails, 
which is the script that I have just finished writing. Besides, if I 
have to do additional processing, e.g., to replace old To: lines with 
the new list name to avoid the search problem, I need a script anyway 
(albeit a simpler one). (I'll make this script available once I'm done, 
so it will have been completely debugged and well tested, since it may 
be useful to others in the future.)



> 4. We always merge into the new list name and set up an HTTP redirect so
> that the old URLs are not broken. Merges are done via a manual request
> to support staff.


OK, that's clear. But then it would have the same search issue, would it 
not? I.e., that while the merged list would appear as one in the summary 
and threads pages, a searcher would have to search for one or the other 
(or use an OR operator in the search, which seems like another way to 
overcome this, i.e., create a search with a hidden field containing the 
OR of all variants).


Shahrukh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-17 Thread Jeff Breidenbach
1. Yes, we override list name on import.
2. Search will have no concept of alternative list names. There is no
reasonable way to overcome this.
3. Why not use the tool that Earl mentioned?
4. We always merge into the new list name and set up an HTTP redirect so
that the old URLs are not broken. Merges are done via a manual request to
support staff.

Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-17 Thread Shahrukh Merchant
Still working on processing the old email digests to convert them to 
individual emails in mbox format for import.


Meanwhile, though, I thought of a new issue which has to do with 
identifying the list name from the email headers, given that the list 
name (and consequently the email address in the To: line) varies.


The list is currently identified in the To: line as [email protected] and 
the last 10 years' archives should be consistent (that's the easy part).


Before that, though, it was hosted with Listserv software on a different 
host and was identified as [email protected] which in some very old 
digests ca. 1994 show up as tango-l%[email protected] (the 
uga.edu part could vary depending on what BITNET-to-Internet gateway was 
used). In addition, the list was called TANGO for a short while before 
it was changed to TANGO-L. In summary, we have the following, which are 
all the same list:


[email protected]
[email protected]
tango-l%[email protected] (with possibly gateways other than 
uga.edu)

[email protected]
tango%[email protected] (with possibly gateways other than 
uga.edu)


So, with this background, my questions are:

1. Since this is for a manual archive import (as opposed to incoming 
email that has to be filtered intelligently), you would have all the 
files and could presumably just force it to go to the same database. Is 
this true?


2. Even if you did that, though, would searches work reliably? I would 
want the first of the above addresses ([email protected]) to be the 
primary address by which the list would be identified, so for search 
queries, people would always identify the list as [email protected], as 
that's what it's been for the last 10 years. But would ALL the above 
variants then be included in the search? I.e., is there an internal tag 
created with the archive that identifies it with just one email address 
for queries, regardless of what's on the "To:" field of an individual 
message? No one is going to be searching for this list with anything 
other than "Tango-L" (most likely) or "[email protected]", notwithstanding 
the alternate forms.


3. Any other things to think about regarding this issue? I have resigned 
myself to writing scripts for splitting some old digests into 
mbox-format individual emails for import, so if there is something else 
I need to do on each message to address this new issue, I could incorporate 
it into the script. I could just force all the "To:" headers to 
[email protected], obliterating all the previous forms, which clearly 
would solve the problem, but do I need to? (It is extra work, and some 
are already in mbox format that I would otherwise not need to touch, but 
would to make this change.)


4. If, in the future, the list were to move to a different host and have 
an address like [email protected] (a "permanent" change) how 
could I ensure that it continued to go to the same archive? And would 
there be a single list tag in a search query that would get posts from 
that archive (and only that archive) regardless of whether the posts 
were pre-move or post-move? In this hypothetical situation indeed people 
may search with the mit.edu domain OR tango-L.com domain, and ideally it 
would call up the same combined archive.


(I did read the FAQ article on "My list splits into multiple archives" 
which touches on this issue, but it seems relevant mostly to incoming 
mail filtering rather than old archive processing.)


Shahrukh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-14 Thread Jeff Breidenbach
Statute of limitations is typically 3 kilomessages on a normal
non-import list, but should (I think) be unlimited on bulk import.
Conversion to unix newlines is required and is manual; doesn't
matter who does it.

Still prefer to do whole import at once especially if tricky; less
labor, also less likely to break URLs if it takes multiple attempts
to get it right.  But we can accommodate two stages. Imports are
done on weekends only.
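
If the newline conversion is done on the sender's side, it can be as small as the sketch below. The function names are made up for illustration; the files are read and written in binary mode so nothing else in the mbox is altered.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Replace Windows CR/LF line endings with Unix LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path, dst_path):
    """Write a copy of src_path with Unix newlines to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_newlines(data))


print(to_unix_newlines(b"Subject: test\r\n\r\nbody\r\n"))
```

Reading the whole file at once is fine at the sizes mentioned in this thread (tens of megabytes per mbox).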

Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-14 Thread Shahrukh Merchant

On 4/14/2015 9:25 PM, Jeff Breidenbach wrote:

> * I recommend doing the import all at once, rather than in
> stages. Not for technical reasons, it just saves manual labor.


OK, I may do it in 2 stages, since 1/2 the archives are in mbox format 
that can be imported instantly. The other half are in Digests that I 
need to create the scripts and process them through a few thousand 
emails, which could take some weeks. But no more than 2 stages, if 
that's OK ...?



> * Happy to make a tarball of the HTML after the import. ...
> Either way, you would be totally on your own from there;


Sounds good and more than reasonable.


> * Threading is done by MHonArc and discussed here.
> http://www.mail-archive.com/faq.html#threading


OK, that was helpful. Basically that References: and In-Reply-To: are 
used first, but in their absence, Subject: matching is used. That works.


It didn't, however, answer my "statute of limitations" question, i.e., 
in the absence of Message-ID: clues, and relying on Subject: only, if 
there was a time span beyond which it would not link two likely 
different threads that happened to have the same subject (e.g., someone 
used the same subject but a year later). Not a big deal anyway--just 
curiosity.
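
The fallback order summarized above (reply headers first, Subject: matching otherwise) can be pictured with the sketch below. It is emphatically not MHonArc's code, only an illustration of the idea; the function names and the subject normalization rule are assumptions.

```python
import re

# Leading "Re:" / "Fw:" / "Fwd:" markers, possibly repeated.
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes and case so subjects can be compared."""
    return _PREFIX.sub("", subject).strip().lower()


def link_target(headers):
    """Say what a message would be attached to.

    headers: dict mapping header names to values for one message.
    Returns ("id", message_id) when reply headers exist, otherwise
    ("subject", normalized_subject).
    """
    parent = (headers.get("In-Reply-To") or "").strip()
    if parent:
        return ("id", parent)
    references = (headers.get("References") or "").split()
    if references:
        return ("id", references[-1])  # last entry is the direct parent
    return ("subject", normalize_subject(headers.get("Subject", "")))


print(link_target({"Subject": "Re: RE: Tango shoes"}))
```

With only Date/From/Subject left in the split digests, every message would take the subject branch, which is why the time-span question matters.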


On 4/14/2015 11:27 AM, Earl Hood wrote:

> For MIME digest messages, MUAs like nmh are able to extract such
> messages out into individual files, which can be subsequently packed
> into mbox format.


I have hundreds of digests each with perhaps a dozen messages to 
process, so it needs to be a script that basically creates "mbox of all 
messages" from "mbox of Digests" in one (or a few) fell swoop. I have a 
pseudo-code awk program written that should do that, that I need to 
further code into real awk, but it should work, just a question of time 
to write and debug it, and then to process all the files through it and 
verify that it worked OK. I did find a sed script at 
http://sed.sourceforge.net/grabbag/scripts/splitdig.sed that allegedly 
does this, but while it's instructional, it doesn't seem robust enough, 
seemingly relying on any line with all dashes (as few as one!) as being 
a demarcation, whereas that could easily occur in the text (preceding a 
signature, for example, or as a separator). Besides, sed is a 
"write-only" language (you can program it as you go if you know it well 
enough, but good luck going back and figuring out what a sed program 
actually does and how to modify or tweak it!). :-)
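
One way to make the split less fragile than "any line of dashes" is to require both an exact separator line and a header line right after it. The sketch below assumes a separator of exactly 30 hyphens and messages that begin with Date:, From:, or Subject:; both are assumptions to adjust for the real digests, and the table of contents at the top of each digest is not handled here.

```python
import re

SEPARATOR = re.compile(r"^-{30}$")           # assumed: exactly 30 hyphens
HEADER = re.compile(r"^(Date|From|Subject):")  # assumed first header names


def split_digest_body(text):
    """Split the message section of a plain-text digest into messages."""
    lines = text.split("\n")
    messages, current = [], []
    i = 0
    while i < len(lines):
        if SEPARATOR.match(lines[i]):
            # Look past blank lines for the start of the next message.
            j = i + 1
            while j < len(lines) and lines[j].strip() == "":
                j += 1
            if j < len(lines) and HEADER.match(lines[j]):
                messages.append("\n".join(current).strip("\n"))
                current = []
                i = j
                continue
        # Not a real boundary: keep the line as part of the message.
        current.append(lines[i])
        i += 1
    messages.append("\n".join(current).strip("\n"))
    return [m for m in messages if m]
```

A signature delimiter such as "--" or a dashed line inside the text stays in the message body, because it either has the wrong length or is not followed by a header.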


One more question: Some of the "mbox" files that I propose to submit are 
in fact Thunderbird mail folders. As far as I can tell, they conform 
entirely with mbox format (leading ^From_ line, escaped >From if leading 
in body), but does anyone know of any "gotchas" with this?
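
A quick local sanity check of such a file is possible with Python's standard mailbox module, shown below as a sketch. The function name and the choice of required headers are assumptions; it only counts messages and flags those missing basic headers, which may help spot a folder that does not parse as expected.

```python
import mailbox


def summarize_mbox(path, required=("Date", "From", "Subject")):
    """Return (total messages, messages missing any required header)."""
    box = mailbox.mbox(path, create=False)
    try:
        total = 0
        missing = 0
        for message in box:
            total += 1
            if any(message[name] is None for name in required):
                missing += 1
        return total, missing
    finally:
        box.close()
```

Comparing the total against the message count shown by the mail client is a cheap way to confirm that the separator lines are being recognized.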


Also, my files are all Windows based, but I assume CR/LF vs. LF is 
handled automatically and trivially, correct? (Or should I do the 
conversion to Unix format myself?)


Shahrukh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-14 Thread Jeff Breidenbach
* I recommend doing the import all at once, rather than in
stages. Not for technical reasons, it just saves manual labor.

* Happy to make a tarball of the HTML after the import.
It will look like basic MHonArc output and cosmetically differ
quite a bit from what is served, because there is significant
cosmetic alteration during serving time.  If you choose to
scrape instead, I don't have a favorite tool to recommend.
Either way, you would be totally on your own from there;
we didn't design for this and would not help with harder
stuff like broken links, local search, etc.

* No active plans around the monthly index feature request. I
can only guarantee such a thing would not happen any time
soon. Same answer for modifying the thread index. I do
appreciate the feedback and interest and will keep in mind.

* Threading is done by MHonArc and discussed here.
http://www.mail-archive.com/faq.html#threading

Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-14 Thread Earl Hood
On Mon, Apr 13, 2015 at 11:19 AM, Matt Morgan wrote:

>> 2. Now the harder one. From Sep 1994 (inception) to Apr 2006, the lists
>> were hosted using L-Soft's LISTSERV software, which did not keep archives.
>> However, I have a complete set of all traffic from that time period, but
>> they are all in Daily Digest format, i.e., with a "Table of Contents" in the
>> front and several emails afterwards. I have MOST (but not all) of these
>> available as MIME digests with each message in a different MIME multipart
>> segment. I also have ALL of them available as a non-MIME digest, with a
>> fixed text separator (like a row of ) between messages. I would propose
>> to send these as an mbox format of digest files but each email in each
>> digest message would still need to be separated out. (a) Can mail-archive do
>> this digest parsing, or do I need to find or write a script to do this
>> myself? (b) If mail-archive can do it, do you have a preference for MIME vs.
>> non-MIME digest? (c) And if MIME, can you handle the few for which I only
>> have non-MIME digests?
>
> I can't help with this one; skipping.

For MIME digest messages, MUAs like nmh are able to extract such
messages out into individual files, which can be subsequently packed
into mbox format.

If you have mhonarc installed, you can use the mha-decode with the
-dcd-digest option to have all digest messages extracted into separate
files.
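
For readers without nmh or mhonarc at hand, the same extraction can be sketched with Python's standard email package. This is an alternative illustration, not how either of the tools named above works; the function names are made up, and messages attached inside an extracted message are deliberately left untouched.

```python
import email


def _collect(part):
    """Gather enclosed messages without descending into them."""
    if part.get_content_type() == "message/rfc822":
        return [part.get_payload(0)]
    found = []
    if part.is_multipart():
        for sub in part.get_payload():
            found.extend(_collect(sub))
    return found


def extract_digest_messages(raw_bytes):
    """Return the individual messages enclosed in one MIME digest."""
    return _collect(email.message_from_bytes(raw_bytes))
```

Each returned object can be serialized with its as_bytes() method and written out behind a "From " separator line to build the multi-message mbox.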

--ewh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-13 Thread Shahrukh Merchant
Thanks Matt and Jeff for your answers--they were very helpful. So, as I 
understand it:


- Sending link to mailman raw archives will take care of all posts from 
2006 to present (for which the mailman archives exist).


- Listserv digest format to individual email mbox format conversion (for 
the older 1994-2006 articles) I'll have to do on my own. :-( Well, I'm 
surprised it hasn't come up before but I guess I'll make the script 
available for others after I do it (more likely in awk and ksh than perl 
since that's what I know better, though I don't have a Unix system these 
days so who knows ...?).


- Sending archives in segments and out of chronological order is not a 
problem: threading and the user interface will not be confused by this 
but will simply be recalculated automatically (or manually by support each 
time they import old archives?), e.g., if I send the mailman mbox 
first (since it's easy) and the earlier digests later (since I have to 
work on them).


- You're willing to make an exception (or try ...) to have all posts 
indexed for my list, especially given that it's static, rather than just 
the last 3000 (but question on this below).


- I can mirror the archive on my own site if I want (again, question on 
this below).


Please let me know if any of the above is not correct.

A few further questions:

1. Does mirroring to my own site involve installing cgi-bin or .php 
scripts and so on? In which case are there instructions for that? Or 
just a wget-like static copy? It seems like the search and email 
obfuscation features at least would require scripts, no?


2. I noticed though that the user interface requires clicking one page 
at a time to go back +/- 100 messages or so at a time and there's no 
other navigation method. This seems unwieldy--I can see people getting 
tired of clicking more than 30 times to go back 3000 messages. Is it not 
possible to have a year/month hyperlinked index instead like


2015 - Jan | Feb | Mar | Apr | May (final)
2014 - Jan | Feb | ... | Dec
...
1995 ...
1994 - Apr (inception) | May | Jun | ... | Dec

on the side or top or bottom for direct access to the time frame of 
interest (plus of course Prev Page | Next Page to "scroll" within the 
month range in question)? Of course I can do that by manually setting 
hyperlinks on my own backup version of the archive, especially given 
that it's static, but it would be great to have the mail-archive version 
do it directly.


3. Is threading based on subject-line matching only (and if so, what if 
the same subject happened to appear say 3 years later in a completely 
unrelated thread)? Or is it based on In-Reply-To:  type 
parsing and linking? (The Digests have these "extraneous" headers 
stripped out, leaving only To/From/Date/Subject, so I need to know if 
splitting the Digests into their individual posts will break threading.)


4. In thread view, is there a way to have the date of the post appear 
next to each item (e.g., after the author name)? E.g.,


Re: Search returning 404   Jeff Breidenbach   2013-05-31

since the date provides a useful additional context in the threads view.

Regards,

Shahrukh



Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-13 Thread Jeff Breidenbach
First, it is very common and super easy to directly import from a
mailman (pipermail)  archive. If the pipermail archive is publicly
online, just supply the URL to the support team.

The Mail Archive does not split digests back into individual messages.
That's way too scary. If a digest is presented during import, it
will be archived as a single large message. It might be possible to
split a digest into a multi-message mbox, but I have no experience
or advice to give. Consult your favorite internet search engine, the
mailman-users community, or perhaps someone here can give
guidance.

I'd like to clarify some terminology in the FAQ. Cold storage means
sitting on a disk in a closet somewhere, completely disconnected
from the internet. The only thing in cold storage is some raw mail
from years gone past. Anything we serve is in a processed format
and very much online. Everything imported is very much live and
online.

The 3000 limit is simply how far back index pages go; people get
tired clicking after a while and we also limit them for performance
reasons. So on one hand The Mail Archive doesn't really offer
what you are looking for, which is monthly indexes going back
forever. But you can get something similar using the search engine.
For extra fun, click on the expand button following the link. We
might make such links more prominent in the user interface
in the future, if there is demand. I've heard mixed feedback so
far.

http://www.mail-archive.com/search?q=date%3A200308*&l=gossip%40mail-archive.com
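
Links of this shape are easy to generate for any month, which gives a rough stand-in for a monthly index. The sketch below rebuilds the example URL above from its parts; the function name is made up, and only the `q` and `l` parameters seen in that URL are used.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.mail-archive.com/search"


def month_search_url(list_address, year, month):
    """Build a search link for all messages dated in one month."""
    query = {"q": f"date:{year:04d}{month:02d}*", "l": list_address}
    # Keep "*" literal so the link matches the form shown above.
    return f"{SEARCH_URL}?{urlencode(query, safe='*')}"


print(month_search_url("gossip@mail-archive.com", 2003, 8))
```

Looping over years and months yields a set of links that could be placed on a separate page as a hand-made index.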

While customizations are possible, we tend to forget about them after
enough time has passed (say 10 years) and then accidentally break
them with some code change. So I wouldn't generally encourage
that approach. If it did happen, no money would be involved. An
inactive list is more likely to 'maintain' this type of exception over
time than an active one. All right, fine, we can try it, on a best effort
basis; mention to the support team during import.

Regarding export, there's no problem for you to copy or mirror your
archive's HTML. Some people do this as an extra backup, which is
great. I haven't yet heard of folks doing this because of unhappiness
with the user interface.

Re: [Gossip] Porting digested new list archives to mail-archive

2015-04-13 Thread Matt Morgan

On 04/13/2015 12:57 AM, Shahrukh Merchant wrote:
> I have two discussion lists on the Argentine Tango that are probably 
> going to be suspended going forward owing to lack of activity in the 
> face of many competing technologies in recent years, but that have a 
> treasure of information dating back from 1994. 


Sounds very cool.

> I would like to get these onto mail-archive but there are some 
> peculiarities of the existing archives that I have some questions on.
>
> Here are the questions:
>
> 1. First the easy one. From April 2006 to the present, the lists were 
> hosted using mailman, so I have the complete raw mailman archives that 
> I've downloaded. They are in one big mbox-format file (about 50 MB). 
> (a) This is I suppose the most straightforward since I just send a 
> pointer to these files and the mail-archive staff will do the rest, 
> correct? (b) And am I correct that the single file is the best (rather 
> than the monthly gzipped files)? And (c) that the mail-archive software 
> will recreate threads as necessary?


In my experience, both mbox and monthly gzipped files work fine. Threads 
were recreated fine in both cases.


> 2. Now the harder one. From Sep 1994 (inception) to Apr 2006, the 
> lists were hosted using L-Soft's LISTSERV software, which did not keep 
> archives. However, I have a complete set of all traffic from that time 
> period, but they are all in Daily Digest format, i.e., with a "Table 
> of Contents" in the front and several emails afterwards. I have MOST 
> (but not all) of these available as MIME digests with each message in 
> a different MIME multipart segment. I also have ALL of them available 
> as a non-MIME digest, with a fixed text separator (like a row of ) 
> between messages. I would propose to send these as an mbox format of 
> digest files but each email in each digest message would still need to 
> be separated out. (a) Can mail-archive do this digest parsing, or do I 
> need to find or write a script to do this myself? (b) If mail-archive 
> can do it, do you have a preference for MIME vs. non-MIME digest? (c) 
> And if MIME, can you handle the few for which I only have non-MIME 
> digests?


I can't help with this one; skipping.

> 3. Must these old archives be processed by mail-archive in 
> chronological order in order for threading to work properly? Or if I 
> provide older ones later are they automatically inserted and 
> rethreaded appropriately?


They will be inserted and rethreaded appropriately.

> 4. The FAQ says that only the latest 3000 messages are kept live and 
> the rest are in "cold storage" and can be retrieved only via matching 
> searches. Some questions on this: (a) Are the "latest" based on when 
> they were processed by the archive software (e.g., old archives 
> processed recently would count as new)? Or (b) Are the "latest" based 
> on the Date: field of the post in question? 


(b) is correct.

> (c) Is there any way to get ALL messages live on mail-archive rather 
> than only 3000 so they can be browsed for by month and year for 
> example (e.g., by requesting an exception considering the list will be 
> mothballed and won't be expanding, or by paying a donation/fee)? There 
> is about 100 MB total of data per list, I'd guess.


Mail-archive support, when I asked about the 3000, was willing to talk 
about exceptions. I didn't pursue it so I can't say more in detail.


> (d) If not, is there a way I can get a full mirror download that 
> include the "cold storage" older archives (after processing by 
> mail-archive's scripts) for me to install live on my own server (which 
> may or may not disappear) while mail-archive still keeps it more 
> permanently in their live+cold way?


I have to imagine that some tricks with wget or httrack should be able 
to do this, despite the "cold storage" aspect, but I'm only guessing. I 
would pursue the first question with mail-archive support and see what 
happens.


