Re: [Mailman-Users] distributing Mailman between 2 systems

2006-06-04 Thread Jim Popovitch
Wow! Thank you, Richard.  Apologies for the top post, but I didn't quite 
know where to jump in on your comments, and I didn't want to truncate 
any of them either.  Thank you for the detailed info.

My issue is that the Mailman host is quite busy processing in/outbound 
email and providing Mailman web access to listinfo, admin, options, etc. 
(via Apache).  When rundig (via nightly_htdig) runs, it consumes almost 
all the host CPU resources (5min loadavg hovers at 11% for 5+ hours). 
This has an adverse effect on response times for Apache and the MTA due 
to file system issues (multiple drives and partitions, but in the end 
still just IDE).

My issue isn't so much that I need to move pipermail or the archives to 
a different host but rather just the indexing of them.  I like your 
mailman+htdig integration for pw protected lists, so splitting Mailman 
up from the archives would probably break that or at least make it 
difficult.

Additionally I would like to run nightly_htdig in a continuous loop on 
the second host, constantly cycling through lists and re-digging each as 
needed.
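
A minimal sketch of such a loop, assuming the Mailman 2.1 layout in which 
archives/private holds one HTML directory per list next to a 
<listname>.mbox directory.  The stamp directory, the pause, and the 
dig_one callable (whatever actually re-indexes a single list) are 
hypothetical and not taken from nightly_htdig itself.

```python
import os
import time


def newest_mtime(path):
    """Return the latest modification time of any file under path (0.0 if none)."""
    latest = 0.0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                latest = max(latest, os.path.getmtime(os.path.join(dirpath, name)))
            except OSError:
                pass  # file vanished between listing and stat
    return latest


def lists_needing_redig(archive_root, stamp_root):
    """Yield names of lists whose HTML archive changed since their last dig."""
    for name in sorted(os.listdir(archive_root)):
        list_dir = os.path.join(archive_root, name)
        # Skip the raw <listname>.mbox directories; only the HTML is indexed.
        if name.endswith(".mbox") or not os.path.isdir(list_dir):
            continue
        stamp = os.path.join(stamp_root, name + ".stamp")
        last_dig = os.path.getmtime(stamp) if os.path.exists(stamp) else 0.0
        if newest_mtime(list_dir) > last_dig:
            yield name


def mark_dug(stamp_root, name, when):
    """Record that the list was indexed as of the timestamp `when`."""
    stamp = os.path.join(stamp_root, name + ".stamp")
    open(stamp, "a").close()
    os.utime(stamp, (when, when))


def dig_forever(archive_root, stamp_root, dig_one, pause=60):
    """Cycle through the lists forever, re-digging only those that changed."""
    while True:
        for name in lists_needing_redig(archive_root, stamp_root):
            started = time.time()  # stamp with the start time so that posts
            dig_one(name)          # arriving mid-dig trigger another pass
            mark_dug(stamp_root, name, started)
        time.sleep(pause)
```

Stamping with the start time rather than the finish time is deliberate: 
a message archived while a dig is running then causes one more dig 
instead of being missed.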

I know how to tweak htdig.conf so that indexing on one host can return 
htsearch results for another.  I know that NFS could help out, but I'm 
forever concerned about NFS security and network issues because I lease 
external hosts from multiple providers and there will always be WAN 
issues between them.

What I am thinking of doing (based on your and Mark's comments) is to 
rsync archive/private/* to a second host and run nightly_htdig on that 
system.  I can then use Apache config to redirect the htsearch queries 
to the second host, and have the results returned point back at the 
primary Mailman host.  I *think* this will work, but need to test.
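
The rsync step might be sketched as below.  The host name and paths 
are hypothetical placeholders; only standard rsync options (-a, -z, 
--delete, -e) are used, and the command is built separately from being 
run so it can be inspected first.

```python
import subprocess


def rsync_command(src_dir, dest_host, dest_dir, delete=True):
    """Build the rsync argument list that mirrors src_dir to dest_host:dest_dir."""
    cmd = ["rsync", "-az"]
    if delete:
        # Remove files on the mirror that no longer exist on the source.
        cmd.append("--delete")
    # A trailing slash on the source copies its contents, not the directory itself.
    source = src_dir.rstrip("/") + "/"
    cmd += ["-e", "ssh", source, "%s:%s" % (dest_host, dest_dir)]
    return cmd


def mirror_archives(src_dir, dest_host, dest_dir):
    """Run the mirror and return rsync's exit status (0 on success)."""
    return subprocess.call(rsync_command(src_dir, dest_host, dest_dir))
```

For example, rsync_command("/var/lib/mailman/archives/private", 
"search.example.com", "/srv/archives/private") yields a command that 
could be run from cron on the primary host before each dig.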

-Jim P.


Re: [Mailman-Users] distributing Mailman between 2 systems

2006-06-04 Thread Richard Barrett

On 4 Jun 2006, at 06:41, Jim Popovitch wrote:

> I would like to move the pipermail archives to a different host than the
> main Mailman system.  Specifically for better archive searching
> performance with htdig.  Is this possible?
>
> -Jim P.

How you approach this depends on what you perceive your problem to be  
and what you mean by "better archive searching performance with htdig".

Like Google and other internet search engines, htdig splits the task  
into two parts: index construction and index search.

Index construction does the heavy lifting of scanning the source  
material and squirreling away in its indices a lot of detail of which  
indexed source files contain what. This can be quite a slow process  
especially when a large body of material has to be initially scanned  
and indexed. It is probably best treated as a batch process run at  
times of light load from other work on the system doing it. Depending  
on the material concerned and how you configure htdig this indexing  
may produce very large indices which can come close to being in the  
same order of magnitude of storage size as the raw source material.  
Many lists with large indices can generate demand for much CPU and  
potentially much storage during indexing (and after in the case of  
storage).
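
One way to act on the "times of light load" advice is to check the 
load average before starting an indexing run.  This is only a sketch: 
the threshold is an arbitrary illustrative value, and os.getloadavg() 
is available on Unix-like systems only.

```python
import os


def light_load(threshold=1.0, loadavg=None):
    """Return True when the 5-minute load average is below threshold.

    loadavg may be passed in as a (1min, 5min, 15min) tuple; by default
    the current values are read from the operating system.
    """
    if loadavg is None:
        loadavg = os.getloadavg()
    five_minute = loadavg[1]
    return five_minute < threshold
```

A cron wrapper could call light_load() and simply skip or postpone the 
run when it returns False.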

On the other hand index searching to produce a list of source files  
that match the search criteria induces a much lower load on the  
system concerned; after all it is just looking up words in pre-built  
search indices.

The problem with this approach is that search indices are never  
completely up-to-the-minute; but consider how often Google's  
crawler visits your web site. While updating search indices when new  
documents are added to the archive material should be less  
load-inducing than the original construction of the indices, configuring  
cron jobs so that htdig rebuilds its indices too frequently is not  
advisable. The updating of indices can still involve a lot of IO as  
htdig walks a lot of files to determine which of the existing  
material has been changed as well as what has been added as new.

So first you should define what problem you are trying to solve with  
htdig before deciding what to do next.

You could plan on having your HTML mail archives integrated with  
Mailman e.g. using pipermail or a pipermail/MHonArc synthesis for the  
archive pages and having htdig integrated with that; I know you are  
aware of the patches available to support this approach and that  
there are some benefits as regards archive privacy being maintained  
and such. I will deal with this integrated approach first. You could  
deploy multiple processors to address the issues by using NFS to  
share the mailman archive storage space between them.

Parenthetically, I successfully ran Mailman on x86 Linux boxes  
entirely out of NFS mounted storage on enterprise level servers for a  
number of years, primarily to provide for rapid-ish switchover to a  
backup server in the case of primary Mailman server hardware failure,  
which happened on several occasions. At the time I found that I had  
to limit NFS read/write transfer sizes on the Linux boxes to avoid  
problems in the Linux kernel locking associated with the NFS  
implementation then available. Nowadays I am running Mailman on  
Solaris 10, which has no such problems, but I guess the Linux NFS  
implementation has also improved in the meantime.

The simplest split you could consider is moving the htdig  
installation and workload to a separate machine. The Mailman/htdig  
integration patches support this configuration in conjunction with  
NFS sharing of the Mailman archives files if you look at the  
documentation here:

http://www.openinfo.co.uk/mm/patches/444884/install.html#rconfig

This configuration leaves one machine running Mailman and being  
responsible for providing access to archive material while a second  
machine does htdig's index maintenance. Mailman also "subcontracts"  
each index search requested by a user to the htdig machine but the  
URLs returned in the search results mean that the Mailman machine  
delivers the material from the archives, not the htdig machine.
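
The idea of result URLs pointing back at the Mailman machine can be 
illustrated as follows.  This is not how the integration patch does 
it; it is a hypothetical Python 3 helper, with made-up host names, 
showing a result URL being re-targeted at the host that serves the 
archives while path and query are left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def point_at_archive_host(url, archive_host):
    """Return url with its host replaced by archive_host, all else unchanged."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme, archive_host, parts.path, parts.query, parts.fragment)
    )
```

Only the host changes, so any access control enforced by the Mailman 
machine for private archives still applies when the user follows the 
link.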

The question you asked was how to move the pipermail archives to  
another system. Using NFS again, it might be possible to run some of  
Mailman's qrunners on one machine and others (for example, the  
archive runner) on a second to partition things. I have never had  
the time or energy to set up systems to explore the issues of such a  
configuration, but somebody else may have pushed the envelope this  
way. As an aside, I would avoid like the plague NFS cross-mounting of  
volumes between machines in any configuration.

If you decide none of the above is appropriate to what you want to  
achieve and the way you want to achieve it then you may be asking the  
wrong question in my view. Maybe you should be deploying a mailing list  
archiving system independent of Mailman

Re: [Mailman-Users] distributing Mailman between 2 systems

2006-06-04 Thread Mark Sapiro
Jim Popovitch wrote:

>I would like to move the pipermail archives to a different host than the 
>main Mailman system.  Specifically for better archive searching 
>performance with htdig.  Is this possible?


You have basically two issues. One is getting the archived messages
onto the other host, and the other is accessing them.

To get messages archived on (call it) archives.example.com, you could
have Mailman installed and running there. Then, for each list, you
could have the active version, say [EMAIL PROTECTED] be your
normal list with the addition of [EMAIL PROTECTED] as a member
and archiving turned off. Then you set up [EMAIL PROTECTED]
with [EMAIL PROTECTED] as the only member (with delivery
and password reminders disabled, to accept the relayed posts) and with
the archive settings you want.

There are probably other ways to do this using external archiver
settings on lists.example.com or by having archives on both hosts and
using rsync to update the archives host, but the above seems simple
enough.

Assuming you're doing the first method, you probably want to set
ARCHIVE_TO_MBOX = -1 in mm_cfg.py on lists.example.com so the archiver
doesn't do any archiving at all. Then you can put your own pages in
archives/private//index.html on lists.example.com. These
pages could redirect to the corresponding archive on
archives.example.com. Also, for public archives, you could set
PUBLIC_ARCHIVE_URL in mm_cfg.py on lists.example.com to point to
archives.example.com.
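
As a rough sketch, the two settings could look like this in mm_cfg.py 
on lists.example.com.  The %(listname)s placeholder follows the 
PUBLIC_ARCHIVE_URL default in Mailman 2.1's Defaults.py as far as I 
recall, so check it against your own Defaults.py.  The redirect_page 
helper and the /pipermail base URL are hypothetical and would live in 
a separate script, not in mm_cfg.py.

```python
# --- fragment for mm_cfg.py on lists.example.com (Mailman 2.1) ---
# -1 tells the archiver to do no archiving at all on this host.
ARCHIVE_TO_MBOX = -1
# Public archive links then point at the archives host instead.
PUBLIC_ARCHIVE_URL = 'http://archives.example.com/pipermail/%(listname)s'


# --- hypothetical helper, kept outside mm_cfg.py ---
def redirect_page(listname, archive_base='http://archives.example.com/pipermail'):
    """Return a small HTML page that forwards the browser to the moved archive."""
    target = '%s/%s/' % (archive_base, listname)
    return ('<html><head>'
            '<meta http-equiv="refresh" content="0; url=%s">'
            '</head><body><a href="%s">Archives have moved</a></body></html>'
            % (target, target))
```

The output of redirect_page() for a given list could be saved as that 
list's index.html under archives/private on lists.example.com; for 
private lists, archive_base would need to be whatever URL the archives 
host actually serves them from.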

I haven't thought about this a lot. There may be other (better) ways.
See

for some additional info.

-- 
Mark Sapiro <[EMAIL PROTECTED]>   The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
