On 4 Jun 2006, at 06:41, Jim Popovitch wrote:

> I would like to move the pipermail archives to a different host than the
> main Mailman system. Specifically for better archive searching
> performance with htdig. Is this possible?
>
> -Jim P.
How you approach this depends on what you perceive your problem to be and what you mean by "better archive searching performance with htdig".

Like Google and other internet search engines, htdig splits the task into two parts: index construction and index search. Index construction does the heavy lifting of scanning the source material and squirreling away in its indices a lot of detail about which indexed source files contain what. This can be quite a slow process, especially when a large body of material has to be initially scanned and indexed. It is probably best treated as a batch process run at times of light load from other work on the system doing it. Depending on the material concerned and how you configure htdig, this indexing may produce very large indices, which can approach the same order of magnitude in storage size as the raw source material. Lists with large archives can therefore demand a lot of CPU during indexing and a lot of storage both during and after it.

Index searching, on the other hand, which produces a list of source files matching the search criteria, induces a much lower load on the system concerned; after all, it is just looking up words in pre-built search indices. The drawback of this approach is that search indices are never completely up-to-the-minute; but consider how often Google's crawler visits your web site. While updating the search indices when new documents are added to the archive should induce less load than the original construction of the indices, configuring cron jobs so that htdig rebuilds its indices too frequently is not advisable. Updating the indices can still involve a lot of I/O, as htdig walks a lot of files to determine which of the existing material has changed as well as what has been added.

So, before deciding what to do next, you should first define what problem you are trying to solve with regard to using htdig.
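To make the batch-indexing point concrete, here is a sketch of a crontab entry that rebuilds the indices nightly during light load. The `rundig` path, schedule, and log location are illustrative assumptions; adjust them to wherever your htdig installation actually lives.

```shell
# Rebuild the htdig indices once a night at 03:15, when system load
# is typically light. rundig is htdig's standard wrapper script that
# runs the digger and then rebuilds the search databases.
# Paths below are placeholders for your installation.
15 3 * * *   /usr/local/bin/rundig >> /var/log/htdig-rundig.log 2>&1
```

Running this daily (rather than, say, every few minutes) reflects the advice above: index updates are cheaper than the initial build, but still I/O-heavy enough that you do not want them firing constantly.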
You could plan on having your HTML mail archives integrated with Mailman, e.g. using pipermail or a pipermail/MHonArc synthesis for the archive pages, with htdig integrated with that; I know you are aware of the patches available to support this approach, and that there are some benefits as regards maintaining archive privacy and such. I will deal with this integrated approach first.

You could deploy multiple processors to address the issues by using NFS to share the Mailman archive storage space between them. Parenthetically, I successfully ran Mailman on x86 Linux boxes entirely out of NFS-mounted storage on enterprise-level servers for a number of years, primarily to provide for rapid-ish switchover to a backup server in the case of primary Mailman server hardware failure, which happened on several occasions. At the time I found that I had to limit NFS read/write transfer sizes on the Linux boxes to avoid problems in the Linux kernel locking associated with the NFS implementation then available. Nowadays I am running Mailman on Solaris 10, which has no such problems, but I guess Linux's NFS implementation has also improved in the meantime.

The simplest split you could consider is moving the htdig installation and workload to a separate machine. The Mailman/htdig integration patches support this configuration, in conjunction with NFS sharing of the Mailman archive files; see the documentation here:

http://www.openinfo.co.uk/mm/patches/444884/install.html#rconfig

This configuration leaves one machine running Mailman and responsible for providing access to archive material, while a second machine does htdig's index maintenance. Mailman also "subcontracts" each index search requested by a user to the htdig machine, but the URLs returned in the search results mean that the Mailman machine delivers the material from the archives, not the htdig machine.

The question you asked was how to move the pipermail archives to another system.
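For what it's worth, the transfer-size limiting mentioned above can be expressed as NFS mount options on the Linux client, e.g. in /etc/fstab. The server name, export path, mount point, and the 8 KB sizes below are illustrative assumptions for this sketch, not a recommendation for your setup:

```shell
# /etc/fstab entry NFS-mounting the Mailman tree from a file server.
# "nfsserver" and both paths are placeholders. rsize/wsize cap the
# NFS read/write transfer sizes (here to 8 KB) to work around the
# client-side kernel locking problems described in the text;
# hard,intr makes the mount retry on server outage but stay killable.
nfsserver:/export/mailman  /usr/local/mailman  nfs  rw,hard,intr,rsize=8192,wsize=8192  0  0
```

On a modern Linux NFS client these caps should not be necessary, per the remark above about implementations having improved; they are shown only to illustrate the workaround.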
Using NFS again, it might be possible to run some of Mailman's qrunners on one machine and others (for example, the archive runner) on a second to partition things, but I have never had the time or energy to set up systems to explore the issues of such a configuration; somebody else may have pushed the envelope this way. As an aside, I would avoid like the plague NFS cross-mounting of volumes between machines in any configuration.

If you decide none of the above is appropriate to what you want to achieve and the way you want to achieve it, then you may be asking the wrong question, in my view. Maybe you should deploy a mailing list archiving system independent of Mailman, and you could do worse than look at the model set by http://www.mail-archive.com as a starting point.

-----------------------------------------------------------------------
Richard Barrett                              http://www.openinfo.co.uk

------------------------------------------------------
Mailman-Users mailing list
Mailman-Users@python.org
http://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: http://mail.python.org/mailman/options/mailman-users/archive%40jab.org
Security Policy: http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq01.027.htp