Yep Phil, if you recall, we worked together on the Ford project; I was new
to the team at the time. Since then I have very nearly rewritten Xagent too
(although likely differently than you did, since our internal requirements
aren't the same as yours were). The worker I am talking about is
essentially the same as your "dealer", but with a twist: it handles
distribution list expansion as well. It is no fun when a huge wad of small
notes arrives, each of which has to be copied to hundreds of users ... you
can use up the 9999 spool file limit very quickly, especially because the
workers the dealer hands the mail to have to do full MIME decoding for a
sizable fraction of the incoming mail. Those decoding workers can handle
about 10 emails per second, but if mail is arriving at 30-40 per second and
thousands of notes land all at once, you can see the problem.

I have to dole out the traffic to the decoding workers in such a way that
none of them exceeds its limit (and that none of them sits idle while
another is going gangbusters). To me that means distributing the incoming
load as evenly as possible among them, and the random selection approach
seemed the easiest (sure, I could do round-robin too, but there's more
housekeeping for that). I also need to add logic to the dealer to start
deferring mail headed to large distribution lists (in favor of mail going
to smaller numbers of recipients) and retrying the deferred mail when
things get quieter (and of course, since I'd prefer to use the reader to
hold the deferred mail, I need to worry about what to do if the dealer
itself begins to approach the 9999 limit).

Anyway, I really didn't want to get into all this. I was just looking for
suggestions on how to do the random selection, and I have that now (and it
works nicely), so there's not much point in elaborating further. This could
all have been done better, I'm sure, but we've got 20 years of legacy
requirements to keep satisfying and nobody is interested in funding any
major redesigns/rewrites, so we live with what we've got. When changes are
needed, transparency is key, and that tends to limit how adventurous we
can get.
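For the curious, the selection scheme amounts to something like the sketch
below (hypothetical Python, not our actual code; the worker names and the
safety margin are made up for illustration): pick a decoding worker
uniformly at random from those with headroom under the spool limit, and
defer when nobody has headroom.

```python
import random

# Illustrative sketch only -- not the production code. SPOOL_LIMIT is the
# real 9999 spool file ceiling; SAFETY_MARGIN is an assumed cushion so we
# stop feeding a worker well before it hits the hard limit.
SPOOL_LIMIT = 9999
SAFETY_MARGIN = 500

def pick_worker(spool_counts):
    """spool_counts: dict mapping worker name -> current spool file count.
    Returns a worker chosen uniformly at random from those with headroom,
    or None if every worker is too close to the limit (caller defers,
    e.g. parks the mail on the reader)."""
    eligible = [w for w, n in spool_counts.items()
                if n < SPOOL_LIMIT - SAFETY_MARGIN]
    if not eligible:
        return None  # everyone is loaded: defer and retry later
    return random.choice(eligible)

# Example: WORKER3 is near the limit, so it is never chosen.
counts = {"WORKER1": 1200, "WORKER2": 8700, "WORKER3": 9600}
assert pick_worker(counts) in ("WORKER1", "WORKER2")
```

The nice property versus plain round-robin is that there's no position to
track across restarts; the only state you need is the current spool counts.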
--
bc


On Thu, Jan 6, 2011 at 12:25 AM, Phil Smith III <li...@akphs.com> wrote:

> Bob Cronin wrote:
> >There are 9 physical systems involved planet-wide. The number of CPU's per
> >system varies from 2 to 10 depending on which system we're talking about.
> >Some of the systems are old 2084's. Some are 2094's. Each system has up to
> >4 inbound SMTP's which feed up to 4 intermediary workers. The intermediary
> >creates a single copy of each mail file for each RCPT TO and sends it on
> >to another worker for delivery (either back out via SMTP or converted to
> >mainframe mail format and delivered over RSCS). Some of the RCPT TO
> >addresses can be distribution lists, which are expanded by the
> >intermediary in the copy-making process. The distribution lists can be
> >large (and worse, sometimes we get floods of distlist-bound mail which
> >have more than once resulted in the 9999 spool file limit being exceeded
> >on the workers that handle the ultimate delivery). The last time that
> >happened we had 4 delivery workers on most of the larger systems. We've
> >since doubled that on each system. Combined, these systems are now
> >handling between 30 and 40 million emails a month. Last year at this time
> >that was more like 15 million, so we're experiencing rapid growth. The
> >per-minute arrival rate is typically between 750 and 1500 worldwide, with
> >much (much) higher rates during distlist-bound-mail floods. I am at
> >present working to devise a strategy for dealing with the floods by
> >slowing down the distlist expansion when the number of spool files in the
> >delivery worker's spool is getting dangerously high. A part of that is
> >intelligently distributing the load amongst the workers so as to minimize
> >the chances of any one of them hitting the 9999 limit. Hence my
> >question ...
>
> Boy, this all sounds familiar -- the mail gateway I did at Ford in 1999 had
> all of these issues. We were handling about 250,000 notes/day, so a bit
> under a quarter of what you're seeing--but on 9672-class hardware. The I/O
> is faster, too, but the hardware is much more than 4x as fast, so it's sorta
> kinda comparable. (In fact, the more I read this, the more I think it's
> probably XAGENT, which is what we started with at Ford but almost totally
> rewrote.)
>
> My advice (worth what you pay for it) is not to worry about trying to
> utilize the machines evenly, but rather to worry about the ones that are
> "getting behind". We used a machine called the "dealer", who did nothing
> except "deal" the incoming notes to the worker machines, based on queue
> length. We used QUERY FILES, and had a threshold -- "if the next machine [we
> were nominally round-robining] has fewer than 250 files in its queue, give
> it the next note" (can't swear it was 250, but it was some number). This
> avoided some of the issues you're seeing, and also meant that a 'stuck'
> machine, or one that happened to get hit with a bunch of big notes that took
> a while to process, wouldn't keep getting worse. I think there was
> elasticity such that if they were ALL at 250, it would shrug and start
> dealing evenly.
>
> It also meant that if a particular worker machine was being fixed for some
> reason, it would still get notes in its queue (up to the threshold), so once
> it was fixed it would be useful instantly. We were pretty happy with how
> this all worked.
>
> So the flow was like this (each line below shows one hop, since Visio this
> ain't):
>
> Internet->SMTP     - several SMTPs, load balanced via TCP/IP
>
> SMTP->dealer       - one dealer; it wasn't doing much, just WAKEUP, QUERY
> FILES, and TRANSFER, so it was able to keep up
>
> dealer->translator - several translator machines, each of which would
> disassemble each note, decide where the ultimate disposition was supposed to
> be (see below), reformat it, and send it on
>
> So it was many-to-one-to-many. And I don't remember for sure but I suspect
> this was the original XAGENT architecture, so maybe this is exactly what
> you're talking about.
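(The threshold-based dealing described above boils down to roughly the
sketch below -- hypothetical Python, nothing like whatever the real
implementation was; the 250-file threshold and all names are assumptions
taken from the description: nominally round-robin, skip any translator
already at the threshold, and if everyone is at the threshold, fall back to
dealing evenly.)

```python
from itertools import cycle

# Assumed threshold: give a translator the next note only if its queue
# (cf. QUERY FILES) holds fewer than this many files.
THRESHOLD = 250

def make_dealer(workers):
    rr = cycle(workers)  # nominal round-robin order
    def deal(queue_lengths):
        """queue_lengths: dict worker -> current files in its queue.
        Returns the worker that should receive the next note."""
        for _ in range(len(workers)):
            w = next(rr)
            if queue_lengths[w] < THRESHOLD:
                return w
        # all at or above the threshold: shrug and deal evenly anyway
        return next(rr)
    return deal

# Example: XLATE1 and XLATE3 are at/over threshold, so XLATE2 gets the note.
deal = make_dealer(["XLATE1", "XLATE2", "XLATE3"])
assert deal({"XLATE1": 300, "XLATE2": 12, "XLATE3": 280}) == "XLATE2"
```

A stuck or slow machine naturally stops receiving work once its queue
reaches the threshold, yet still has a small backlog waiting the moment it
recovers.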
>
> I have (semi-)fond memories of fixing a problem at 3AM from home, working
> over dialup, as Europe woke up and the mail volume started to pick up -- and
> getting close to that 9999 limit, but starting things up again before it got
> hit! And we definitely learned that most of the email was jokes and pr0n.
>
> This solution replaced a hardware-based gateway that was not going to
> survive the Y2K transition. I remember the first time I fired up a
> translator with the full configuration -- 400,000 records of "address A goes
> to PROFS", "address B goes to Exchange", "address C goes to some UNIX
> system", etc. -- it took 90 seconds to initialize. I went to the project
> manager, hat in hand, and told him. He laughed uproariously, then told me
> that the hardware gateway, with far less functionality, took 45 minutes to
> start up, so we could deal with 90 seconds.
>
> Calendar entries (vCal and iCal) -- those were even more fun.
>
> Good times!
>
> ...phsiii
>
