On Feb 16, 2011, at 6:54 AM, Ian Barber wrote:

> 
> On Wed, Feb 16, 2011 at 2:30 AM, Chuck Remes <[email protected]> wrote:
> So, it's trying to expand the buffer past 10MB and fails. The buffer starts 
> out at 512k (new default I set) and grows very rapidly. What kind of data is 
> 0mq putting on this internal socket that could grow to this size so quickly?
> 
> I did track down the code in this component that appears to trigger it. It's 
> publishing as fast as it can to a PUB socket. The socket is allocated with 
> its default settings (HWM, etc) so it should be able to grow to the size of 
> memory if the subscribers are too slow. This box has 12GB RAM and is nowhere 
> near its limit.
> 
> Suggestions?
> 
> cr
> 
> Hi Chuck,
> 
> I saw some similar stuff a while back due to two sockets with the same 
> identity being connected at the same time (which shouldn't be able to happen, 
> but did). Just as a thought - is there any chance something like that could 
> be happening? At least then there might be a workaround.

Interesting idea. I do set all of my identities, but each one uses a string 
prepended with a random number (0 to 999_999_999) so collisions are extremely 
unlikely. I'll audit that code to make sure it isn't happening though.


In any event, I *think* I may have resolved this particular issue. Let me 
explain in detail the entire setup, my hypothesis for what was occurring, and 
how I solved it.

Purpose:
The system is a set of distributed components forming a trading system "back 
tester" that allows simulation of trading ideas (strategies) against 
historical data. The historical data can either be tick-level data 
(hundreds/thousands of ticks per second) or bar data (summarized tick data at 
a 1 minute, 5 minute, X minute level).

To backtest a trading idea against data going back as far as 2002, the system 
must be able to run faster than "real time"; otherwise it would take 9 years 
to backtest a single strategy (i.e. 1 second of tick data would take 1 second 
of real time to process). To accomplish faster-than-real-time operation, the 
distributed components use a simulated clock signal that is broadcast amongst 
the components. The signal is incremented in lock-step with the most recently 
published tick data (which contains a timestamp) so that component A has the 
same clock time as the other components.

To make sure the clock signal advances correctly, each component must receive 
all tick timestamps and agree that they have the same one before they signal 
the clock component that it may advance the clock.
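To make the agreement step concrete, here is a minimal Python sketch of the idea (not my actual code, which lives behind 0mq sockets; the `Clock` class and `report` method are hypothetical names): the clock only advances once every component has reported the same last-seen tick timestamp.

```python
# Illustrative sketch: the clock advances only when all components
# report the same tick timestamp. In the real system these reports
# arrive over 0mq sockets; here they are plain method calls.

class Clock:
    def __init__(self, component_ids):
        self.pending = {cid: None for cid in component_ids}
        self.now = 0  # simulated time

    def report(self, component_id, tick_timestamp):
        """A component signals the timestamp of the last tick it processed."""
        self.pending[component_id] = tick_timestamp
        stamps = set(self.pending.values())
        if len(stamps) == 1 and None not in stamps:
            # All components agree: advance the simulated clock in lock-step.
            self.now = stamps.pop()
            self.pending = {cid: None for cid in self.pending}
            return True   # clock advanced
        return False      # still waiting on other components
```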

All socket handling uses the reactor pattern, so all sends/recvs are NOBLOCK 
and occur within a single thread. A component/process may have multiple 
threads, *each* with its own reactor. All cross-thread communication is 
accomplished via 0mq sockets so there aren't any user-code mutexes.
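The reactor contract matters for the bug below, so here is a bare-bones sketch of the loop shape (names are mine, not from my code). The key property: every handler shares one thread, so a handler that blocks starves every other socket registered with that reactor.

```python
# Minimal reactor sketch: each "reactor tick" gives every registered
# handler one non-blocking turn. A handler that loops for tens of
# seconds prevents all other handlers from being serviced.

class Reactor:
    def __init__(self):
        self.handlers = []   # callables; each is expected to return quickly

    def register(self, handler):
        self.handlers.append(handler)

    def run_once(self):
        # One pass of the event loop: service every handler in turn.
        for handler in self.handlers:
            handler()
```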

Problem:
One of the components is the tick broadcaster. It receives requests for a 
specific contract_id and timestamp and then begins publishing every tick of 
that contract. This component has 4 threads, each running its own reactor. 
Requests are load balanced across all 4 using the XREQ/XREP load balancing 
semantics through an intermediate QUEUE device. All pretty standard stuff.

Each reactor has a non-blocking client that subscribes to all tick data via a 
SUB socket and a subscription topic of '' (everything). It drops/closes all 
messages except for the last one read before EAGAIN and decodes it for its 
timestamp. It matches that timestamp against what it expects and either signals 
the clock component (another process) to advance or it goes back to read more 
ticks to find the appropriate timestamp. The subscriber should be faster than 
the publisher because it *only* decodes the last message whereas the publisher 
has to encode everything for transmission. Subscribers do way less work.
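The "drop everything but the last message" drain can be sketched like this. This is an analogy, not my SUB-socket code: `queue.Empty` stands in for 0mq's EAGAIN, and a `queue.Queue` stands in for the socket.

```python
# Sketch of the drain loop: read until "EAGAIN", discarding every
# message except the most recent, which the caller then decodes.
import queue

def drain_latest(sock):
    """Read until empty (EAGAIN analogue), keeping only the last message."""
    latest = None
    while True:
        try:
            latest = sock.get_nowait()
        except queue.Empty:
            return latest  # only this one gets decoded for its timestamp
```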

I have special logic for handling weekends. I deal with futures markets, so 
they close at 5pm on Friday afternoon and don't reopen again until around 4pm 
on Sunday. I need to have the clock automatically skip forward so I don't 
process all of those "dead" seconds on Friday night, Saturday and Sunday 
morning. I had a flaw in this skip logic that skipped the clock forward to 
Monday instead of Sunday. So, instead of having to encode and publish 1 
minute of new tick data on Sunday, it was publishing 24 hours' worth of tick 
data (on Monday) in one tight loop before it would allow the clock to advance 
again.
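For clarity, here is a hedged reconstruction of what the *corrected* skip logic does (the function name and exact cutoff minutes are my own; the 5pm Friday close and ~4pm Sunday reopen are from the description above):

```python
# Reconstructed weekend skip: when the simulated clock falls in the
# closed window (Friday 17:00 .. Sunday 16:00), jump to the Sunday
# reopen -- NOT to Monday, which was the original bug.
from datetime import datetime, timedelta

def skip_weekend(now):
    """Advance a simulated clock past the closed weekend window."""
    reopen_hour = 16  # ~4pm Sunday reopen
    if now.weekday() == 4 and now.hour >= 17:        # Friday after the close
        return (now + timedelta(days=2)).replace(
            hour=reopen_hour, minute=0, second=0, microsecond=0)
    if now.weekday() == 5:                            # Saturday
        return (now + timedelta(days=1)).replace(
            hour=reopen_hour, minute=0, second=0, microsecond=0)
    if now.weekday() == 6 and now.hour < reopen_hour: # Sunday, pre-reopen
        return now.replace(hour=reopen_hour, minute=0, second=0, microsecond=0)
    return now                                        # market open; no skip
```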

So the tick broadcaster was in a tight (blocking) loop to encode and publish a 
few million ticks to 4 subscribers within the same process as well as to a few 
other subscribers in other processes/components. This is exactly where it died.

Hypothesis:
As a rule, reactors aren't supposed to "block" except for very short periods of 
time to accomplish their work. Since each "reactor tick" could publish its 
usual 1 second (or 1 minute) of tick data in a few dozen milliseconds, this 
"blocking" really didn't last long. When it had to publish 24 *hours'* worth 
of tick data, it was now blocking for tens of seconds before letting the 
reactor loop recycle and allow all registered sockets to get serviced.

1. The publisher pushed too much data to its socket. HWM was set to the 
default, so it had no limit other than memory (many GBs, which monitoring 
showed were not being consumed). One of the subscribers shares the same 
reactor loop as the publisher, so it didn't get a chance to read anything off 
of its SUB socket *until the PUB was done*. The commands for it backed up and 
overflowed the mailbox.

or

2. The other 5 subscribers (3 more in process, 2 in other processes) were too 
slow in reading from the single publisher. Commands in their mailboxes backed 
up and overflowed the mailbox.

My money is on #1. It kind of feeds into Martin Sustrik's explanation about how 
mailboxes are used for signaling between I/O threads and to user threads. If 
none of them call a 0mq function for a while, the commands can back up. In my 
case, the publisher *was* calling zmq_send() but one of the receivers was on 
the *same* thread and had no opportunity to call zmq_recv() until the publisher 
was done sending millions of messages.
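The failure mode can be demonstrated with nothing but the stdlib: a socketpair's capacity is an OS buffer, not heap, so if the reading side never drains it, non-blocking writes eventually fail. This is only an analogue of 0mq's internal mailbox, not its actual code path.

```python
# Stdlib analogue of the mailbox overrun: write to one end of a
# socketpair while the other end never reads. The OS buffer fills
# and the non-blocking send fails, regardless of free heap memory.
import socket

def fill_socketpair():
    """Write to one end of a socketpair until the OS buffer is full."""
    a, b = socket.socketpair()
    a.setblocking(False)
    written = 0
    try:
        while True:
            written += a.send(b"x" * 4096)  # reader (b) never recv()s
    except BlockingIOError:
        pass  # OS buffer exhausted -- the limit is the kernel's, not the heap's
    finally:
        a.close()
        b.close()
    return written
```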

The Fix:
I fixed it by finally detecting and repairing the "weekend clock skip" bug. It 
no longer tries to publish 24 hours' worth of ticks in one shot. The *most* it 
will try to publish at once is 1 minute.
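The shape of that fix, sketched (the `publish` callback and tick dicts are placeholders for my actual encode-and-send path): cap each reactor pass at one minute of simulated time and defer the rest, so the reactor loop recycles and the co-resident SUB socket gets serviced.

```python
# Sketch of the capped publish: send at most `max_span_seconds` of
# simulated time per reactor tick, deferring the remainder so other
# sockets on the same reactor thread get a turn.

def publish_capped(ticks, publish, max_span_seconds=60):
    """Publish ticks covering at most max_span_seconds; return the rest."""
    if not ticks:
        return []
    start = ticks[0]["timestamp"]
    for i, tick in enumerate(ticks):
        if tick["timestamp"] - start >= max_span_seconds:
            return ticks[i:]   # deferred to the next reactor tick
        publish(tick)
    return []
```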

I let it run overnight with the fix and it didn't assert at all. <sigh of 
relief>

So it's not really "fixed" in the 0mq library; I just avoid the condition that 
overruns the OS buffers and causes the library to abort.

Still, this seems like a use case that 0mq should be able to handle without 
aborting. It appears to stem from its use of a socketpair, which has limits 
set by the OS. Perhaps it should replace the socketpair with an internal 
memory queue of some kind to accomplish the same task. Then the queue size 
would be limited only by available heap space.

Thank you to all of those on the list who participated in this thread or 
emailed me off-list. I appreciated your ideas. It would have taken me far 
longer to resolve this problem without your suggestions. Many, many thanks.

cr

_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
