> 
> On Jul 10, 2016, at 1:06 AM, Johanna Amann <joha...@icir.org> wrote:
> 
> First - wouldn't this be a topic for bro-dev instead of bro-blue? I don't
> really see a reason to keep this private.

Indeed.. Moving this to bro-dev as there aren't any more internal logs or 
addresses.

To catch everyone up, I noticed some performance issues with how scan.bro and 
sumstats interact.
scan.bro creates a large number of unique sumstats keys, and the mechanism that 
the manager uses to fetch them from the workers is not very efficient.  I 
implemented some ideas to improve things and then realized that I had basically 
re-implemented how it used to work in 2013.

> In any case... Seth might remember this better, but as far as I remember,
> we had some huge, quite difficult to debug Problems at bigger sites (I
> think especially at Indiana), when running Bro with the old code that used
> batching. I think I remember something about this causing _huge_ memory
> spikes in those circumstances and that the best way around that was this
> switch.

I could see a batch size of 50 being a problem, but if batching was the 
problem, simply setting the batch size to 1 would be better than what we have 
now.

If worker1 has key1 and worker2 has key2, right now what happens is:

manager sends out get_a_key to get a key from each worker
worker1 will send_a_key for key1
worker2 will send_a_key for key2

At this point stats_keys[uid] contains [key1, key2]

Now, this is where everything goes wrong:

manager sends out cluster_get_result for key1
worker1 sends out cluster_send_result reply for key1
worker2 sends out cluster_send_result empty reply

manager sends out cluster_get_result for key2
worker1 sends out cluster_send_result empty reply
worker2 sends out cluster_send_result reply for key2

With 56 workers, you end up with

manager sends out cluster_get_result for key1
worker1 sends out cluster_send_result reply for key1
worker2 sends out cluster_send_result empty reply
worker3 sends out cluster_send_result empty reply
worker4 sends out cluster_send_result empty reply
...
worker56 sends out cluster_send_result empty reply

manager sends out cluster_get_result for key2
worker1 sends out cluster_send_result empty reply
worker2 sends out cluster_send_result reply for key2
worker3 sends out cluster_send_result empty reply
worker4 sends out cluster_send_result empty reply
...
worker56 sends out cluster_send_result empty reply

For 56 workers to send the SAME key up to the manager, you get 1 get_a_key, 56 
send_a_key, 1 cluster_get_result, and 56 cluster_send_result events.  This is 
the best-case scenario for the current system.  It works OK if your keys are 
things like country codes or MIME types that do not grow unbounded.  There is a 
little overhead, but not much.

However, for 56 workers to send 56 different keys up to the manager, you get 1 
get_a_key, 56 send_a_key, 56 cluster_get_result, and 3136 cluster_send_result 
events.  This is the worst-case scenario, and it is exactly what scan.bro 
triggers.
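A quick back-of-the-envelope sketch of the arithmetic (in Python for illustration, not Bro script; the names just mirror the events above) shows how the reply count scales as workers × keys:

```python
# Event counts for the current sumstats collection mechanism, assuming
# W workers and K distinct keys: every worker answers every
# cluster_get_result, so replies grow as W * K.
def current_scheme_events(workers: int, keys: int) -> int:
    get_a_key = 1                          # one broadcast from the manager
    send_a_key = workers                   # each worker offers a key
    cluster_get_result = keys              # one request per key
    cluster_send_result = workers * keys   # every worker replies, mostly empty
    return get_a_key + send_a_key + cluster_get_result + cluster_send_result

print(current_scheme_events(56, 1))    # best case, one shared key: 114
print(current_scheme_events(56, 56))   # worst case, 56 unique keys: 3249
```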


> You also have to be a bit careful when going back to old code (or changing
> the sumstats code) - the code has a bit of a... sad interaction with the
> Bro message cache (or whatever it is called) - if you forget to call
> copy() at all the right places that exchange messages about data in
> tables, you are not actually going to exchange data but just references,
> which can lead to stale data on the manager (and also reduce message load
> as a side effect - while leading to wrong results). I am not sure if that
> is the case here, I am just saying you have to be quite careful changing
> things :)

I saw those copy()'s.. I didn't understand them but I left them alone :-)

> And - one thing in your older email - removing the Cluster::worker_count
> == done_with[uid]  is also a bit problematic because it makes it difficult
> to check the correctness of the results. Which can become an issue with
> sumstats - sometimes single nodes reply surprisingly slowly.

Yeah.. I realized that was important.  What my code currently does is:

When the manager wants the results, it sends out a single get_some_key_data 
event.  This is similar to the old send_data event.

get_some_key_data sends one key to the manager using the existing 
cluster_send_result and does a 

    schedule 0.001 sec { SumStats::get_some_key_data(uid, ss_name, cleanup) };

to re-call itself if there is more data to be sent.  When there is no more data 
for the current ss_name, it sends a 'send_no_more_data' event up to the manager.
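For what it's worth, the drain pattern is easy to see in miniature.  This is a hypothetical Python analogue (the real code is Bro script, and the recursive call stands in for the `schedule 0.001 sec` re-invocation):

```python
# Each invocation ships one key to the manager, then re-invokes itself;
# when the table is empty it signals completion (send_no_more_data).
sent = []   # stands in for cluster_send_result events
done = []   # stands in for send_no_more_data events

def get_some_key_data(table: dict, uid: str) -> None:
    if not table:
        done.append(uid)               # no more data for this ss_name
        return
    key, result = table.popitem()
    sent.append((key, result))         # one cluster_send_result per call
    get_some_key_data(table, uid)      # in Bro: schedule 0.001 sec { ... };

worker_table = {"key1": 10, "key2": 20, "key3": 30}
get_some_key_data(worker_table, "uid-1")
```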

I moved the Cluster::worker_count check to count the send_no_more_data events.  
So rather than being done once per key, the unit of work is the entire table.  
I believe this is almost identical to the 2013 code.

Compared to the 3000+ events before, for 56 workers to send 56 different keys 
up to the manager this mechanism uses 1 get_some_key_data, 56 
cluster_send_result events, and 56 send_no_more_data events.  The 
send_no_more_data count will always be 56, so the overhead is a small constant.
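Under the same assumptions as before, the proposed mechanism's cost is additive rather than multiplicative (again illustrative Python, not Bro script):

```python
# Proposed mechanism: one get_some_key_data broadcast, one
# cluster_send_result per key, one send_no_more_data per worker.
def proposed_scheme_events(workers: int, keys: int) -> int:
    return 1 + keys + workers

print(proposed_scheme_events(56, 56))   # 113 events, vs. 3249 before
```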

It is possible that the more efficient method of transferring data up to the 
manager was what was causing the memory spikes.  I think the current code may 
appear to behave better, but it is also spending 97.5% of its time sending 
around extra events and never making any progress.

It may be that the reason it was changed to transfer one key at a time was so 
that bro would never have to build a copy of the entire sumstats table in memory 
on the manager.  Fixing that issue and the event amplification at the same time 
would be a little harder.  I know two ways to solve that problem:

* Do an n-way merge of streams of sorted keys from each worker.. implementing 
that in bro script would not be fun.

* Shard the sumstats table itself.  If each sumstats table was first bucketed 
by a hash of the key on each worker into 37 buckets, you could transfer and 
process each bucket serially which would cut the memory usage on the manager to 
1/37.  This is probably really easy to implement.  Not sure how the hash would 
be computed in bro script, but the rest is trivial.
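Here is a rough Python sketch of the bucketing idea (hypothetical; the hash function is just a stand-in for whatever Bro script would actually expose):

```python
import zlib

NUM_BUCKETS = 37

def bucket_of(key: str) -> int:
    # Stand-in hash; any stable hash shared by all workers would do,
    # so the same key lands in the same bucket on every node.
    return zlib.crc32(key.encode()) % NUM_BUCKETS

def shard(table: dict) -> dict:
    """Split a worker's sumstats table into NUM_BUCKETS sub-tables."""
    buckets = {b: {} for b in range(NUM_BUCKETS)}
    for key, val in table.items():
        buckets[bucket_of(key)][key] = val
    return buckets

# The manager would then request bucket 0 from every worker, merge it,
# free it, and move on to bucket 1 -- holding ~1/37 of the data at a time.
worker_table = {"10.0.0.%d" % i: i for i in range(200)}
buckets = shard(worker_table)
assert sum(len(b) for b in buckets.values()) == len(worker_table)
```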



Oh, and I believe I also found a small inefficiency in how 
cluster_key_intermediate_response works: recent_global_view_keys lives on the 
worker, so each worker can independently kick off a 
cluster_key_intermediate_response for the same key.  This small patch keeps 
track of recent_global_view_keys on the manager too and should cut down on 
repeated events:

+global recent_global_view_keys: table[string, Key] of count &create_expire=1min &default=0;
# Managers handle intermediate updates here.
event SumStats::cluster_key_intermediate_response(ss_name: string, key: Key)
       {
       #print fmt("MANAGER: receiving intermediate key data from %s", get_event_peer()$descr);
       #print fmt("MANAGER: requesting key data for %s", key);
+       if ( [ss_name, key] in recent_global_view_keys )
+               return;

       if ( ss_name in outstanding_global_views &&
            |outstanding_global_views[ss_name]| > max_outstanding_global_views )
@@ -451,6 +458,7 @@
               return;
               }

+       ++recent_global_view_keys[ss_name, key];
       ++outstanding_global_views[ss_name];

       local uid = unique_id("");


-- 
- Justin Azoff


_______________________________________________
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
