> On May 20, 2016, at 4:52 PM, Seth Hall <[email protected]> wrote:
>
> For the 2.5 release, we were hoping to understand why the 
> topic/seth/remove-flare fixes some issues that people have been seeing with 
> the communication code.  Perhaps even more to the point we are aiming to 
> understand why that branch fixes the problem, but Robin's branch 
> topic/robin/no-flares-2.4.1 doesn't work.
>
> The problem that we've seen will exhibit on Linux (for some reason FreeBSD 
> doesn't seem to be affected) and you will see high memory use on the child of 
> your manager process.  People will tend to notice it in two ways.
> 1. Memory exhaustion
> 2. Logs being written that are seconds to minutes old.
>
> This isn't exactly a request for anyone to do anything, but more a call for 
> anyone that would like to dig around in the core to figure out what is going 
> on here so we can get a fix merged into master.
>
> Thanks!
>  .Seth

I looked into this a while ago.  I don't think the differences between your 
branches have anything to do with flares:

$ git diff origin/topic/robin/no-flares-2.4.1 origin/topic/seth/remove-flare src/iosource/Manager.cc
diff --git a/src/iosource/Manager.cc b/src/iosource/Manager.cc
index 80fa5fe..5ad8cca 100644
--- a/src/iosource/Manager.cc
+++ b/src/iosource/Manager.cc
@@ -96,8 +96,8 @@ IOSource* Manager::FindSoonest(double* ts)
        // return it.
        int maxx = 0;

-       if ( soonest_src && (call_count % SELECT_FREQUENCY) != 0 )
-               goto finished;
+//     if ( soonest_src && (call_count % SELECT_FREQUENCY) != 0 )
+//             goto finished;

        // Select on the join of all file descriptors.
        fd_set fd_read, fd_write, fd_except;


$ git diff origin/topic/robin/no-flares-2.4.1 origin/topic/seth/remove-flare src/RemoteSerializer.cc
[snip]

-               // FIXME: Fine-tune this (timeouts, flush, etc.)
-               struct timeval small_timeout;
-               small_timeout.tv_sec = 0;
-               small_timeout.tv_usec =
-                       io->CanWrite() || io->CanRead() ? 1 : 10;
-
-#if 0
-               if ( ! io->CanWrite() )
-                       usleep(10);
-#endif
-
-               int a = select(max_fd + 1, &fd_read, &fd_write, &fd_except,
-                               &small_timeout);
+               struct timeval timeout;
+               timeout.tv_sec = 1;
+               timeout.tv_usec = 0;

-               if ( a == 0 )
-                       ++timeouts;
+               int a = select(max_fd + 1, &fd_read, &fd_write, &fd_except, &timeout);


Seth's branch removes the SELECT_FREQUENCY check and sets the serializer's 
select timeout (the old 'small_timeout') to 1 full second.  Robin's branch still 
has the SELECT_FREQUENCY check and keeps the small timeout at 1 or 10 
microseconds.  I think the two extra changes in Seth's branch combine to make 
bro spend more time in the RemoteSerializer code.
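
To make the timeout difference concrete, here is a minimal sketch (not Bro 
code; wait_readable and count_idle_wakeups are hypothetical helpers invented 
for illustration).  It assumes a single descriptor and shows how a loop with a 
microsecond-scale select timeout keeps returning with nothing to do, while a 
1 second timeout would simply block until data arrives:

```cpp
#include <cassert>
#include <sys/select.h>
#include <unistd.h>   // pipe(), write() for the usage described below

// Hypothetical helper, not part of Bro: wait until fd is readable or the
// timeout expires.  Returns select()'s result: >0 ready, 0 timeout, -1 error.
int wait_readable(int fd, long sec, long usec)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    // select() may modify both the fd_set and the timeval, so both are
    // rebuilt on every call.
    fd_set fd_read;
    FD_ZERO(&fd_read);
    FD_SET(fd, &fd_read);

    struct timeval timeout;
    timeout.tv_sec = sec;
    timeout.tv_usec = usec;

    return select(fd + 1, &fd_read, nullptr, nullptr, &timeout);
}

// Hypothetical helper: poll fd with a short timeout up to max_calls times
// and count how many calls timed out without any data being available.
int count_idle_wakeups(int fd, long usec, int max_calls)
{
    int timeouts = 0;

    for ( int i = 0; i < max_calls; ++i )
    {
        int a = wait_readable(fd, 0, usec);

        if ( a != 0 )
            break;  // ready (or error): stop polling

        ++timeouts;
    }

    return timeouts;
}
```

With the read end of an empty pipe, count_idle_wakeups(fd, 1, 50) uses up all 
50 calls on timeouts.  Once a byte has been written to the pipe, 
wait_readable(fd, 1, 0) returns immediately even though its timeout is a full 
second.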

When I was trying to figure some of this out, I believed that many of these 
constants were part of the issue: all the different places calling select with 
different timeouts and different frequencies caused bro to spend more time 
calling select than actually moving bytes around.  The only thing I ever 
really found wrong with the flare code was that repeated fires/extinguishes 
were not no-ops, and I had a small patch that improved that without changing 
anything else (attached).
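
The idea of making repeated fires/extinguishes no-ops can be sketched roughly 
as follows.  This is not the attached patch and not Bro's Flare class: 
SketchFlare is a hypothetical, single-threaded illustration that assumes the 
flare is a self-pipe whose read end is watched by select, and that remembers 
whether it is already lit so a redundant call makes no system call at all:

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical illustration only; single-threaded, minimal error handling.
class SketchFlare {
public:
    SketchFlare()
    {
        int ok = pipe(fds);
        assert(ok == 0);
        (void) ok;

        // Non-blocking, so a stray read or write can never stall the caller.
        fcntl(fds[0], F_SETFL, O_NONBLOCK);
        fcntl(fds[1], F_SETFL, O_NONBLOCK);
    }

    ~SketchFlare()
    {
        close(fds[0]);
        close(fds[1]);
    }

    SketchFlare(const SketchFlare&) = delete;
    SketchFlare& operator=(const SketchFlare&) = delete;

    // Descriptor to put into the read fd_set passed to select().
    int FD() const { return fds[0]; }

    bool Lit() const { return lit; }

    // Make FD() readable.  Returns true only if the state changed; firing
    // an already-lit flare does nothing.
    bool Fire()
    {
        if ( lit )
            return false;

        char tmp = 0;
        if ( write(fds[1], &tmp, 1) != 1 )
            return false;

        lit = true;
        return true;
    }

    // Drain the pipe.  Returns true only if the state changed; extinguishing
    // an unlit flare does nothing.
    bool Extinguish()
    {
        if ( ! lit )
            return false;

        char tmp;
        if ( read(fds[0], &tmp, 1) != 1 )
            return false;

        lit = false;
        return true;
    }

private:
    int fds[2];
    bool lit = false;
};
```

Because at most one byte is ever in the pipe, a single read is enough to 
extinguish it, and a burst of Fire() calls costs one write instead of one per 
call.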

I think Robin's branch doesn't fix the problem because I don't think the flares 
were really the issue.  I think bro started having issues because between 2.3 
and 2.4 traffic volumes increased, cluster sizes increased, and we added a ton 
of new analyzers and log files, which put even more strain on the communication 
system.


--
- Justin Azoff

Attachment: flare_fix.patch
Description: flare_fix.patch
