On 03/11/17 21:05, Azoff, Justin S wrote:
> I've been thinking the same thing, but I hope it doesn't come to that.
> Ideally people will be able to scale their clusters by just increasing
> the number of data nodes without having to get into the details about
> what node is doing what.
>
> Partitioning the data analysis by task has been suggested.. i.e., one
> data node for scan detection, one data node for spam detection, one
> data node for sumstats.. I think this would be very easy to implement,
> but it doesn't do anything to help scale out those individual tasks
> once one process can no longer handle the load. You would just end up
> with something like the scan detection and spam data nodes at 20% CPU
> and the sumstats node CPU at 100%
I would keep the particular data services scalable but allow the user to specify their distribution across the data nodes. As Jon already wrote, it could look like this (I added Spam and Scan pools):

[data-1]
type = data
pools = Intel::pool

[data-2]
type = data
pools = Intel::pool, Scan::pool

[data-3]
type = data
pools = Scan::pool, Spam::pool

[data-4]
type = data
pools = Spam::pool

However, this approach likely results in confusing config files and, as Jon wrote, it's hard to define a default configuration. In the end this is an optimization problem: how to assign data services (pools) to data nodes to get the best performance (in terms of speed, memory usage, and reliability)?

I guess there are two possible approaches:

1) Let the user do the optimization, i.e. provide a way to assign data
   services to data nodes as described above.

2) Let the developer specify constraints for the distribution of data
   services across data nodes and automate the optimization. A minimal
   example would be to specify, for each data service, a minimum and
   maximum (or default) number of data nodes (e.g. Intel on 1-2 nodes
   and scan detection on all available nodes). More complex
   specifications could require that a data service not be scheduled on
   data nodes together with (particular) other services.

Another thing that might need to be considered are deep clusters. If I remember correctly, there has been some work on that in the context of Broker. For a deep cluster there might even be hierarchies of data nodes (e.g. root intel nodes managing the whole database and second-level data nodes serving as caches for worker nodes at the per-site level).

Jan

_______________________________________________
bro-dev mailing list
[email protected]
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
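P.S. To make approach 2 concrete, here is a minimal sketch (in Python, purely illustrative; the names `Service` and `assign_services` are mine, not anything in Bro or BroControl) of how a controller could turn per-service min/max node constraints into a placement, by greedily putting each service on its least-loaded data nodes:

```python
# Hypothetical sketch of constraint-driven placement (approach 2).
# Each service declares how many data nodes it needs; the controller
# assigns it to the currently least-loaded nodes.
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    min_nodes: int  # run on at least this many data nodes
    max_nodes: int  # and at most this many (len(nodes) means "all")

def assign_services(services, nodes):
    """Greedy placement: handle the most demanding services first,
    where a node's load is simply the number of services it hosts."""
    placement = {n: [] for n in nodes}
    for svc in sorted(services, key=lambda s: -s.min_nodes):
        count = min(max(svc.min_nodes, 1), svc.max_nodes, len(nodes))
        # pick the least-loaded nodes for this service
        targets = sorted(nodes, key=lambda n: len(placement[n]))[:count]
        for n in targets:
            placement[n].append(svc.name)
    return placement

nodes = ["data-1", "data-2", "data-3", "data-4"]
services = [
    Service("Intel", 1, 2),  # Intel on 1-2 nodes
    Service("Scan", 4, 4),   # scan detection on all available nodes
    Service("Spam", 2, 2),
]
placement = assign_services(services, nodes)
```

A real implementation would of course weigh actual CPU/memory load and anti-affinity constraints ("never co-locate service X with Y") rather than just counting services per node, but the shape of the problem is the same.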
