Thanks Jay. Regarding the question of disk placement, that probably depends on how hard you're pushing the machines. In our case a single collector handles 60 routers. That's been working just fine, but the collector is doing plenty of time-sensitive I/O, so it might make sense to leave the disk there. Also, our nodes will be connected with Gb Ethernet.
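For a quick back-of-envelope (illustrative Python; the 16 GB/day and 2 GB-per-report figures come from my original post quoted below):

    # Average rate implied by 16 GB/day of compressed flow data.
    avg_mbps = 16 * 1024**3 * 8 / 86400 / 1e6
    print(f"{avg_mbps:.1f} Mb/s average collection rate")   # ~1.6 Mb/s

    # Reading one 2 GB compressed file within an hour for a report pass.
    report_mbps = 2 * 1024**3 * 8 / 3600 / 1e6
    print(f"{report_mbps:.1f} Mb/s per report pass")        # ~4.8 Mb/s

Even several concurrent report passes stay well under Gb Ethernet, so the wire itself shouldn't be the bottleneck; it's the collector's disk contention and NFS server overhead I'd worry about.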
The bigger question is how to manage the cluster's workload. We considered the static approach you suggested below, where each node is assigned a dedicated list of routers. The concern is scalability, since router loads change over time. Process-migration and load-balancing solutions are nicer because they distribute load dynamically, but I suspect they're unreasonable for I/O-heavy work. We may go with the static approach after all. As far as you know, are all clustering solutions inherently I/O-inefficient, or are some of them OK? Did you look at OpenSSI?

By the way, we're running reports on daily aggregations, so a single report can chew through 2 GB of compressed data from one router. Sometimes these processes run out of memory, and sometimes they take four hours. I'm probably pushing this system harder than I should, but the results are usually pretty good; I just need more processing power.

On that note, does anybody know about the inner workings of flow-report? My general understanding is that it loads a huge hashtable (or other data structure) into memory and then basically dumps out quick stats, so it's not very CPU-intensive. Is that accurate?

Ari
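P.S. For what it's worth, here's the toy model of flow-report I have in mind. This is purely illustrative Python, not flow-tools' actual code; the record fields and sample values are made up:

    from collections import defaultdict

    # One pass over the flows, one big hash table keyed on the report's
    # key fields, then dump sorted totals.
    records = [
        # (srcaddr, dstaddr, octets, packets) -- stand-ins for real flows
        ("10.1.1.1", "10.2.2.2", 1500, 3),
        ("10.1.1.1", "10.2.2.2", 9000, 12),
        ("10.3.3.3", "10.2.2.2", 400, 1),
    ]

    totals = defaultdict(lambda: [0, 0, 0])   # octets, packets, flows
    for src, dst, octets, packets in records:
        entry = totals[(src, dst)]
        entry[0] += octets
        entry[1] += packets
        entry[2] += 1

    # Dump per-key totals, biggest byte counts first.
    for (src, dst), (octets, packets, flows) in sorted(
            totals.items(), key=lambda kv: -kv[1][0]):
        print(src, dst, octets, packets, flows)

If that's roughly right, memory scales with the number of distinct keys rather than with total flow volume, which would square with our out-of-memory runs on high-cardinality days.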
-----Original Message-----
From: Jay A. Kreibich [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 27, 2004 4:12 PM
To: Ari Leichtberg
Cc: [EMAIL PROTECTED]
Subject: Re: [Flow-tools] Flow-tools on linux cluster (Mosix)

On Tue, Oct 26, 2004 at 05:09:04PM -0400, Ari Leichtberg scratched on the wall:

> I'm wondering if anybody has any experience running flow-tools on a
> Linux cluster. I have a dedicated Sun box running flow-capture,
> collecting from around 60 Ciscos campus wide, totaling over 16 GB of
> level-6 compressed data per day. The flows are written to the
> collector's local storage, and I have enough space to hold around 12
> days' worth of data.

We're only collecting off our exit routers and do ~14 GB per day, although that's uncompressed.

> My plan is to have a separate Linux cluster, NFS-mounted to the
> collector's storage, which runs daily and hourly flow-reports,
> flow-dscans, and other analyses. It's not uncommon for a router to
> collect over 2 GB per day, so the flow-report processes get pretty
> I/O- and memory-heavy.

Consider this: which requires more disk I/O -- the collector, which has an hour to make one pass over one hour's worth of data, or the analyzers, which have one hour to run all of your reports? Reports often require multiple passes and ideally don't take the whole hour. With that in mind, if you are going to write everything to disk and then do post-analysis, put the disk on the analyzers, not the collector: they do far more I/O and will benefit much more from direct disk attachment. You definitely don't want the collector wasting resources serving NFS traffic!

In the bigger picture, one of the problems with clusters for flow analysis is the sheer volume of data involved. Most people run fairly simple reports, so the reports tend to be I/O-bound (or compression-bound) on any modern machine. That's the worst case for clustering, since clusters inherently add I/O inefficiencies; for I/O-bound work a cluster can actually make everything run slower, although compression helps a little there.

> Has anybody ever tried this with Mosix, or any other ideas for a
> clustering solution?

Because of these problems, one of the things we're looking at is a SAN infrastructure with a clustered file system, so that multiple machines can access the same fibre-channel-attached filesystem at the same time. The collector writes, and everything else reads what it needs. More or less what you're talking about, but using fibre channel rather than NFS. Once you remove most of the file-transport problems, how you split up or distribute your computation is up to you. We're looking at static assignments rather than load-balanced clustering, mostly because we aren't interested in process-migration-style solutions.

The other option is to pre-distribute the data. Have one or more collectors with big disks act as your main collectors and archivers, and configure them to filter and split their data streams out to multiple machines in the cluster. Have each cluster node keep only an hour or two of data, or better yet, do the reports in real time so the nodes need almost no storage at all. The sustained data rates are not exciting; even with spikes you're only looking at something like 45 Mb/s. If you back-channel multicast the data across the cluster, that's no problem. If you pre-filter it so each cluster node only services a few routers, it's even easier. There are lots of games to play here, but the big thing to remember is that collection data rates are almost always smaller than the required analysis data rates.

I should also say that we use a custom collector and tool set, so I have no idea how easy or hard it would be to do some of these things with the public tools.

 -j

-- 
Jay A. Kreibich             | Comm. Technologies, R&D
[EMAIL PROTECTED]           | Campus IT & Edu. Svcs.
<http://www.uiuc.edu/~jak>  | University of Illinois at U/C
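For concreteness, here is one way the static "each cluster node services a few routers" split Jay describes could look. This is an illustrative Python sketch, not anything from flow-tools; the node names, sample exporter addresses, and the node_for_router function are all made up:

    import zlib

    NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster nodes

    def node_for_router(exporter_ip: str) -> str:
        # zlib.crc32 is deterministic across runs (unlike Python's built-in
        # str hash), so a given router always lands on the same node.
        return NODES[zlib.crc32(exporter_ip.encode()) % len(NODES)]

    # Example: fan incoming flow streams out by their exporter address.
    for exporter in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
        print(exporter, "->", node_for_router(exporter))

A plain hash split like this has exactly the scalability concern raised above: router volumes drift over time, so an equal-count assignment can become an unequal-load one. Weighting the assignment by each router's observed daily volume (a simple bin-packing pass, re-run occasionally) would keep the nodes balanced without resorting to full process migration.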
