> 1) Parsing data/Schema creation: The Bro IDS logs have an 8 line header that contains the 'schema' for the data; each log (http/dns/etc.) will have different columns with different data types. So would I create a specific CSV reader inherited from the general one? Also I'm assuming this would need to be in Scala/Java? (I suck at both of those :)
This is a good question. What I have seen others do is run a different stream for each log type. This way you can customize the schema to the specific log type. Even without using Scala/Java, you could also use the text data source (assuming the logs are newline delimited) and write the parser for each line in Python, though there will be a performance penalty. A rough sketch of that approach is at the end of this message.

> 2) Dynamic Tailing: Do the CSV/TSV data sources support dynamic tailing and handle log rotations?

The file based sources work by tracking which files have already been processed and then scanning (optionally using glob patterns) for new files. There are two assumptions here: files are immutable once they arrive, and files always have a unique name. If files are deleted, we ignore that, so you are okay to rotate them out. The full pipeline that I have seen often involves the logs getting uploaded to something like S3. This is nice because you get atomic visibility of files that have already been rotated. So I wouldn't really call this dynamic tailing, but we do support looking for new files at some location; the second sketch below shows what that setup might look like.
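For the per-log-type approach, here is a minimal PySpark sketch, assuming newline delimited, tab separated conn.log files under a hypothetical /data/bro/conn/ directory. The column list is abbreviated and the exact fields depend on your Bro configuration. It uses the built in split function rather than a Python UDF, which keeps the per-line work in the JVM:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.appName("bro-conn-stream").getOrCreate()

    # One stream per Bro log type; this one only handles conn.log.
    raw = (spark.readStream
           .format("text")
           .load("/data/bro/conn/*.log"))   # hypothetical path

    # Bro logs are tab separated; header/footer lines start with '#'.
    fields = split(col("value"), "\t")
    conn = (raw
            .where(~col("value").startswith("#"))
            .select(
                fields.getItem(0).alias("ts"),
                fields.getItem(1).alias("uid"),
                fields.getItem(2).alias("id_orig_h"),
                fields.getItem(3).cast("int").alias("id_orig_p"),
                fields.getItem(4).alias("id_resp_h"),
                fields.getItem(5).cast("int").alias("id_resp_p"),
                fields.getItem(6).alias("proto")))

    query = (conn.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()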
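And for the file discovery side, here is a sketch of a streaming CSV/TSV source pointed at a hypothetical S3 prefix where rotated dns logs land with unique names. The schema is abbreviated; in practice you would declare every column from the dns.log header:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("bro-dns-stream").getOrCreate()

    # Abbreviated schema -- list every column from the dns.log header in practice.
    dns_schema = StructType([
        StructField("ts", StringType()),
        StructField("uid", StringType()),
        StructField("id_orig_h", StringType()),
        StructField("query", StringType()),
    ])

    dns = (spark.readStream
           .schema(dns_schema)                 # streaming file sources need the schema up front
           .option("sep", "\t")
           .option("comment", "#")             # skip Bro's '#'-prefixed header/footer lines
           .option("maxFilesPerTrigger", 100)  # bound how many new files each micro-batch reads
           .csv("s3a://my-bucket/bro/dns/*.log"))   # hypothetical bucket/prefix

    query = (dns.writeStream
             .format("parquet")
             .option("path", "s3a://my-bucket/tables/dns")
             .option("checkpointLocation", "s3a://my-bucket/checkpoints/dns")
             .start())
    query.awaitTermination()

The maxFilesPerTrigger option just bounds how many newly discovered files each micro-batch picks up; files that have already been processed are never re-read, which is why the rotated-out originals can be deleted safely.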