Re: Question about 'Structured Streaming'
> 1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header
> that contains the 'schema' for the data; each log (http/dns/etc.) will
> have different columns with different data types. So would I create a
> specific CSV reader inherited from the general one? Also, I'm assuming
> this would need to be in Scala/Java? (I suck at both of those :)

This is a good question. What I have seen others do is actually run different streams for the different log types. This way you can customize the schema to the specific log type. Even without using Scala/Java, you could also use the text data source (assuming the logs are newline delimited) and then write the parser for each line in Python. There will be a performance penalty here, though.

> 2) Dynamic Tailing: Do the CSV/TSV data sources support dynamic tailing
> and handle log rotations?

The file-based sources work by tracking which files have been processed and then scanning (optionally using glob patterns) for new files. There are two assumptions here: files are immutable once they arrive, and files always have a unique name. If files are deleted, we ignore that, so you are okay to rotate them out. The full pipeline that I have seen often involves the logs getting uploaded to something like S3. This is nice because you get atomic visibility of files that have already been rotated. So I wouldn't really call this dynamic tailing, but we do support looking for new files at some location.
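To make the text-source idea from 1) concrete, here is roughly what I'd expect it to look like (an untested sketch; the path, the columns, and the parsing logic are placeholders for whatever http.log actually contains):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("bro-http").getOrCreate()

    # One stream per log type, so the schema can match that log type.
    # These columns are illustrative, not the real http.log layout.
    http_schema = StructType([
        StructField("ts", StringType()),
        StructField("uid", StringType()),
        StructField("host", StringType()),
        StructField("status_code", LongType()),
    ])

    def parse_http_line(line):
        # Bro header lines start with '#'; data lines are tab-separated.
        if line.startswith("#"):
            return None
        f = line.split("\t")
        return (f[0], f[1], f[2], int(f[3]))

    parse_http = udf(parse_http_line, http_schema)

    # The directory is scanned for new files as they appear; deleted
    # (rotated-out) files are ignored, per the semantics above.
    http = (spark.readStream
            .text("/logs/bro/http/")
            .select(parse_http("value").alias("rec"))
            .where(col("rec").isNotNull())
            .select("rec.*"))

The per-line round trip through the Python UDF is exactly where the performance penalty I mentioned comes from.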
Re: Question about 'Structured Streaming'
I can see your point that you don't really want an external process being used for the streaming data source. Okay, so on the CSV/TSV front, I have two follow-up questions:

1) Parsing data/Schema creation: The Bro IDS logs have an 8-line header that contains the 'schema' for the data; each log (http/dns/etc.) will have different columns with different data types. So would I create a specific CSV reader inherited from the general one? Also, I'm assuming this would need to be in Scala/Java? (I suck at both of those :)

2) Dynamic Tailing: Do the CSV/TSV data sources support dynamic tailing and handle log rotations?

Thanks, and BTW, your Spark Summit talks are really well done and informative. You're an excellent speaker.

-Brian

On Tue, Aug 8, 2017 at 2:09 PM, Michael Armbrust wrote:

> Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support
> to read Bro logs, rather than a Python library. This is likely to have
> much better performance since we can do all of the parsing on the JVM
> without having to flow it through an external Python process.
>
> On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie wrote:
>
>> Hi All,
>>
>> I've read the new information about Structured Streaming in Spark; it
>> looks super great.
>>
>> Resources that I've looked at:
>> - https://spark.apache.org/docs/latest/streaming-programming-guide.html
>> - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Python%20DataFrames%20API.html
>>
>> + YouTube videos from Spark Summit 2016/2017
>>
>> So, finally getting to my question:
>>
>> I have Python code that produces a Python generator... this is a great
>> streaming approach within Python. I've used it for network packet
>> processing and a bunch of other stuff. I'd love to simply hook up this
>> generator (which yields Python dictionaries), along with a schema
>> definition, to create an 'unbounded DataFrame' as discussed in
>> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>>
>> Possible approaches:
>> - Make a custom receiver in Python:
>>   https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - Use Kafka (this is definitely possible and good, but overkill for my
>>   use case)
>> - Send data out a socket and use socketTextStream to pull it back in
>>   (seems a bit silly to me)
>> - Other???
>>
>> Since Python generators fit so naturally into streaming pipelines, I'd
>> think it would be straightforward to 'couple' a Python generator into a
>> Spark Structured Streaming pipeline.
>>
>> I've put together a small notebook just to give a concrete example
>> (streaming Bro IDS network data):
>> https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb
>>
>> Any thoughts/suggestions/pointers are greatly appreciated.
>>
>> -Brian
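P.S. For anyone following along, here is the rough shape of the socket option from my original list above (an untested sketch; bro_log_generator() is a stand-in for my real generator, and the port is arbitrary). I used the Structured Streaming socket source here rather than the DStream socketTextStream, since this thread is about Structured Streaming:

    import json
    import socket
    import threading

    def serve_generator(gen, host="localhost", port=9999):
        # Accept a single connection and write one JSON record per line.
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()
        with conn:
            for record in gen:  # each record is a Python dict
                conn.sendall((json.dumps(record) + "\n").encode("utf-8"))

    threading.Thread(target=serve_generator,
                     args=(bro_log_generator(),), daemon=True).start()

    # Spark side: the socket source (intended for testing) reads each
    # line into a single 'value' column, which from_json can unpack.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("gen-bridge").getOrCreate()
    schema = StructType([StructField("uid", StringType())])  # illustrative

    records = (spark.readStream
               .format("socket")
               .option("host", "localhost")
               .option("port", 9999)
               .load()
               .select(from_json(col("value"), schema).alias("r"))
               .select("r.*"))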
Re: Question about 'Structured Streaming'
Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to read Bro logs, rather than a Python library. This is likely to have much better performance since we can do all of the parsing on the JVM without having to flow it through an external Python process.

On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie wrote:

> Hi All,
>
> I've read the new information about Structured Streaming in Spark; it
> looks super great.
>
> Resources that I've looked at:
> - https://spark.apache.org/docs/latest/streaming-programming-guide.html
> - https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html
> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Python%20DataFrames%20API.html
>
> + YouTube videos from Spark Summit 2016/2017
>
> So, finally getting to my question:
>
> I have Python code that produces a Python generator... this is a great
> streaming approach within Python. I've used it for network packet
> processing and a bunch of other stuff. I'd love to simply hook up this
> generator (which yields Python dictionaries), along with a schema
> definition, to create an 'unbounded DataFrame' as discussed in
> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>
> Possible approaches:
> - Make a custom receiver in Python:
>   https://spark.apache.org/docs/latest/streaming-custom-receivers.html
> - Use Kafka (this is definitely possible and good, but overkill for my
>   use case)
> - Send data out a socket and use socketTextStream to pull it back in
>   (seems a bit silly to me)
> - Other???
>
> Since Python generators fit so naturally into streaming pipelines, I'd
> think it would be straightforward to 'couple' a Python generator into a
> Spark Structured Streaming pipeline.
>
> I've put together a small notebook just to give a concrete example
> (streaming Bro IDS network data):
> https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb
>
> Any thoughts/suggestions/pointers are greatly appreciated.
>
> -Brian
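To sketch what I mean (untested; the schema and path are placeholders, since each Bro log type has its own columns):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("bro-conn").getOrCreate()

    # Illustrative subset of conn.log columns; the real log has more.
    conn_schema = StructType([
        StructField("ts", DoubleType()),
        StructField("uid", StringType()),
        StructField("id_orig_h", StringType()),
        StructField("id_resp_h", StringType()),
    ])

    conn = (spark.readStream
            .schema(conn_schema)     # streaming file sources need an explicit schema
            .option("sep", "\t")     # Bro logs are tab-separated
            .option("comment", "#")  # skip the '#'-prefixed header lines
            .csv("/logs/bro/conn/"))

    # All parsing happens on the JVM; no external Python process involved.
    query = (conn.writeStream
             .format("memory")
             .queryName("conn_live")
             .start())

While the stream is running, spark.sql("SELECT * FROM conn_live") returns the rows ingested so far.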