Hi All,

I've read up on the new Structured Streaming support in Spark and it looks super great.
Resources that I've looked at:
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
- https://spark.apache.org/docs/latest/streaming-custom-receivers.html
- http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Python%20DataFrames%20API.html
- YouTube videos from Spark Summit 2016/2017

So, finally getting to my question: I have Python code that yields a Python generator. This is a great streaming approach within Python, and I've used it for network packet processing and a bunch of other things. I'd love to simply hook up this generator (which yields Python dictionaries) along with a schema definition to create an 'unbounded DataFrame', as discussed in https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html (a minimal sketch of the kind of generator I mean is in the P.S. below).

Possible approaches:
- Write a custom receiver in Python: https://spark.apache.org/docs/latest/streaming-custom-receivers.html
- Use Kafka (definitely possible and good, but overkill for my use case)
- Send the data out a socket and pull it back in as a stream (seems a bit silly to me; sketched in the P.P.S. below)
- Other???

Since Python generators fit so naturally into streaming pipelines, I'd think it would be straightforward to couple a generator into a Spark Structured Streaming pipeline. I've put together a small notebook to give a concrete example (streaming Bro IDS network data):
https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb

Any thoughts/suggestions/pointers are greatly appreciated.

-Brian
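P.S. For concreteness, here's a minimal sketch of the kind of generator I mean. The log format, field names, and tab-delimited parsing are simplified placeholders, not the real BroThon code:

import time

def bro_log_reader(log_path):
    """Follow a Bro log file and yield each entry as a Python dict."""
    with open(log_path) as log:
        while True:
            line = log.readline()
            if not line:              # no new data yet, wait and retry
                time.sleep(0.1)
                continue
            if line.startswith('#'):  # skip Bro header/metadata lines
                continue
            fields = line.rstrip('\n').split('\t')
            yield dict(zip(['ts', 'uid', 'proto'], fields))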
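P.P.S. And here's roughly what the socket workaround would look like, which is part of why it feels silly to me. I'm using the Structured Streaming 'socket' source here rather than the old DStream socketTextStream; it assumes Spark 2.1+ (for from_json), and the socket source is documented as testing-only with no fault-tolerance guarantees. The host/port and field names just match the placeholder generator above.

Producer side, sending each dict from the generator as one JSON line over TCP:

import json
import socket

def serve_generator(gen, host='localhost', port=9999):
    """Send each dict from the generator as one JSON line over TCP."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    conn, _ = server.accept()         # block until Spark connects
    for record in gen:
        conn.sendall((json.dumps(record) + '\n').encode('utf-8'))

serve_generator(bro_log_reader('conn.log'))

Spark side, reading the socket as an unbounded DataFrame and applying the schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('bro_stream').getOrCreate()

schema = StructType([StructField('ts', StringType()),
                     StructField('uid', StringType()),
                     StructField('proto', StringType())])

# The socket source yields an unbounded DataFrame with one string column, 'value'
lines = (spark.readStream
              .format('socket')
              .option('host', 'localhost')
              .option('port', 9999)
              .load())

# Parse each JSON line against the schema and flatten the resulting struct
records = lines.select(from_json(lines.value, schema).alias('r')).select('r.*')

query = records.writeStream.format('console').start()
query.awaitTermination()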