Hi All,

I've read the new information about Structured Streaming in Spark, and it looks
super great.

Resources that I've looked at:
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
- https://spark.apache.org/docs/latest/streaming-custom-receivers.html
- http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Python%20DataFrames%20API.html

+ YouTube videos from Spark Summit 2016/2017

So finally getting to my question:

I have Python code structured as generators, which is a great streaming
approach within Python. I've used it for network packet processing and a
bunch of other stuff. I'd love to simply hook up such a generator (one that
yields Python dictionaries), along with a schema definition, to create an
'unbounded DataFrame' as discussed in
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
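
Just to make that concrete, here's a toy version of the kind of generator I
mean (the fields are made up for illustration; my real generators are
unbounded):

    import time

    def packet_records():
        # Toy stand-in for my real generators: yields one dict per event.
        # The real versions loop forever over live network data.
        for i in range(5):
            yield {'ts': time.time(), 'src': '10.0.0.1',
                   'dst': '10.0.0.2', 'length': 60 + i}
            time.sleep(0.1)

    for record in packet_records():
        print(record)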

Possible approaches:
- Make a custom receiver in Python:
https://spark.apache.org/docs/latest/streaming-custom-receivers.html
- Use Kafka (this is definitely possible and good but overkill for my use
case)
- Send the data out over a socket and use socketTextStream to pull it back
in (seems a bit silly to me; see the sketch just after this list)
- Other???
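
For what it's worth, here's roughly what I think the socket option would
look like (an untested sketch: the host/port and schema fields are
placeholders, and my understanding is that the socket source is intended
for testing rather than production use):

    import json
    import socket

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, IntegerType)

    # Producer side (run in a separate process): push each dict from
    # the generator over TCP as one JSON object per line.
    def serve_records(generator, host='localhost', port=9999):
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()
        for record in generator:
            conn.sendall((json.dumps(record) + '\n').encode('utf-8'))

    # Consumer side: the socket source yields a DataFrame with a single
    # string column 'value'; parse each line against a schema.
    spark = SparkSession.builder.appName('generator-to-stream').getOrCreate()

    schema = StructType([
        StructField('ts', DoubleType()),
        StructField('src', StringType()),
        StructField('dst', StringType()),
        StructField('length', IntegerType()),
    ])

    lines = (spark.readStream
                  .format('socket')
                  .option('host', 'localhost')
                  .option('port', 9999)
                  .load())

    records = lines.select(from_json(lines.value, schema).alias('r')).select('r.*')
    query = records.writeStream.format('console').start()
    query.awaitTermination()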

Since Python generators fit so naturally into streaming pipelines, I'd
think it would be straightforward to couple a generator into a Spark
Structured Streaming pipeline.
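
To be explicit about what I'm wishing for (this is NOT a real Spark API,
just the shape of the coupling I have in mind):

    # Hypothetical -- no such method exists in Spark as far as I can tell.
    # Hand Spark a generator plus a schema, get back an unbounded DataFrame.
    stream_df = spark.readStream.fromGenerator(packet_records(), schema=schema)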

I've put together a small notebook just to give a concrete example
(streaming Bro IDS network data):
https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb

Any thoughts/suggestions/pointers are greatly appreciated.

-Brian
