I have created a PySpark Streaming application that uses Spark ML to
classify flight delays into three categories: on-time, slightly late, and
very late. After an hour or so, something times out and the whole thing
crashes.

The code and error are on a gist here:
https://gist.github.com/rjurney/17d471bc98fd1ec925c37d141017640d

While I am interested in why I am getting an exception, I am more
interested in understanding what the correct deployment model is, because
long-running processes will inevitably hit new and varied errors and
exceptions. Right now, with what I've built, Spark is a highly dependable
distributed system, but in streaming mode the entire application hinges on
a single Python PID staying alive. This can't be how apps are deployed in
the wild, because it will never be very reliable, right? But I don't see
anything about this in the docs, so I am confused.

Note that I use this to run the app; maybe that is the problem?

ssc.start()
ssc.awaitTermination()


What is the actual deployment model for Spark Streaming? All I know to do
right now is to restart the PID. I'm new to Spark, and the docs don't
really explain this (that I can see).
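Is the intended model something like running the driver itself inside the
cluster, so the cluster manager can restart it when it dies? A hypothetical
sketch of what I mean (assuming YARN; the script name and the retry count
are placeholders, not my actual setup):

```shell
# Hypothetical: submit the streaming driver in cluster mode so it runs
# inside the cluster rather than as a bare Python PID on my machine,
# and allow YARN to re-attempt the application if the driver fails.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  my_streaming_app.py
```

If that is the pattern, I assume it would also need checkpointing so a
restarted driver can recover its state, but I'd love confirmation from
someone who runs streaming apps in production.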

Thanks!
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
